87 lines
5.6 KiB
Plaintext
87 lines
5.6 KiB
Plaintext
# Program {#sec-program-intro .unnumbered}
|
|
|
|
```{r}
|
|
#| results: "asis"
|
|
#| echo: false
|
|
source("_common.R")
|
|
```
|
|
|
|
In this part of the book, you'll improve your programming skills.
|
|
Programming is a cross-cutting skill needed for all data science work: you must use a computer to do data science; you cannot do it in your head, or with pencil and paper.
|
|
|
|
```{r}
|
|
#| label: fig-ds-program
|
|
#| echo: false
|
|
#| fig-cap: >
|
|
#| Programming is the water in which all other components of the data
|
|
#| science process swims.
|
|
#| fig-alt: >
|
|
#| Our model of the data science process with program (import, tidy,
|
|
#| transform, visualize, model, and communicate, i.e. everything)
|
|
#| highlighted in blue.
|
|
#| out.width: NULL
|
|
|
|
knitr::include_graphics("diagrams/data-science/program.png", dpi = 270)
|
|
```
|
|
|
|
Programming produces code, and code is a tool of communication.
|
|
Obviously code tells the computer what you want it to do.
|
|
But it also communicates meaning to other humans.
|
|
Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative.
|
|
Even if you're not working with other people, you'll definitely be working with future-you!
|
|
Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did.
|
|
That means getting better at programming also involves getting better at communicating.
|
|
Over time, you want your code to become not just easier to write, but easier for others to read.
|
|
|
|
Writing code is similar in many ways to writing prose.
|
|
One parallel which we find particularly useful is that in both cases rewriting is the key to clarity.
|
|
The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times.
|
|
After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done.
|
|
If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did.
|
|
But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run.
|
|
(But the more you rewrite your functions the more likely your first attempt will be clear.)
|
|
|
|
In the following three chapters, you'll learn skills that will allow you to both tackle new programs and to solve existing problems with greater clarity and ease:
|
|
|
|
1. Copy-and-paste is a powerful tool, but you should avoid doing it more than twice.
|
|
Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies.
|
|
Instead, in @sec-functions, you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.
|
|
|
|
2. As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by vectors, which we discuss in @sec-vectors. You must master the four common atomic vectors, the three important S3 classes built on top of them, and understand the mysteries of the list and data frame.
|
|
|
|
3. Functions extract out repeated code, but you often need to repeat the same actions on different inputs.
|
|
You need tools for **iteration** that let you do similar things again and again.
|
|
These tools include for loops and functional programming, which you'll learn about in @sec-iteration.
|
|
|
|
A common theme throughout these chapters is the idea of reducing duplication in your code.
|
|
Reducing code duplication has three main benefits:
|
|
|
|
1. It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.
|
|
|
|
2. It's easier to respond to changes in requirements.
|
|
As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.
|
|
|
|
3. You're likely to have fewer bugs because each line of code is used in more places.
|
|
|
|
One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extract them out into independent pieces that can be easily reused and updated.
|
|
Another tool for reducing duplication is iteration, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
|
|
|
|
## Learning more
|
|
|
|
The goal of these chapters is to teach you the minimum about programming that you need to practice data science.
|
|
Once you have mastered the material in this book, we strongly believe you should continue to invest in your programming skills.
|
|
Learning more about programming is a long-term investment: it won't pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.
|
|
|
|
To learn more you need to study R as a programming language, not just an interactive environment for data science.
|
|
We have written two books that will help you do so:
|
|
|
|
- [*Hands on Programming with R*](https://rstudio-education.github.io/hopr/), by Garrett Grolemund.
|
|
This is an introduction to R as a programming language and is a great place to start if R is your first programming language.
|
|
It covers similar material to these chapters, but with a different style and different motivation examples (based in the casino).
|
|
It's a useful complement if you find that these four chapters go by too quickly.
|
|
|
|
- [*Advanced R*](https://adv-r.hadley.nz/) by Hadley Wickham.
|
|
This dives into the details of R the programming language.
|
|
This is a great place to start if you have existing programming experience.
|
|
It's also a great next step once you've internalized the ideas in these chapters.
|