Workflow chapter adjustments

* Combine scripts + projects
* New getting help chapter
Hadley Wickham 2022-02-21 15:46:01 -06:00
parent 155aaf0593
commit 200f0fb725
5 changed files with 207 additions and 207 deletions


@ -17,7 +17,7 @@ rmd_files: [
"data-import.Rmd",
"workflow-scripts.Rmd",
"EDA.Rmd",
"workflow-projects.Rmd",
"workflow-help.Rmd",
"transform.Rmd",
"tibble.Rmd",


@ -219,60 +219,6 @@ Throughout the book we use a consistent set of conventions to refer to code:
`nycflights13::flights`.
This is also valid R code.
## Getting help and learning more
This book is not an island; there is no single resource that will allow you to master R.
As you start to apply the techniques described in this book to your own data, you will soon find questions that we do not answer.
This section describes a few tips on how to get help and how to keep learning.
If you get stuck, start with Google.
Typically adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available.
Google is particularly useful for error messages.
If you get an error message and you have no idea what it means, try googling it!
Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web.
(If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
If Google doesn't help, try [Stack Overflow](http://stackoverflow.com).
Start by spending a little time searching for an existing answer, including `[R]` to restrict your search to questions and answers that use R.
If you don't find anything useful, prepare a minimal reproducible example or **reprex**.
A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it.
There are three things you need to include to make your example reproducible: required packages, data, and code.
1. **Packages** should be loaded at the top of the script, so it's easy to see which ones the example needs.
This is a good time to check that you're using the latest version of each package; it's possible you've discovered a bug that's been fixed since you installed the package.
For packages in the tidyverse, the easiest way to check is to run `tidyverse_update()`.
2. The easiest way to include **data** in a question is to use `dput()` to generate the R code to recreate it.
For example, to recreate the `mtcars` dataset in R, I'd perform the following steps:
1. Run `dput(mtcars)` in R
2. Copy the output
3. In my reproducible script, type `mtcars <-` then paste.
Try to find the smallest subset of your data that still reveals the problem.
3. Spend a little bit of time ensuring that your **code** is easy for others to read:
- Make sure you've used spaces and your variable names are concise, yet informative.
- Use comments to indicate where your problem lies.
- Do your best to remove everything that is not related to the problem.\
The shorter your code is, the easier it is to understand, and the easier it is to fix.
Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.
You should also spend some time preparing yourself to solve problems before they occur.
Investing a little time in learning R each day will pay off handsomely in the long run.
One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org).
This is where we post announcements about new packages, new IDE features, and in-person courses.
You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
To keep up with the R community more broadly, we recommend reading <http://www.r-bloggers.com>: it aggregates over 500 blogs about R from around the world.
If you're an active Twitter user, follow the [`#rstats`](https://twitter.com/search?q=%23rstats) hashtag.
Twitter is one of the key tools that Hadley uses to keep up with new developments in the community.
## Acknowledgements
This book isn't just the product of Hadley, Mine, and Garrett, but is the result of many conversations (in person and online) that we've had with the many people in the R community.

workflow-help.Rmd Normal file

@ -0,0 +1,53 @@
# Workflow: Getting help
This book is not an island; there is no single resource that will allow you to master R.
As you start to apply the techniques described in this book to your own data, you will soon find questions that we do not answer.
This section describes a few tips on how to get help and how to keep learning.
If you get stuck, start with Google.
Typically adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available.
Google is particularly useful for error messages.
If you get an error message and you have no idea what it means, try googling it!
Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web.
(If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
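As a rough sketch of the tip above, the following switches messages to English and then captures an error message as text so it can be pasted into a search. The failing expression `1 + "a"` and the variable name `msg` are illustrative assumptions, not examples from the book.

```r
# Ask R to produce messages in English for the rest of this session.
Sys.setenv(LANGUAGE = "en")

# Trigger an error on purpose and keep its message instead of stopping.
# `1 + "a"` is only an illustrative failure.
msg <- tryCatch(
  1 + "a",
  error = function(e) conditionMessage(e)
)

# Print the text you would paste into a search engine.
print(msg)
```

Capturing the message with `tryCatch()` is optional; copying it straight from the console works just as well.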
If Google doesn't help, try [Stack Overflow](http://stackoverflow.com).
Start by spending a little time searching for an existing answer, including `[R]` to restrict your search to questions and answers that use R.
If you don't find anything useful, prepare a minimal reproducible example or **reprex**.
A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it.
There are three things you need to include to make your example reproducible: required packages, data, and code.
1. **Packages** should be loaded at the top of the script, so it's easy to see which ones the example needs.
This is a good time to check that you're using the latest version of each package; it's possible you've discovered a bug that's been fixed since you installed the package.
For packages in the tidyverse, the easiest way to check is to run `tidyverse_update()`.
2. The easiest way to include **data** in a question is to use `dput()` to generate the R code to recreate it.
For example, to recreate the `mtcars` dataset in R, I'd perform the following steps:
1. Run `dput(mtcars)` in R
2. Copy the output
3. In my reproducible script, type `mtcars <-` then paste.
Try to find the smallest subset of your data that still reveals the problem.
3. Spend a little bit of time ensuring that your **code** is easy for others to read:
- Make sure you've used spaces and your variable names are concise, yet informative.
- Use comments to indicate where your problem lies.
- Do your best to remove everything that is not related to the problem.\
The shorter your code is, the easier it is to understand, and the easier it is to fix.
Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.
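The data step above can be sketched roughly as follows. The subset of `mtcars` and the names `small`, `code`, and `rebuilt` are assumptions chosen for illustration. In a real reprex you would copy the `dput()` output by hand, whereas here the copy-and-paste round trip is simulated in code.

```r
# Take the smallest slice of the data that still shows the problem.
small <- head(mtcars[, c("mpg", "cyl")], 3)

# dput() prints R code that recreates the object; this is what you copy.
dput(small)

# Simulate pasting that output after `rebuilt <-` in a fresh script
# by capturing the printed code and evaluating it again.
code <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = code))

# Check that the rebuilt data matches the original slice.
all.equal(small, rebuilt)
```

If the comparison reports a difference, the pasted data is not a faithful copy and the example may not reproduce the problem for others.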
You should also spend some time preparing yourself to solve problems before they occur.
Investing a little time in learning R each day will pay off handsomely in the long run.
One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org).
This is where we post announcements about new packages, new IDE features, and in-person courses.
You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
To keep up with the R community more broadly, we recommend reading <http://www.r-bloggers.com>: it aggregates over 500 blogs about R from around the world.
If you're an active Twitter user, follow the [`#rstats`](https://twitter.com/search?q=%23rstats) hashtag.
Twitter is one of the key tools that Hadley uses to keep up with new developments in the community.


@ -1,151 +0,0 @@
# Workflow: projects
One day you will need to quit R, go do something else and return to your analysis the next day.
One day you will be working on multiple analyses simultaneously that all use R and you want to keep them separate.
One day you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.
To handle these real life situations, you need to make two decisions:
1. What about your analysis is "real", i.e. what will you save as your lasting record of what happened?
2. Where does your analysis "live"?
## What is real?
As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real".
However, in the long run, you'll be much better off if you consider your R scripts as "real".
With your R scripts (and your data files), you can recreate the environment.
It's much harder to recreate your R scripts from your environment!
You'll either have to retype a lot of code from memory (making mistakes all the way) or you'll have to carefully mine your R history.
To foster this behaviour, I highly recommend that you instruct RStudio not to preserve your workspace between sessions:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
This will cause you some short-term pain, because now when you restart RStudio it will not remember the results of the code that you ran last time.
But this short-term pain will save you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.
There is a great pair of keyboard shortcuts that will work together to make sure you've captured the important parts of your code in the editor:
1. Press Cmd/Ctrl + Shift + F10 to restart the R session.
2. Press Cmd/Ctrl + Shift + S to rerun the current script.
I use this pattern hundreds of times a week.
## Where does your analysis live?
R has a powerful notion of the **working directory**.
This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.
RStudio shows your current working directory at the top of the console:
```{r, echo = FALSE, out.width = "50%"}
knitr::include_graphics("screenshots/rstudio-wd.png")
```
And you can print this out in R code by running `getwd()`:
```{r eval = FALSE}
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
As a beginning R user, it's OK to let your home directory, documents directory, or any other weird directory on your computer be R's working directory.
But you're six chapters into this book, and you're no longer a rank beginner.
Very soon now you should evolve to organising your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.
**I do not recommend it**, but you can also set the working directory from within R:
```{r eval = FALSE}
setwd("/path/to/my/CoolProject")
```
But you should never do this because there's a better way; a way that also puts you on the path to managing your R work like an expert.
## Paths and directories
Paths and directories are a little complicated because there are two basic styles of paths: Mac/Linux and Windows.
There are three chief ways in which they differ:
1. The most important difference is how you separate the components of the path.
Mac and Linux use slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslashes (e.g. `plots\diamonds.pdf`).
R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes.
2. Absolute paths (i.e. paths that point to the same place regardless of your working directory) look different.
In Windows they start with a drive letter (e.g. `C:`) or two backslashes (e.g. `\\servername`) and in Mac/Linux they start with a slash "/" (e.g. `/users/hadley`).
You should **never** use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
3. The last minor difference is the place that `~` points to.
`~` is a convenient shortcut to your home directory.
Windows doesn't really have the notion of a home directory, so it instead points to your documents directory.
## RStudio projects
R experts keep all the files associated with a project together --- input data, R scripts, analytical results, figures.
This is such a wise and common practice that RStudio has built-in support for this via **projects**.
Let's make a project for you to use while you're working through the rest of this book.
Click File \> New Project, then:
```{r, echo = FALSE, out.width = "50%"}
knitr::include_graphics("screenshots/rstudio-project-1.png")
knitr::include_graphics("screenshots/rstudio-project-2.png")
knitr::include_graphics("screenshots/rstudio-project-3.png")
```
Call your project `r4ds` and think carefully about which *subdirectory* you put the project in.
If you don't store it somewhere sensible, it will be hard to find it in the future!
Once this process is complete, you'll get a new RStudio project just for this book.
Check that the "home" directory of your project is the current working directory:
```{r eval = FALSE}
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
Whenever you refer to a file with a relative path it will look for it here.
Now enter the following commands in the script editor, and save the file, calling it "diamonds.R".
Next, run the complete script, which will save a PDF and a CSV file into your project directory.
Don't worry about the details; you'll learn them later in the book.
```{r toy-line, eval = FALSE}
library(tidyverse)
ggplot(diamonds, aes(carat, price)) +
  geom_hex()
ggsave("diamonds.pdf")
write_csv(diamonds, "diamonds.csv")
```
Quit RStudio.
Inspect the folder associated with your project --- notice the `.Rproj` file.
Double-click that file to re-open the project.
Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open.
Because you followed my instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.
In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but *also the script that created it* (`diamonds.R`).
This is a huge win!
One day you will want to remake a figure or just understand where it came from.
If you rigorously save figures to files **with R code** and never with the mouse or the clipboard, you will be able to reproduce old work with ease!
## Summary
In summary, RStudio projects give you a solid workflow that will serve you well in the future:
- Create an RStudio project for each data analysis project.
- Keep data files there; we'll talk about loading them into R in [data import].
- Keep scripts there; edit them, run them in bits or as a whole.
- Save your outputs (plots and cleaned data) there.
- Only ever use relative paths, not absolute paths.
Everything you need is in one place, and cleanly separated from all the other projects that you are working on.


@ -1,4 +1,4 @@
# Workflow: scripts
# Workflow: scripts and projects
So far you've been using the console to run code.
That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and dplyr pipes.
@ -70,6 +70,158 @@ RStudio will also let you know about potential problems:
knitr::include_graphics("screenshots/rstudio-diagnostic-warn.png")
```
# Workflow: projects
One day you will need to quit R, go do something else and return to your analysis the next day.
One day you will be working on multiple analyses simultaneously that all use R and you want to keep them separate.
One day you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.
To handle these real life situations, you need to make two decisions:
1. What about your analysis is "real", i.e. what will you save as your lasting record of what happened?
2. Where does your analysis "live"?
## What is real?
As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real".
However, in the long run, you'll be much better off if you consider your R scripts as "real".
With your R scripts (and your data files), you can recreate the environment.
It's much harder to recreate your R scripts from your environment!
You'll either have to retype a lot of code from memory (making mistakes all the way) or you'll have to carefully mine your R history.
To foster this behaviour, I highly recommend that you instruct RStudio not to preserve your workspace between sessions:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
This will cause you some short-term pain, because now when you restart RStudio it will not remember the results of the code that you ran last time.
But this short-term pain will save you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.
There is a great pair of keyboard shortcuts that will work together to make sure you've captured the important parts of your code in the editor:
1. Press Cmd/Ctrl + Shift + F10 to restart the R session.
2. Press Cmd/Ctrl + Shift + S to rerun the current script.
I use this pattern hundreds of times a week.
## Where does your analysis live?
R has a powerful notion of the **working directory**.
This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.
RStudio shows your current working directory at the top of the console:
```{r, echo = FALSE, out.width = "50%"}
knitr::include_graphics("screenshots/rstudio-wd.png")
```
And you can print this out in R code by running `getwd()`:
```{r eval = FALSE}
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
As a beginning R user, it's OK to let your home directory, documents directory, or any other weird directory on your computer be R's working directory.
But you're six chapters into this book, and you're no longer a rank beginner.
Very soon now you should evolve to organising your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.
**I do not recommend it**, but you can also set the working directory from within R:
```{r eval = FALSE}
setwd("/path/to/my/CoolProject")
```
But you should never do this because there's a better way; a way that also puts you on the path to managing your R work like an expert.
## Paths and directories
Paths and directories are a little complicated because there are two basic styles of paths: Mac/Linux and Windows.
There are three chief ways in which they differ:
1. The most important difference is how you separate the components of the path.
Mac and Linux use slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslashes (e.g. `plots\diamonds.pdf`).
R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes.
2. Absolute paths (i.e. paths that point to the same place regardless of your working directory) look different.
In Windows they start with a drive letter (e.g. `C:`) or two backslashes (e.g. `\\servername`) and in Mac/Linux they start with a slash "/" (e.g. `/users/hadley`).
You should **never** use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
3. The last minor difference is the place that `~` points to.
`~` is a convenient shortcut to your home directory.
Windows doesn't really have the notion of a home directory, so it instead points to your documents directory.
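The three differences above can be explored with a few base R calls. This is a minimal sketch: the variable names `forward` and `backward` are illustrative, and the result of `path.expand("~")` depends on the machine it runs on.

```r
# Forward slashes work on every platform, including Windows.
forward <- "plots/diamonds.pdf"

# Inside an R string a backslash must be doubled to get a single one.
backward <- "plots\\diamonds.pdf"
nchar(backward)       # the doubled backslash counts as one character
writeLines(backward)  # shows the path as the operating system sees it

# file.path() joins components with forward slashes on all platforms.
file.path("plots", "diamonds.pdf")

# path.expand() shows where `~` points on the current machine.
path.expand("~")
```

Building paths with `file.path()` avoids typing separators by hand, which is one way to sidestep the backslash problem entirely.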
## RStudio projects
R experts keep all the files associated with a project together --- input data, R scripts, analytical results, figures.
This is such a wise and common practice that RStudio has built-in support for this via **projects**.
Let's make a project for you to use while you're working through the rest of this book.
Click File \> New Project, then:
```{r, echo = FALSE, out.width = "50%"}
knitr::include_graphics("screenshots/rstudio-project-1.png")
knitr::include_graphics("screenshots/rstudio-project-2.png")
knitr::include_graphics("screenshots/rstudio-project-3.png")
```
Call your project `r4ds` and think carefully about which *subdirectory* you put the project in.
If you don't store it somewhere sensible, it will be hard to find it in the future!
Once this process is complete, you'll get a new RStudio project just for this book.
Check that the "home" directory of your project is the current working directory:
```{r eval = FALSE}
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
Whenever you refer to a file with a relative path it will look for it here.
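To make that lookup rule concrete, here is a small sketch of how a relative path relates to the working directory. The file name `diamonds.csv` matches the script in this chapter, while the names `wd`, `relative`, and `resolved` are assumptions used only for illustration.

```r
# The working directory is the starting point for every relative path.
wd <- getwd()

# A relative path such as "diamonds.csv" is looked up inside `wd`.
relative <- "diamonds.csv"
resolved <- file.path(wd, relative)
resolved

# file.exists() follows the same rule, so this asks whether the file
# is present in the working directory right now.
file.exists(relative)
```

The full path is built here only to show where R looks; in your scripts, keep using the relative form.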
Now enter the following commands in the script editor, and save the file, calling it "diamonds.R".
Next, run the complete script, which will save a PDF and a CSV file into your project directory.
Don't worry about the details; you'll learn them later in the book.
```{r toy-line, eval = FALSE}
library(tidyverse)
ggplot(diamonds, aes(carat, price)) +
  geom_hex()
ggsave("diamonds.pdf")
write_csv(diamonds, "diamonds.csv")
```
Quit RStudio.
Inspect the folder associated with your project --- notice the `.Rproj` file.
Double-click that file to re-open the project.
Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open.
Because you followed my instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.
In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but *also the script that created it* (`diamonds.R`).
This is a huge win!
One day you will want to remake a figure or just understand where it came from.
If you rigorously save figures to files **with R code** and never with the mouse or the clipboard, you will be able to reproduce old work with ease!
## Summary
In summary, RStudio projects give you a solid workflow that will serve you well in the future:
- Create an RStudio project for each data analysis project.
- Keep data files there; we'll talk about loading them into R in \[data import\].
- Keep scripts there; edit them, run them in bits or as a whole.
- Save your outputs (plots and cleaned data) there.
- Only ever use relative paths, not absolute paths.
Everything you need is in one place, and cleanly separated from all the other projects that you are working on.
## Exercises
1. Go to the RStudio Tips Twitter account, <https://twitter.com/rstudiotips>, and find one tip that looks interesting.