# Workflow: scripts and projects {#workflow-scripts-projects}

```{r, results = "asis", echo = FALSE}
status("restructuring")
```

So far, you have used the console to run code.
That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and dplyr pipes.
To give yourself more room to work, it's a great idea to use the script editor.
Open it up either by clicking the File menu and selecting New File, then R Script, or by using the keyboard shortcut Cmd/Ctrl + Shift + N.
Now you'll see four panes:

```{r}
#| echo: false
#| out.width: "75%"
#| fig-alt: >
#|   RStudio IDE with Editor, Console, and Output highlighted.

knitr::include_graphics("diagrams/rstudio-editor.png")
```

The script editor is a great place to put code you care about.
Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor.
RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open.
Nevertheless, it's a good idea to save your scripts regularly and to back them up.

## Naming files

Saving your code in a script requires creating a new file that you will need to name.
It might be tempting to name this file `code.R` or `myscript.R`, but you should think a bit harder before choosing a name for your file.
Three important principles for file naming are as follows:

1. File names should be machine readable: Avoid spaces, punctuation, symbols, and accented characters. Do not rely on case sensitivity to distinguish files. Make deliberate use of delimiters.
2. File names should be human readable: Use file names that describe what is in the file.
3. File names should play well with default ordering: Start file names with numbers that allow them to be sorted in the order they get used.

Suppose you have the following files in a project folder.

    run-first.R
    alternative model.R
    code for exploratory analysis.R
    finalreport.qmd
    FinalReport.qmd
    fig 1.png
    Figure_02.png
    model_first_try.R
    temp.txt

There are a variety of problems here: the files are misordered, file names contain spaces, there are two files with basically the same name but different capitalization (`finalreport` vs. `FinalReport`), and some file names don't reflect their contents (`run-first` and `temp`).

Below is an alternative way of naming and organizing the same set of files.

    01-load-data.R
    02-exploratory-analysis.R
    03-model-approach-1.R
    04-model-approach-2.R
    fig-01.png
    fig-02.png
    notes-on-report-draft.txt
    report-2022-03-20.qmd
    report-2022-04-02.qmd

Numbering and descriptive names that are similarly formatted allow for a more useful organization of the R scripts.
Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and `temp` is renamed to `notes-on-report-draft` to better describe its contents.
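
If you want a quick check of the first principle, a short regular expression goes a long way.
The sketch below encodes just one possible convention (lowercase letters, digits, hyphens, and underscores, followed by a single extension); `is_machine_readable()` is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical helper: TRUE when a name uses only lowercase ASCII letters,
# digits, hyphens, and underscores, followed by one extension.
is_machine_readable <- function(file_names) {
  grepl("^[a-z0-9_-]+\\.[A-Za-z0-9]+$", file_names, perl = TRUE)
}

file_names <- c(
  "alternative model.R",
  "FinalReport.qmd",
  "fig 1.png",
  "01-load-data.R",
  "report-2022-03-20.qmd"
)

is_machine_readable(file_names)
#> [1] FALSE FALSE FALSE  TRUE  TRUE

# Names that differ only by case can collide on case-insensitive file systems.
case_collision <- anyDuplicated(tolower(c("finalreport.qmd", "FinalReport.qmd"))) > 0
case_collision
#> [1] TRUE
```

You might loosen the pattern if your team prefers a different convention; the point is only that a consistent rule is easy to check by machine.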

## Running code

The script editor is also a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations.
The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter.
This executes the current R expression in the console.
For example, take the code below.
If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates `not_cancelled`.
It will also move the cursor to the next statement (beginning with `not_cancelled |>`).
That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.

```{r}
#| eval: false

library(dplyr)
library(nycflights13)

not_cancelled <- flights |>
  filter(!is.na(dep_delay)█, !is.na(arr_delay))

not_cancelled |>
  group_by(year, month, day) |>
  summarize(mean = mean(dep_delay))
```

Instead of running your code expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S.
Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script.

I recommend that you always start your script with the packages that you need.
That way, if you share your code with others, they can easily see which packages they need to install.
Note, however, that you should never include `install.packages()` or `setwd()` in a script that you share.
It's very antisocial to change settings on someone else's computer!
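
One way to be helpful without installing anything on a reader's machine is to check for the packages and stop with a clear message.
This is only a minimal sketch using base R's `requireNamespace()`; `missing_packages()` and the package list are hypothetical, so substitute the packages your script really uses.

```r
# Hypothetical helper: returns the entries of `pkgs` that are not installed.
# It only checks; it never installs anything.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Placeholder list: these two ship with R, so the check passes everywhere.
needed <- c("stats", "utils")

not_installed <- missing_packages(needed)
if (length(not_installed) > 0) {
  stop(
    "Please install these packages first: ",
    paste(not_installed, collapse = ", "),
    call. = FALSE
  )
}
```

Plain `library()` calls at the top of a script already fail when a package is missing; a check like this just reports every missing package at once and leaves the decision to install with the reader.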

When working through future chapters, I highly recommend starting in the script editor and practicing your keyboard shortcuts.
Over time, sending code to the console in this way will become so natural that you won't even think about it.

## RStudio diagnostics

The script editor will also highlight syntax errors with a red squiggly line and a cross in the sidebar:

```{r}
#| echo: false
#| out.width: NULL
#| fig-alt: >
#|   Script editor with the script `x y <- 10`. A red X indicates that there is
#|   a syntax error. The syntax error is also highlighted with a red squiggly line.

knitr::include_graphics("screenshots/rstudio-diagnostic.png")
```

Hover over the cross to see what the problem is:

```{r}
#| echo: false
#| out.width: NULL
#| fig-alt: >
#|   Script editor with the script `x y <- 10`. A red X indicates that there is
#|   a syntax error. The syntax error is also highlighted with a red squiggly line.
#|   Hovering over the X shows a text box with the text 'unexpected token y' and
#|   'unexpected token <-'.

knitr::include_graphics("screenshots/rstudio-diagnostic-tip.png")
```

RStudio will also let you know about potential problems:

```{r}
#| echo: false
#| out.width: NULL
#| fig-alt: >
#|   Script editor with the script `3 == NA`. A yellow exclamation mark
#|   indicates that there may be a potential problem. Hovering over the
#|   exclamation mark shows a text box with the text 'use is.na to check
#|   whether expression evaluates to NA'.

knitr::include_graphics("screenshots/rstudio-diagnostic-warn.png")
```
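
To see why RStudio flags `3 == NA`, it helps to run the comparison yourself.
The small vector `x` below is made up for illustration.

```r
x <- c(3, NA, 7)

# Comparing with NA using == never gives TRUE or FALSE: the answer is unknown.
3 == NA
#> [1] NA
x == NA
#> [1] NA NA NA

# is.na() asks the question you actually meant.
is.na(x)
#> [1] FALSE  TRUE FALSE
```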

## Workflow: projects

One day, you will need to quit R, go do something else, and return to your analysis later.
One day, you will be working on multiple analyses simultaneously that all use R and you want to keep them separate.
One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.
To handle these real-life situations, you need to make two decisions:

1. What about your analysis is "real", i.e. what will you save as your lasting record of what happened?

2. Where does your analysis "live"?

## What is real?

As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real".
However, in the long run, you'll be much better off if you consider your R scripts as "real".

With your R scripts (and your data files), you can recreate the environment.
It's much harder to recreate your R scripts from your environment!
You'll either have to retype a lot of code from memory (inevitably, making mistakes along the way) or you'll have to carefully mine your R history.

To encourage this behavior, I highly recommend that you instruct RStudio not to preserve your workspace between sessions:

```{r}
#| echo: false
#| out.width: "75%"
#| fig-alt: >
#|   RStudio preferences window where the option 'Restore .RData into workspace
#|   at startup' is not checked. Also, the option 'Save workspace to .RData
#|   on exit' is set to 'Never'.

knitr::include_graphics("screenshots/rstudio-workspace.png")
```

This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the results of the code that you ran last time.
But this short-term pain will save you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.

There is a great pair of keyboard shortcuts that will work together to make sure you've captured the important parts of your code in the editor:

1. Press Cmd/Ctrl + Shift + F10 to restart R.
2. Press Cmd/Ctrl + Shift + S to rerun the current script.

I use this pattern hundreds of times a week.

## Where does your analysis live?

R has a powerful notion of the **working directory**.
This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.
RStudio shows your current working directory at the top of the console:

```{r}
#| echo: false
#| out.width: "50%"
#| fig-alt: >
#|   The Console tab shows the current working directory as
#|   '~/Documents/r4ds/r4ds'.

knitr::include_graphics("screenshots/rstudio-wd.png")
```

And you can print this out in R code by running `getwd()`:

```{r}
#| eval: false

getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```

As a beginning R user, it's OK to let your home directory, documents directory, or any other weird directory on your computer be R's working directory.
But you're six chapters into this book, and you're no longer a rank beginner.
Very soon now you should evolve to organizing your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.

**I do not recommend it**, but you can also set the working directory from within R:

```{r}
#| eval: false

setwd("/path/to/my/CoolProject")
```

But you should never do this because there's a better way; a way that also puts you on the path to managing your R work like an expert.

## Paths and directories

Paths and directories are a little complicated because there are two basic styles of paths: Mac/Linux and Windows.
There are three chief ways in which they differ:

1. The most important difference is how you separate the components of the path.
    Mac and Linux use slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslashes (e.g. `plots\diamonds.pdf`).
    R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
    That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes.

2. Absolute paths (i.e. paths that point to the same place regardless of your working directory) look different.
    In Windows they start with a drive letter (e.g. `C:`) or two backslashes (e.g. `\\servername`) and in Mac/Linux they start with a slash "/" (e.g. `/users/hadley`).
    You should **never** use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.

3. The last minor difference is the place that `~` points to.
    `~` is a convenient shortcut to your home directory.
    Windows doesn't really have the notion of a home directory, so it instead points to your documents directory.
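
These differences are easier to see in code.
The sketch below uses only base R; `plots/diamonds.pdf` is the same illustrative path as above, and no such file needs to exist for the code to run.

```r
# file.path() joins components with forward slashes on every platform.
plot_path <- file.path("plots", "diamonds.pdf")
plot_path
#> [1] "plots/diamonds.pdf"

# Inside an R string, a backslash must be doubled to get one real backslash.
windows_style <- "plots\\diamonds.pdf"
nchar(windows_style)
#> [1] 18
writeLines(windows_style)
#> plots\diamonds.pdf

# `~` expands to your home directory (your documents directory on Windows).
# The result depends on your machine, so no output is shown here.
path.expand("~")
```

Building paths with `file.path()` and forward slashes keeps a script portable, which is one more reason to avoid typing backslashes at all.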

## RStudio projects

R experts keep all the files associated with a given project together: input data, R scripts, analytical results, and figures.
This is such a wise and common practice that RStudio has built-in support for this via **projects**.

Let's make a project for you to use while you're working through the rest of this book.
Click File \> New Project, then:

```{r}
#| echo: false
#| out.width: "50%"
#| fig-alt: >
#|   There are three screenshots of the New Project menu. In the first screenshot,
#|   the `Create Project` window is shown and 'New Directory' is selected.
#|   In the second screenshot, the `Project Type` window is shown and
#|   'Empty Project' is selected. In the third screenshot, the 'Create New Project'
#|   window is shown and the directory name is given as 'r4ds' and the project
#|   is being created as a subdirectory of the Desktop.

knitr::include_graphics("screenshots/rstudio-project-1.png")
knitr::include_graphics("screenshots/rstudio-project-2.png")
knitr::include_graphics("screenshots/rstudio-project-3.png")
```

Call your project `r4ds` and think carefully about which *subdirectory* you put the project in.
If you don't store it somewhere sensible, it will be hard to find it in the future!

Once this process is complete, you'll get a new RStudio project just for this book.
Check that the "home" directory of your project is the current working directory:

```{r}
#| eval: false

getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```

Whenever you refer to a file using a relative path, R will look for it here.

Now enter the following commands in the script editor, and save the file, calling it "diamonds.R".
Next, run the complete script, which will save a PDF and a CSV file into your project directory.
Don't worry about the details; you'll learn them later in the book.

```{r}
#| label: toy-line
#| eval: false

library(tidyverse)

ggplot(diamonds, aes(carat, price)) +
  geom_hex()
ggsave("diamonds.pdf")

write_csv(diamonds, "diamonds.csv")
```

Quit RStudio.
Inspect the folder associated with your project and notice the `.Rproj` file.
Double-click that file to re-open the project.
Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open.
Because you followed my instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.

In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but *also the script that created it* (`diamonds.R`).
This is a huge win!
One day, you will want to remake a figure or just understand where it came from.
If you rigorously save figures to files **with R code** and never with the mouse or the clipboard, you will be able to reproduce old work with ease!

## Summary

In summary, RStudio projects give you a solid workflow that will serve you well in the future:

- Create an RStudio project for each data analysis project.

- Keep data files there; we'll talk about loading them into R in Chapter \@ref(data-import).

- Keep scripts there; edit them, run them in bits or as a whole.

- Save your outputs (plots and cleaned data) there.

- Only ever use relative paths, not absolute paths.

Everything you need is in one place and cleanly separated from all the other projects that you are working on.

## Exercises

1. Go to the RStudio Tips Twitter account, <https://twitter.com/rstudiotips>, and find one tip that looks interesting.
    Practice using it!

2. What other common mistakes will RStudio diagnostics report?
    Read <https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics> to find out.