UK -> US spelling, multi-line alt text, YAML chunk opts

Mine Çetinkaya-Rundel 2022-05-08 01:32:25 -04:00
parent e6b958b196
commit ec502237e2
13 changed files with 608 additions and 195 deletions

View File

@ -4,8 +4,6 @@
status("restructuring")
```
<!--# TO DO: This chapter got moved here from the wrangle section, make sure it makes sense in this new location, doesn't assume anything that comes after it. -->
## Introduction
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
@ -17,7 +15,10 @@ We'll finish with a few pointers to packages that are useful for other types of
In this chapter, you'll learn how to load flat files in R with the **readr** package, which is part of the core tidyverse.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
@ -42,21 +43,30 @@ Not only are csv files one of the most common forms of data storage, but once yo
Here is what a simple CSV file with a row for column names (also commonly referred to as the header row) and six rows of data looks like.
```{r echo = FALSE, message = FALSE}
```{r}
#| echo: false
#| message: false
read_lines("data/students.csv") |> cat(sep = "\n")
```
Note that the `,`s separate the columns.
Table \@ref(tab:students-table) shows a representation of the same data as a table.
```{r students-table, echo = FALSE, message = FALSE}
```{r}
#| label: students-table
#| echo: false
#| message: false
read_csv("data/students.csv") |>
knitr::kable(caption = "Data from the students.csv file as a table.")
```
The first argument to `read_csv()` is the most important: it's the path to the file to read.
```{r, message = TRUE}
```{r}
#| message: true
students <- read_csv("data/students.csv")
```
@ -67,7 +77,9 @@ This message is an important part of readr, which we'll come back to in Section
You can also supply an inline csv file.
This is useful for experimenting with readr and for creating reproducible examples to share with others:
```{r message = FALSE}
```{r}
#| message: false
read_csv("a,b,c
1,2,3
4,5,6")
@ -79,7 +91,9 @@ There are two cases where you might want to tweak this behavior:
1. Sometimes there are a few lines of metadata at the top of the file.
You can use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop all lines that start with (e.g.) `#`.
```{r message = FALSE}
```{r}
#| message: false
read_csv("The first line of metadata
The second line of metadata
x,y,z
@ -93,7 +107,9 @@ There are two cases where you might want to tweak this behavior:
2. The data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label them sequentially from `X1` to `Xn`:
```{r message = FALSE}
```{r}
#| message: false
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```
@ -101,13 +117,17 @@ There are two cases where you might want to tweak this behavior:
Alternatively you can pass `col_names` a character vector which will be used as the column names:
```{r message = FALSE}
```{r}
#| message: false
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```
Another option that commonly needs tweaking is `na`: this specifies the value (or values) that are used to represent missing values in your file:
```{r message = FALSE}
```{r}
#| message: false
read_csv("a,b,c\n1,2,.", na = ".")
```
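As a combined illustration of the options discussed above, the sketch below reads an inline CSV that has a comment line, no header row, and `.` as its missing-value marker. The inline text is made up for this example, and wrapping it in `I()` and passing `show_col_types = FALSE` assume readr 2.0 or later.

```r
library(readr)

# Hypothetical inline data: one comment line, no header row, "." for missing
csv_text <- "# exported by a hypothetical tool
1,2,.
4,5,6"

df <- read_csv(
  I(csv_text),                   # I() marks the string as literal data, not a path
  comment = "#",                 # drop lines that start with #
  col_names = c("x", "y", "z"),  # supply names because there is no header row
  na = ".",                      # treat "." as a missing value
  show_col_types = FALSE         # silence the column specification message
)

df
```

Each argument handles one quirk of the file, so they can be added or removed independently as the input changes.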
@ -121,7 +141,9 @@ Let's take another look at the `students` data.
In the `favourite.food` column, there are a bunch of food items and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
This is something we can address using the `na` argument.
```{r message = FALSE}
```{r}
#| message: false
students <- read_csv("data/students.csv", na = c("N/A", ""))
students
@ -134,7 +156,9 @@ This function takes in a data frame and returns a data frame with variable names
[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use `|>`.
```{r message = FALSE}
```{r}
#| message: false
library(janitor)
students |>
clean_names()
@ -187,14 +211,18 @@ There are a few good reasons to favor readr functions over the base equivalents:
To prevent them from causing problems they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
What argument to `read_csv()` do you need to specify to read the following text into a data frame?
```{r, eval = FALSE}
```{r}
#| eval: false
"x,y\n1,'a,b'"
```
5. Identify what is wrong with each of the following inline CSV files.
What happens when you run the code?
```{r, eval = FALSE}
```{r}
#| eval: false
read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
read_csv("a,b\n\"1")
@ -239,14 +267,19 @@ If you want to export a csv file to Excel, use `write_excel_csv()` --- this writ
The most important arguments are `x` (the data frame to save), and `file` (the location to save it).
You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
```{r, eval = FALSE}
```{r}
#| eval: false
write_csv(students, "students.csv")
```
Now let's read that csv file back in.
Note that the type information is lost when you save to csv:
```{r, warning = FALSE, message = FALSE}
```{r}
#| warning: false
#| message: false
students
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
@ -265,7 +298,9 @@ There are two alternatives:
2. The feather package implements a fast binary file format that can be shared across programming languages:
```{r, eval = FALSE}
```{r}
#| eval: false
library(feather)
write_feather(students, "students.feather")
read_feather("students.feather")
@ -283,7 +318,9 @@ There are two alternatives:
Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in Chapter \@ref(list-columns)); feather currently does not.
```{r, include = FALSE}
```{r}
#| include: false
file.remove("students-2.csv")
file.remove("students.rds")
```
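To make the point about lost type information concrete, here is a minimal sketch that writes the same data frame to CSV and to RDS and compares what comes back. The data frame and the temporary file paths are invented for illustration; `write_rds()` and `read_rds()` are the readr wrappers around base R's RDS functions.

```r
library(readr)

# Hypothetical data with a factor column, which CSV cannot represent
df <- data.frame(id = 1:3, grade = factor(c("a", "b", "a")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(df, csv_path)
write_rds(df, rds_path)

from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

class(from_csv$grade)  # the factor comes back as a character column
class(from_rds$grade)  # the factor is preserved

# Clean up the temporary files
file.remove(csv_path, rds_path)
```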

View File

@ -27,7 +27,10 @@ If you particularly enjoy this chapter and learn more about the underlying theor
In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets.
tidyr is a member of the core tidyverse.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
@ -62,7 +65,8 @@ There are three interrelated rules that make a dataset tidy:
Figure \@ref(fig:tidy-structure) shows the rules visually.
```{r tidy-structure}
```{r}
#| label: tidy-structure
#| echo: FALSE
#| out.width: NULL
#| fig.cap: >
@ -73,6 +77,7 @@ Figure \@ref(fig:tidy-structure) shows the rules visually.
#| shows that each variable is a column. The second panel shows that each
#| observation is a row. The third panel shows that each value is
#| a cell.
knitr::include_graphics("images/tidy-1.png", dpi = 270)
```
@ -83,13 +88,14 @@ There are two main advantages:
If you have a consistent data structure, it's easier to learn the tools that work with it because they have an underlying uniformity.
2. There's a specific advantage to placing variables in columns because it allows R's vectorised nature to shine.
As you learned in Sections \@ref(mutate) and \@ref(summarise), most built-in R functions work with vectors of values.
As you learned in Sections \@ref(mutate) and \@ref(summarize), most built-in R functions work with vectors of values.
That makes transforming tidy data feel particularly natural.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
Here are a couple of small examples showing how you might work with `table1`.
```{r fig.width = 5}
```{r}
#| fig.width: 5
#| fig.alt: >
#| This figure shows the numbers of cases in 1999 and 2000 for
#| Afghanistan, Brazil, and China, with year on the x-axis and number
@ -115,8 +121,8 @@ table1 |>
# Visualise changes over time
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country, shape = country)) +
geom_line(aes(group = country), color = "grey50") +
geom_point(aes(color = country, shape = country)) +
scale_x_continuous(breaks = c(1999, 2000))
```
@ -230,7 +236,8 @@ billboard_tidy
Now we're in a good position to look at how song ranks vary over time by drawing a plot.
The code is shown below and the result is Figure \@ref(fig:billboard-ranks).
```{r billboard-ranks}
```{r}
#| label: billboard-ranks
#| fig.cap: >
#| A line plot showing how the rank of a song changes over time.
#| fig.alt: >
@ -239,6 +246,7 @@ The code is shown below and the result is Figure \@ref(fig:billboard-ranks).
#| rapidly accelerate to a low rank, and then decay again. There are
#| surprisingly few tracks in the region when week is >20 and rank is
#| >50.
billboard_tidy |>
ggplot(aes(week, rank, group = track)) +
geom_line(alpha = 1/3) +
@ -275,33 +283,37 @@ How does this transformation take place?
It's easier to see if we take it component by component.
Columns that are already variables need to be repeated, once for each column in `cols`, as shown in Figure \@ref(fig:pivot-variables).
```{r pivot-variables}
```{r}
#| label: pivot-variables
#| echo: FALSE
#| out.width: NULL
#| fig.alt: >
#| A diagram showing how `pivot_longer()` transforms a simple
#| dataset, using colour to highlight how the values in the `var` column
#| dataset, using color to highlight how the values in the `var` column
#| ("A", "B", "C") are each repeated twice in the output because there are
#| two columns being pivoted ("col1" and "col2").
#| fig.cap: >
#| Columns that are already variables need to be repeated, once for
#| each column that is pivoted.
knitr::include_graphics("diagrams/tidy-data/variables.png", dpi = 270)
```
The column names become values in a new variable, whose name is given by `names_to`, as shown in Figure \@ref(fig:pivot-names).
They need to be repeated once for each row in the original dataset.
```{r pivot-names}
#| echo: FALSE
```{r}
#| label: pivot-names
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| A diagram showing how `pivot_longer()` transforms a simple
#| data set, using colour to highlight how column names ("col1" and
#| data set, using color to highlight how column names ("col1" and
#| "col2") become the values in a new name `var` column. They are repeated
#| three times because there were three rows in the input.
#| fig.cap: >
#| The column names of pivoted columns become a new column.
knitr::include_graphics("diagrams/tidy-data/column-names.png", dpi = 270)
```
@ -309,18 +321,20 @@ The cell values also become values in a new variable, with name given by `values
They are unwound row by row.
Figure \@ref(fig:pivot-values) illustrates the process.
```{r pivot-values}
#| echo: FALSE
```{r}
#| label: pivot-values
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| A diagram showing how `pivot_longer()` transforms data,
#| using colour to highlight how the cell values (the numbers 1 to 6)
#| using color to highlight how the cell values (the numbers 1 to 6)
#| become values in a new `value` column. They are unwound row-by-row,
#| so the original rows (1,2), then (3,4), then (5,6), become a column
#| running from 1 to 6.
#| fig.cap: >
#| The number of values is preserved (not repeated), but they are unwound
#| row-by-row.
knitr::include_graphics("diagrams/tidy-data/cell-values.png", dpi = 270)
```
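The three steps described above (repeating existing variables, turning column names into a variable, and unwinding cell values row by row) can be checked on a small data frame. This is a sketch that assumes a dataset shaped like the one in the diagrams; the output column names `names` and `values` are chosen here for illustration.

```r
library(tidyr)
library(tibble)

# A small dataset shaped like the one described in the figure alt text
df <- tribble(
  ~var, ~col1, ~col2,
  "A",      1,     2,
  "B",      3,     4,
  "C",      5,     6
)

long <- df |>
  pivot_longer(
    cols = col1:col2,
    names_to = "names",    # column names become values of this variable
    values_to = "values"   # cell values become values of this variable
  )

long
```

In the result, each value of `var` appears twice (once per pivoted column), `names` alternates between `col1` and `col2`, and `values` runs from 1 to 6.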
@ -359,11 +373,12 @@ Conceptually, this is only a minor variation on the simpler case you've already
Figure \@ref(fig:pivot-multiple-names) shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns.
You can imagine this happening in two steps (first pivoting and then separating) but under the hood it happens in a single step because that gives better performance.
```{r pivot-multiple-names}
```{r}
#| label: pivot-multiple-names
#| echo: FALSE
#| out.width: NULL
#| fig.alt: >
#| A diagram that uses colour to illustrate how supplying `names_sep`
#| A diagram that uses color to illustrate how supplying `names_sep`
#| and multiple `names_to` creates multiple variables in the output.
#| The input has variable names "x_1" and "y_2" which are split up
#| by "_" to create name and number columns in the output. This is
@ -372,6 +387,7 @@ You can imagine this happening in two steps (first pivoting and then separating)
#| fig.cap: >
#| Pivoting with many variables in the column names means that each
#| column name now fills in values in multiple output columns.
knitr::include_graphics("diagrams/tidy-data/multiple-names.png", dpi = 270)
```
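A minimal sketch of the same idea in code, assuming an input with the column names "x_1" and "y_2" mentioned in the alt text; the `id` column and the cell values are invented for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical input: each column name carries two pieces of information
df <- tibble(
  id  = c("A", "B"),
  x_1 = c(1, 2),
  y_2 = c(3, 4)
)

long <- df |>
  pivot_longer(
    cols = !id,
    names_to = c("name", "number"),  # one output column per name component
    names_sep = "_",                 # split "x_1" into "x" and "1"
    values_to = "value"
  )

long
```

Note that `number` comes back as a character column, because it is extracted from the column names.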
@ -407,11 +423,12 @@ We again use `values_drop_na = TRUE`, since the shape of the input forces the cr
Figure \@ref(fig:pivot-names-and-values) illustrates the basic idea with a simpler example.
When you use `".value"` in `names_to`, the column names in the input contribute to both values and variable names in the output.
```{r pivot-names-and-values}
#| echo: FALSE
```{r}
#| label: pivot-names-and-values
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| A diagram that uses colour to illustrate how the special ".value"
#| A diagram that uses color to illustrate how the special ".value"
#| sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2",
#| and we want to use the first component ("x", "y") as a variable name
#| and the second ("1", "2") as the value for a new "id" column.
@ -420,6 +437,7 @@ When you use `".value"` in `names_to`, the column names in the input contribute
#| into two components: the first part determines the output column
#| name (`x` or `y`), and the second part determines the value of the
#| `id` column.
knitr::include_graphics("diagrams/tidy-data/names-and-values.png", dpi = 270)
```
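The `".value"` sentinel can be tried on a toy input with the column names from the alt text ("x_1", "x_2", "y_1", "y_2"). This is a sketch only; the `grp` column and the cell values are assumptions made for the example.

```r
library(tidyr)
library(tibble)

# Hypothetical input: the first name component should become a column name,
# the second should become a value of a new `id` column
df <- tibble(
  grp = c("A", "B"),
  x_1 = c(1, 2),
  x_2 = c(3, 4),
  y_1 = c(5, 6),
  y_2 = c(7, 8)
)

long <- df |>
  pivot_longer(
    cols = !grp,
    names_to = c(".value", "id"),  # ".value" keeps "x" and "y" as column names
    names_sep = "_"
  )

long
```

The output has one row per combination of `grp` and `id`, with `x` and `y` as separate columns, so no `values_to` argument is needed.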
@ -556,7 +574,7 @@ Since you don't know how to work this sort of data yet, you'll want to follow th
```{r}
df %>%
group_by(id, name) %>%
summarise(n = n(), .groups = "drop") %>%
summarize(n = n(), .groups = "drop") %>%
filter(n > 1L)
```
@ -665,7 +683,7 @@ For example, if you're interested in just the total number of missing values in
```{r}
cms_patient_experience |>
group_by(org_pac_id) |>
summarise(
summarize(
n_miss = sum(is.na(prf_rate)),
n = n(),
)
@ -703,7 +721,9 @@ Depending on what you want to do next you might finding any of the following thr
- If you wanted to display the distribution of each metric, you might keep it as is so you could facet by `measure_abbr`.
```{r, fig.show='hide'}
```{r}
#| fig.show: "hide"
cms_patient_care |>
filter(type == "observed") |>
ggplot(aes(score)) +
@ -713,7 +733,9 @@ Depending on what you want to do next you might finding any of the following thr
- If you wanted to explore how different metrics are related, you might put the measure names in the columns so you could compare them in scatterplots.
```{r, fig.show='hide'}
```{r}
#| fig.show: "hide"
cms_patient_care |>
filter(type == "observed") |>
select(-type) |>

View File

@ -18,7 +18,9 @@ We'll come back these functions in more detail in later chapters, as we start to
In this chapter we'll focus on the dplyr package, another core member of the tidyverse.
We'll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.
```{r setup}
```{r}
#| label: setup
library(nycflights13)
library(tidyverse)
```
@ -382,7 +384,7 @@ ggplot(flights, aes(air_time - airtime2)) + geom_histogram()
So far you've learned about functions that work with rows and columns.
dplyr gets even more powerful when you add in the ability to work with groups.
In this section, we'll focus on the most important functions: `group_by()`, `summarise()`, and the slice family of functions.
In this section, we'll focus on the most important functions: `group_by()`, `summarize()`, and the slice family of functions.
### `group_by()`
@ -396,18 +398,18 @@ flights |>
`group_by()` doesn't change the data but, if you look closely at the output, you'll notice that it's now "grouped by" month.
This means subsequent operations will now work "by month".
### `summarise()` {#summarise}
### `summarize()` {#summarize}
The most important grouped operation is a summary.
It collapses each group to a single row[^data-transform-3].
Here we compute the average departure delay by month:
[^data-transform-3]: This is a slightly simplification; later on you'll learn how to use `summarise()` to produce multiple summary rows for each group.
[^data-transform-3]: This is a slight simplification; later on you'll learn how to use `summarize()` to produce multiple summary rows for each group.
```{r}
flights |>
group_by(month) |>
summarise(
summarize(
delay = mean(dep_delay)
)
```
@ -419,18 +421,18 @@ We'll come back to discuss missing values in Chapter \@ref(missing-values), but
```{r}
flights |>
group_by(month) |>
summarise(
summarize(
delay = mean(dep_delay, na.rm = TRUE)
)
```
You can create any number of summaries in a single call to `summarise()`.
You can create any number of summaries in a single call to `summarize()`.
You'll learn various useful summaries in the upcoming chapters, but one very useful summary is `n()`, which returns the number of rows in each group:
```{r}
flights |>
group_by(month) |>
summarise(
summarize(
delay = mean(dep_delay, na.rm = TRUE),
n = n()
)
@ -462,7 +464,7 @@ This is similar to computing the max delay with `summarize()`, but you get the w
```{r}
flights |>
group_by(dest) |>
summarise(max_delay = max(arr_delay, na.rm = TRUE))
summarize(max_delay = max(arr_delay, na.rm = TRUE))
```
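The difference between the two approaches is easier to see on a toy table. The sketch below uses an invented data frame rather than `flights`: `summarize()` returns one number per group, while `slice_max()` returns the whole row that holds that number.

```r
library(dplyr)

# Hypothetical stand-in for the flights data
toy <- tibble(
  dest      = c("A", "A", "B", "B"),
  flight    = c(1, 2, 3, 4),
  arr_delay = c(10, 50, NA, 5)
)

# One summary value per destination
max_by_dest <- toy |>
  group_by(dest) |>
  summarize(max_delay = max(arr_delay, na.rm = TRUE))

# The full row of the most delayed flight per destination
top_by_dest <- toy |>
  group_by(dest) |>
  slice_max(arr_delay, n = 1)

max_by_dest
top_by_dest
```

Here `top_by_dest` keeps the `flight` column, so you can see which flight produced the maximum delay.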
### Grouping by multiple variables
@ -482,16 +484,18 @@ To make it obvious what's happening, dplyr displays a message that tells you how
```{r}
daily_flights <- daily |>
summarise(
summarize(
n = n()
)
```
If you're happy with this behavior, you can explicitly request it in order to suppress the message:
```{r, results = FALSE}
```{r}
#| results: false
daily_flights <- daily |>
summarise(
summarize(
n = n(),
.groups = "drop_last"
)
@ -501,13 +505,13 @@ Alternatively, change the default behavior by setting a different value, e.g. `"
### Ungrouping
You might also want to remove grouping outside of `summarise()`.
You might also want to remove grouping outside of `summarize()`.
You can do this with `ungroup()`.
```{r}
daily |>
ungroup() |>
summarise(
summarize(
delay = mean(dep_delay, na.rm = TRUE),
flights = n()
)
@ -520,7 +524,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
1. Which carrier has the worst delays?
Challenge: can you disentangle the effects of bad airports vs. bad carriers?
Why/why not?
(Hint: think about `flights |> group_by(carrier, dest) |> summarise(n())`)
(Hint: think about `flights |> group_by(carrier, dest) |> summarize(n())`)
2. Find the most delayed flight to each destination.
@ -539,15 +543,16 @@ That way, you can ensure that you're not drawing conclusions based on very small
For example, let's look at the planes (identified by their tail number) that have the highest average delays:
```{r}
#| fig.alt: >
#| fig-alt: >
#| A frequency histogram showing the distribution of flight delays.
#| The distribution is unimodal, with a large spike around 0, and
#| asymmetric: very few flights leave more than 30 minutes early,
#| but flights are delayed up to 5 hours.
delays <- flights |>
filter(!is.na(arr_delay), !is.na(tailnum)) |>
group_by(tailnum) |>
summarise(
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
@ -560,11 +565,12 @@ Wow, there are some planes that have an *average* delay of 5 hours (300 minutes)
That seems pretty surprising, so let's draw a scatterplot of number of flights vs. average delay:
```{r}
#| fig.alt: >
#| fig-alt: >
#| A scatterplot showing number of flights versus average delay. Delays
#| for planes with a very small number of flights have very high variability
#| (from -50 to ~300), but the variability rapidly decreases as the
#| number of flights increases.
ggplot(delays, aes(n, delay)) +
geom_point(alpha = 1/10)
```
@ -576,14 +582,16 @@ The shape of this plot is very characteristic: whenever you plot a mean (or othe
When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:
```{r, warning = FALSE}
#| fig.alt: >
```{r}
#| warning: false
#| fig-alt: >
#| Now that the y-axis (average delay) is smaller (-20 to 60 minutes),
#| we can see a more complicated story. The smooth line suggests
#| an initial decrease in average delay from 10 minutes to 0 minutes
#| as number of flights per plane increases from 25 to 100.
#| This is followed by a gradual increase up to 10 minutes for 250
#| flights, then a gradual decrease to ~5 minutes at 500 flights.
delays |>
filter(n > 25) |>
ggplot(aes(n, delay)) +
@ -600,7 +608,7 @@ The following code uses data from the **Lahman** package to compare what proport
```{r}
batters <- Lahman::Batting |>
group_by(playerID) |>
summarise(
summarize(
perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
n = sum(AB, na.rm = TRUE)
)
@ -613,13 +621,15 @@ When we plot the skill of the batter (measured by the batting average, `ba`) aga
2. There's a positive correlation between skill (`perf`) and opportunities to hit the ball (`n`) because obviously teams want to give their best batters the most opportunities to hit the ball.
```{r, warning = FALSE}
#| fig.alt: >
```{r}
#| warning: false
#| fig-alt: >
#| A scatterplot of number of batting opportunities vs batting performance
#| overlaid with a smoothed line. Average performance increases sharply
#| from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance
#| continues to increase linearly at a much shallower slope reaching
#| ~0.3 when n is ~15,000.
batters |>
filter(n > 100) |>
ggplot(aes(n, perf)) +

View File

@ -1,4 +1,4 @@
# Data visualisation {#data-visualisation}
# Data visualization {#data-visualisation}
## Introduction
@ -73,7 +73,10 @@ To learn more about `mpg`, open its help page by running `?mpg`.
To plot `mpg`, run this code to put `displ` on the x-axis and `hwy` on the y-axis:
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars that
#| shows a negative association.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
@ -104,6 +107,7 @@ To make a graph, replace the bracketed sections in the code below with a dataset
```{r}
#| eval: false
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
@ -137,11 +141,15 @@ How can you explain these cars?
```{r}
#| echo: false
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. Cars with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon stand out from the rest of the data and are highlighted in red."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars that
#| shows a negative association. Cars with engine size greater than 5 litres
#| and highway fuel efficiency greater than 20 miles per gallon stand out from
#| the rest of the data and are highlighted in red.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = dplyr::filter(mpg, displ > 5, hwy > 20), colour = "red", size = 2.2)
geom_point(data = dplyr::filter(mpg, displ > 5, hwy > 20), color = "red", size = 2.2)
```
Let's hypothesize that the cars are hybrids.
@ -158,14 +166,17 @@ Here we change the levels of a point's size, shape, and color to make the point
```{r}
#| echo: false
#| fig-asp: 1/4
#| fig-alt: "Diagram that shows four plotting characters next to each other. The first is a large circle, the second is a small circle, the third is a triangle, and the fourth is a blue circle."
#| fig.asp: 1/4
#| fig-alt: >
#| Diagram that shows four plotting characters next to each other. The first
#| is a large circle, the second is a small circle, the third is a triangle,
#| and the fourth is a blue circle.
ggplot() +
geom_point(aes(1, 1), size = 20) +
geom_point(aes(2, 1), size = 10) +
geom_point(aes(3, 1), size = 20, shape = 17) +
geom_point(aes(4, 1), size = 20, colour = "blue") +
geom_point(aes(4, 1), size = 20, color = "blue") +
scale_x_continuous(NULL, limits = c(0.5, 4.5), labels = NULL) +
scale_y_continuous(NULL, limits = c(0.9, 1.1), labels = NULL) +
theme(aspect.ratio = 1/3)
@ -175,7 +186,12 @@ You can convey information about your data by mapping the aesthetics in your plo
For example, you can map the colors of your points to the `class` variable to reveal the class of each car.
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The points representing each car are coloured according to the class of the car. The legend on the right of the plot shows the mapping between colours and levels of the class variable: 2seater, compact, midsize, minivan, pickup, or suv."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars that
#| shows a negative association. The points representing each car are colored
#| according to the class of the car. The legend on the right of the plot
#| shows the mapping between colors and levels of the class variable:
#| 2seater, compact, midsize, minivan, pickup, or suv.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
@ -197,7 +213,12 @@ In this case, the exact size of each point would reveal its class affiliation.
We get a *warning* here, because mapping an unordered variable (`class`) to an ordered aesthetic (`size`) is generally not a good idea.
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The points representing each car are sized according to the class of the car. The legend on the right of the plot shows the mapping between colours and levels of the class variable -- going from small to large: 2seater, compact, midsize, minivan, pickup, or suv."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars that
#| shows a negative association. The points representing each car are sized
#| according to the class of the car. The legend on the right of the plot
#| shows the mapping between colors and levels of the class variable -- going
#| from small to large: 2seater, compact, midsize, minivan, pickup, or suv.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
@ -210,9 +231,15 @@ Similarly, we could have mapped `class` to the *alpha* aesthetic, which controls
#| out-width: "50%"
#| fig-align: "default"
#| warning: false
#| fig-asp: 1/2
#| fig-cap: ""
#| fig-alt: "Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars in ggplot2::mpg and showing a negative association. In the plot on the left class is mapped to the alpha aesthetic, resulting in different transparency levels for each level of class. In the plot on the right class is mapped the shape aesthetic, resulting in different plotting character shapes for each level of class. Each plot comes with a legend that shows the mapping between alpha level or shape and levels of the class variable."
#| fig-alt: >
#| Two scatterplots next to each other, both visualizing highway fuel
#| efficiency versus engine size of cars and showing a negative association.
#| In the plot on the left class is mapped to the alpha aesthetic, resulting
#| in different transparency levels for each level of class. In the plot on
#| the right class is mapped to the shape aesthetic, resulting in different
#| plotting character shapes for each level of class. Each plot comes with a
#| legend that shows the mapping between alpha level or shape and levels of
#| the class variable.
# Left
ggplot(data = mpg) +
@ -240,7 +267,9 @@ You can also *set* the aesthetic properties of your geom manually.
For example, we can make all of the points in our plot blue:
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. All points are blue."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars that
#| shows a negative association. All points are blue.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
@ -259,9 +288,25 @@ You'll need to pick a value that makes sense for that aesthetic:
#| label: shapes
#| echo: false
#| warning: false
#| fig-asp: 1/2.75
#| fig-cap: "R has 25 built in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the `colour` and `fill` aesthetics. The hollow shapes (0--14) have a border determined by `colour`; the solid shapes (15--20) are filled with `colour`; the filled shapes (21--24) have a border of `colour` and are filled with `fill`."
#| fig-alt: "Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue."
#| fig.asp: 1/2.75
#| fig.align: "center"
#| fig-cap: >
#| R has 25 built in shapes that are identified by numbers. There are some
#| seeming duplicates: for example, 0, 15, and 22 are all squares. The
#| difference comes from the interaction of the `color` and `fill`
#| aesthetics. The hollow shapes (0--14) have a border determined by `color`;
#| the solid shapes (15--20) are filled with `color`; the filled shapes
#| (21--24) have a border of `color` and are filled with `fill`.
#| fig-alt: >
#| Mapping between shapes and the numbers that represent them: 0 - square,
#| 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond,
#| 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus,
#| 10 - circle plus, 11 - triangles up and down, 12 - square plus,
#| 13 - circle cross, 14 - square and triangle down, 15 - filled square,
#| 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond,
#| 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue,
#| 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle
#| point-up blue, 25 - filled triangle point down blue.
shapes <- tibble(
shape = c(0, 1, 2, 5, 3, 4, 6:19, 22, 21, 24, 23, 20),
@ -285,7 +330,11 @@ ggplot(shapes, aes(x, y)) +
Why are the points not blue?
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. All points are red and the legend shows a red point that is mapped to the word 'blue'."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars
#| that shows a negative association. All points are red and
#| the legend shows a red point that is mapped to the word 'blue'.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```
@ -304,7 +353,7 @@ ggplot(shapes, aes(x, y)) +
What shapes does it work with?
(Hint: use `?geom_point`)
6. What happens if you map an aesthetic to something other than a variable name, like `aes(colour = displ < 5)`?
6. What happens if you map an aesthetic to something other than a variable name, like `aes(color = displ < 5)`?
Note, you'll also need to specify x and y.
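One way to explore the last exercise is sketched below. It assumes ggplot2 is installed and uses the `mpg` data that ships with it; the names `p` and `n_colors` are only for illustration. Because `displ < 5` is an expression rather than a column, ggplot2 evaluates it for every row and treats the resulting logical values as a two-level variable.

```r
library(ggplot2)

# Map color to a logical expression instead of a bare variable name.
# x and y still have to be supplied, as the exercise notes.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

# layer_data() returns the data after the aesthetics have been computed.
# The expression yields two distinct colors, one for TRUE and one for FALSE.
n_colors <- length(unique(layer_data(p)$colour))
```

Printing `p` draws the plot, with a legend whose title is the expression itself.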
## Common problems
@ -347,7 +396,9 @@ The first argument of `facet_wrap()` is a formula, which you create with `~` fol
The variable that you pass to `facet_wrap()` should be discrete.
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by class, with facets spanning two rows."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars,
#| faceted by class, with facets spanning two rows.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
@ -359,7 +410,12 @@ The first argument of `facet_grid()` is also a formula.
This time the formula should contain two variable names separated by a `~`.
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by number of cylinders across rows and by type of drive train across columns. This results in a 4x3 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and front wheel drive."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars, faceted
#| by number of cylinders across rows and by type of drive train across
#| columns. This results in a 4x3 grid of 12 facets. Some of these facets have
#| no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and front
#| wheel drive.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
@ -376,7 +432,10 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o
How do they relate to this plot?
```{r}
#| fig-alt: "Scatterplot of number of cycles versus type of drive train of cars in ggplot2::mpg. Shows that there are no cars with 5 cylinders that are 4 wheel drive or with 4 or 5 cylinders that are front wheel drive."
#| fig-alt: >
#| Scatterplot of number of cylinders versus type of drive train of cars.
#| The plot shows that there are no cars with 5 cylinders that are 4
#| wheel drive or with 4 or 5 cylinders that are front wheel drive.
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
@ -407,7 +466,7 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o
facet_wrap(~ class, nrow = 2)
```
What are the advantages to using faceting instead of the colour aesthetic?
What are the advantages to using faceting instead of the color aesthetic?
What are the disadvantages?
How might the balance change if you had a larger dataset?
@ -421,7 +480,10 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o
What does this say about when to place a faceting variable across rows or columns?
```{r}
#| fig-alt: "Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by drive train. In the top plot, facet are organized across rows and in the second, across columns."
#| fig-alt: >
#| Two faceted plots, both visualizing highway fuel efficiency versus
#| engine size of cars, faceted by drive train. In the top plot, facets
#| are organized across rows and in the second, across columns.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
@ -436,7 +498,9 @@ If you prefer to not facet in the rows or columns dimension, use a `.` instead o
How do the positions of the facet labels change?
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, faceted by type of drive train across rows."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars,
#| faceted by type of drive train across rows.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
@ -453,7 +517,11 @@ How are these two plots similar?
#| fig-width: 4
#| out-width: "50%"
#| fig-align: "default"
#| fig-alt: "Two plots: the plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed."
#| fig-alt: >
#| There are two plots. The plot on the left is a scatterplot of highway fuel
#| efficiency versus engine size of cars and the plot on the right shows a
#| smooth curve that follows the trajectory of the relationship between these
#| variables. A confidence interval around the smooth curve is also displayed.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
@ -497,7 +565,11 @@ On the other hand, you *could* set the linetype of a line.
```{r}
#| message: false
#| fig-alt: "A plot of highway fuel efficiency versus engine size of cars in ggplot2::mpg. The data are represented with smooth curves, which use a different line type (solid, dashed, or long dashed) for each type of drive train. Confidence intervals around the smooth curves are also displayed."
#| fig-alt: >
#| A plot of highway fuel efficiency versus engine size of cars. The data are
#| represented with smooth curves, which use a different line type (solid,
#| dashed, or long dashed) for each type of drive train. Confidence intervals
#| around the smooth curves are also displayed.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
@ -512,7 +584,11 @@ If this sounds strange, we can make it more clear by overlaying the lines on top
```{r}
#| echo: false
#| message: false
#| fig-alt: "A plot of highway fuel efficiency versus engine size of cars in ggplot2::mpg. The data are represented with points (coloured by drive train) as well as smooth curves (where line type is determined based on drive train as well). Confidence intervals around the smooth curves are also displayed."
#| fig-alt: >
#| A plot of highway fuel efficiency versus engine size of cars. The data
#| are represented with points (colored by drive train) as well as smooth
#| curves (where line type is determined based on drive train as well).
#| Confidence intervals around the smooth curves are also displayed.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
@ -538,7 +614,15 @@ It is convenient to rely on this feature because the group aesthetic by itself d
#| fig-align: "default"
#| out-width: "33%"
#| message: false
#| fig-alt: "Three plots, each with highway fuel efficiency on the y-axis and engine size of cars in ggplot2::mpg, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colours, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed."
#| fig-alt: >
#| Three plots, each with highway fuel efficiency on the y-axis and engine
#| size of cars on the x-axis, where data are represented by a smooth curve.
#| The first plot
#| only has these two variables, the center plot has three separate smooth
#| curves for each level of drive train, and the right plot not only has the
#| same three separate smooth curves for each level of drive train but these
#| curves are plotted in different colors, without a legend explaining which
#| color maps to which level. Confidence intervals around the smooth curves
#| are also displayed.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
@ -557,7 +641,10 @@ To display multiple geoms in the same plot, add multiple geom functions to `ggpl
```{r}
#| message: false
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg with a smooth curve overlaid. A confidence interval around the smooth curves is also displayed."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars with a
#| smooth curve overlaid. A confidence interval around the smooth curve is
#| also displayed.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
@ -585,7 +672,11 @@ This makes it possible to display different aesthetics in different layers.
```{r}
#| message: false
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, where points are coloured according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars, where
#| points are colored according to the car class. A smooth curve following
#| the trajectory of the relationship between highway fuel efficiency versus
#| engine size of cars is overlaid along with a confidence interval around it.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
@ -598,7 +689,12 @@ The local data argument in `geom_smooth()` overrides the global data argument in
```{r}
#| message: false
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg, where points are coloured according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of subcompact cars is overlaid along with a confidence interval around it."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars, where
#| points are colored according to the car class. A smooth curve following
#| the trajectory of the relationship between highway fuel efficiency versus
#| engine size of subcompact cars is overlaid along with a confidence interval
#| around it.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
@ -656,7 +752,21 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
#| fig-width: 3
#| out-width: "50%"
#| fig-align: "default"
#| fig-alt: "There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars in ggplot2::mpg are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colours for each level of drive train. In the fourth plot the points are represented in different colours for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colours for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colours for each level of drive train and they have a thick white border."
#| fig-alt: >
#| There are six scatterplots in this figure, arranged in a 3x2 grid.
#| In all plots highway fuel efficiency of cars is on the y-axis and
#| engine size is on the x-axis. The first plot shows all points in black
#| with a smooth curve overlaid on them. In the second plot points are
#| also all black, with separate smooth curves overlaid for each level of
#| drive train. On the third plot, points and the smooth curves are
#| represented in different colors for each level of drive train. In the
#| fourth plot the points are represented in different colors for each
#| level of drive train but there is only a single smooth line fitted to
#| the whole data. In the fifth plot, points are represented in different
#| colors for each level of drive train, and separate smooth curves with
#| different line types are fitted to each level of drive train. And
#| finally in the sixth plot points are represented in different colors
#| for each level of drive train and they have a thick white border.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
@ -674,8 +784,8 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(aes(linetype = drv), se = FALSE)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size = 4, colour = "white") +
geom_point(aes(colour = drv))
geom_point(size = 4, color = "white") +
geom_point(aes(color = drv))
```
## Statistical transformations
@ -688,7 +798,10 @@ The `diamonds` dataset is in the ggplot2 package and contains information on \~5
The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
```{r}
#| fig-alt: "Bar chart of number of each each cut of diamond in the ggplots::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 12000 very good, 14000 premium, and 22000 ideal cut diamonds."
#| fig-alt: >
#| Bar chart of number of each cut of diamond. There are roughly 1500
#| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut
#| diamonds.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
@ -712,7 +825,12 @@ The figure below describes how this process works with `geom_bar()`.
```{r}
#| echo: false
#| out-width: "100%"
#| fig-alt: 'A figure demonstrating three steps of creating a bar chart: 1. geom_bar() begins with the diamonds data set. 2. geom_bar() transforms the data with the "count" stat, which returns a data set of cut values and counts. 3. geom_bar() uses the transformed data to build the plot. cut is mapped to the x-axis, count is mapped to the y-axis.'
#| fig-alt: >
#| A figure demonstrating three steps of creating a bar chart.
#| Step 1. geom_bar() begins with the diamonds data set. Step 2. geom_bar()
#| transforms the data with the count stat, which returns a data set of
#| cut values and counts. Step 3. geom_bar() uses the transformed data to
#| build the plot. cut is mapped to the x-axis, count is mapped to the y-axis.
knitr::include_graphics("images/visualization-stat-bar.png")
```
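The transformation shown in the figure can also be reproduced by hand. This is a minimal sketch, assuming ggplot2 is installed; the names `counts`, `by_hand`, and `built_in` are made up for the illustration. It computes the counts first and then plots them with `geom_col()`, which leaves the data untransformed.

```r
library(ggplot2)

# Step 2 of the figure, done by hand: count the rows for each value of cut.
counts <- as.data.frame(table(cut = diamonds$cut), responseName = "count")

# Step 3: plot the precomputed counts. geom_col() uses the "identity" stat,
# so the bar heights come straight from the count column.
by_hand <- ggplot(data = counts) +
  geom_col(mapping = aes(x = cut, y = count))

# The counts that geom_bar() computes internally, for comparison.
built_in <- layer_data(
  ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut))
)
```

Comparing `counts$count` with `built_in$count` is a quick way to check that the two routes agree.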
@ -726,7 +844,10 @@ You can generally use geoms and stats interchangeably.
For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
```{r}
#| fig-alt: "Bar chart of number of each each cut of diamond in the ggplots::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 12000 very good, 14000 premium, and 22000 ideal cut diamonds."
#| fig-alt: >
#| Bar chart of number of each cut of diamond. There are roughly 1500
#| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut
#| diamonds.
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
@ -743,7 +864,10 @@ However, there are three reasons why you might need to use a stat explicitly:
```{r}
#| warning: false
#| fig-alt: "Bar chart of number of each each cut of diamond in the ggplots::diamonds dataset. There are roughly 1500 fair diamonds, 5000 good, 22000 ideal, 14000 premium, and 12000 very good, cut diamonds."
#| fig-alt: >
#| Bar chart of number of each cut of diamond. There are roughly 1500
#| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut
#| diamonds.
demo <- tribble(
~cut, ~freq,
@ -765,7 +889,10 @@ However, there are three reasons why you might need to use a stat explicitly:
For example, you might want to display a bar chart of proportions, rather than counts:
```{r}
#| fig-alt: "Bar chart of proportion of each each cut of diamond in the ggplots::diamonds dataset. Roughly, fair diamonds make up 0.03, good 0.09, very good 0.22, premium 26, and ideal 0.40."
#| fig-alt: >
#| Bar chart of proportion of each cut of diamond. Roughly, Fair
#| diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 0.26, and
#| Ideal 0.40.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
@ -777,7 +904,12 @@ However, there are three reasons why you might need to use a stat explicitly:
For example, you might use `stat_summary()`, which summarizes the y values for each unique x value, to draw attention to the summary that you're computing:
```{r}
#| fig-alt: "A plot with depth on the y-axis and cut on the x-axis (with levels fair, good, very good, premium, and ideal) of diamonds in ggplot2::diamonds. For each level of cut, vertical lines extend from minimum to maximum depth for diamonds in that cut category, and the median depth is indicated on the line with a point."
#| fig-alt: >
#| A plot with depth on the y-axis and cut on the x-axis (with levels
#| fair, good, very good, premium, and ideal) of diamonds. For each level
#| of cut, vertical lines extend from minimum to maximum depth for diamonds
#| in that cut category, and the median depth is indicated on the line
#| with a point.
ggplot(data = diamonds) +
stat_summary(
@ -823,15 +955,18 @@ To see a complete list of stats, try the [ggplot2 cheatsheet](http://rstudio.com
## Position adjustments
There's one more piece of magic associated with bar charts.
You can colour a bar chart using either the `colour` aesthetic, or, more usefully, `fill`:
You can color a bar chart using either the `color` aesthetic, or, more usefully, `fill`:
```{r}
#| out-width: "50%"
#| fig-align: "default"
#| fig-alt: "Two bar charts of cut of diamonds in ggplot2::diamonds. In the first plot, the bars have coloured borders. In the second plot, they're filled with colours. Heights of the bars correspond to the number of diamonds in each cut category."
#| fig-alt: >
#| Two bar charts of cut of diamonds. In the first plot, the bars have colored
#| borders. In the second plot, they're filled with colors. Heights of the
#| bars correspond to the number of diamonds in each cut category.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
geom_bar(mapping = aes(x = cut, color = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
```
@ -840,7 +975,12 @@ Note what happens if you map the fill aesthetic to another variable, like `clari
Each colored rectangle represents a combination of `cut` and `clarity`.
```{r}
#| fig-alt: "Segmented bar chart of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the coloured segments are proportional to the number of diamonds with a given clarity level within a given cut level."
#| fig-alt: >
#| Segmented bar chart of cut of diamonds, where each bar is filled with
#| colors for the levels of clarity. Heights of the bars correspond to the
#| number of diamonds in each cut category, and heights of the colored
#| segments are proportional to the number of diamonds with a given clarity
#| level within a given cut level.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
@ -856,11 +996,18 @@ If you don't want a stacked bar chart, you can use one of three other options: `
```{r}
#| out-width: "50%"
#| fig-align: "default"
#| fig-alt: "Two segmented bar charts of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the coloured segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colours, in the second plot the segments are only outlined with colours."
#| fig-alt: >
#| Two segmented bar charts of cut of diamonds, where each bar is filled
#| with colors for the levels of clarity. Heights of the bars correspond
#| to the number of diamonds in each cut category, and heights of the
#| colored segments are proportional to the number of diamonds with a
#| given clarity level within a given cut level. However the segments
#| overlap. In the first plot the segments are filled with transparent
#| colors, in the second plot the segments are only outlined with colors.
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
geom_bar(fill = NA, position = "identity")
```
@ -870,7 +1017,11 @@ If you don't want a stacked bar chart, you can use one of three other options: `
This makes it easier to compare proportions across groups.
```{r}
#| fig-alt: "Segmented bar chart of cut of diamonds in ggplot2::diamonds, where each bar is filled with colours for the levels of clarity. Height of each bar is 1 and heights of the coloured segments are proportional to the proportion of diamonds with a given clarity level within a given cut level."
#| fig-alt: >
#| Segmented bar chart of cut of diamonds, where each bar is filled with
#| colors for the levels of clarity. Height of each bar is 1 and heights
#| of the colored segments are proportional to the proportion of diamonds
#| with a given clarity level within a given cut level.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
@ -880,7 +1031,12 @@ If you don't want a stacked bar chart, you can use one of three other options: `
This makes it easier to compare individual values.
```{r}
#| fig-alt: "Dodged bar chart of cut of diamonds in ggplot2::diamonds. Dodged bars are grouped by levels of cut (fair, good, very good, premium, and ideal). In each group there are eight bars, one for each level of clarity, and filled with a different color for each level. Heights of these bars represent the number of diamonds with a given level of cut and clarity."
#| fig-alt: >
#| Dodged bar chart of cut of diamonds. Dodged bars are grouped by levels
#| of cut (fair, good, very good, premium, and ideal). In each group there
#| are eight bars, one for each level of clarity, and filled with a
#| different color for each level. Heights of these bars represent the
#| number of diamonds with a given level of cut and clarity.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
@ -891,8 +1047,10 @@ Recall our first scatterplot.
Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?
```{r}
#| echo: FALSE
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association."
#| echo: false
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars that
#| shows a negative association.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
@ -908,7 +1066,9 @@ You can avoid this gridding by setting the position adjustment to "jitter".
This spreads the points out because no two points are likely to receive the same amount of random noise.
```{r}
#| fig-alt: "Jittered scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association."
#| fig-alt: >
#| Jittered scatterplot of highway fuel efficiency versus engine size of cars.
#| The plot shows a negative association.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
@ -925,7 +1085,10 @@ To learn more about a position adjustment, look up the help page associated with
How could you improve it?
```{r}
#| fig-alt: "Scatterplot of highway fuel efficiency versus city fuel efficiency of cars in ggplot2::mpg that shows a positive association. The number of points visible in this plot is less than the number of points in the dataset."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus city fuel efficiency
#| of cars that shows a positive association. The number of points
#| visible in this plot is less than the number of points in the dataset.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
@ -952,7 +1115,13 @@ There are a three other coordinate systems that are occasionally helpful.
#| fig-width: 3
#| out-width: "50%"
#| fig-align: "default"
#| fig-alt: "Two side-by-side box plots of highway fuel efficiency of cars in ggplot2::mpg. A separate box plot is created for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv). In the first plot class is on the x-axis, in the second plot class is on the y-axis. The second plot makes it easier to read the names of the levels of class since they're listed down the y-axis, avoiding overlap."
#| fig-alt: >
#| Two side-by-side box plots of highway fuel efficiency of cars. A
#| separate box plot is created for cars in each level of class (2seater,
#| compact, midsize, minivan, pickup, subcompact, and suv). In the first
#| plot class is on the x-axis, in the second plot class is on the y-axis.
#| The second plot makes it easier to read the names of the levels of class
#| since they're listed down the y-axis, avoiding overlap.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
@ -966,7 +1135,10 @@ There are a three other coordinate systems that are occasionally helpful.
```{r}
#| fig-width: 3
#| fig-align: "default"
#| fig-alt: "Side-by-side box plots of highway fuel efficiency of cars in ggplot2::mpg. A separate box plot is drawn along the y-axis for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv)."
#| fig-alt: >
#| Side-by-side box plots of highway fuel efficiency of cars. A separate
#| box plot is drawn along the y-axis for cars in each level of class
#| (2seater, compact, midsize, minivan, pickup, subcompact, and suv).
ggplot(data = mpg, mapping = aes(y = class, x = hwy)) +
geom_boxplot()
@ -979,16 +1151,18 @@ There are a three other coordinate systems that are occasionally helpful.
#| fig-width: 3
#| out-width: "50%"
#| fig-align: "default"
#| message: FALSE
#| fig-alt: "Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it's correct."
#| message: false
#| fig-alt: >
#| Two maps of the boundaries of New Zealand. In the first plot the aspect
#| ratio is incorrect, in the second plot it's correct.
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
geom_polygon(fill = "white", color = "black")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
geom_polygon(fill = "white", color = "black") +
coord_quickmap()
```
@ -999,8 +1173,10 @@ There are a three other coordinate systems that are occasionally helpful.
#| fig-width: 3
#| out-width: "50%"
#| fig-align: "default"
#| fig-asp: 1
#| fig-alt: "Two plots: on the left is a bar chart of cut of diamonds in ggplot2::diamonds, on the right is a Coxcomb chart of the same data."
#| fig.asp: 1
#| fig-alt: >
#| There are two plots. On the left is a bar chart of cut of diamonds,
#| on the right is a Coxcomb chart of the same data.
bar <- ggplot(data = diamonds) +
geom_bar(
@ -1029,9 +1205,13 @@ There are a three other coordinate systems that are occasionally helpful.
What does `geom_abline()` do?
```{r}
#| fig-asp: 1
#| fig.asp: 1
#| out-width: "50%"
#| fig-alt: "Scatterplot of highway fuel efficiency versus engine size of cars in ggplot2::mpg that shows a negative association. The plot also has a straight line that follows the trend of the relationship between the variables but doesn't go through the cloud of points, it's beneath it."
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus city fuel efficiency of
#| cars that shows a positive association. The plot also has a straight
#| line that follows the trend of the relationship between the variables
#| but doesn't go through the cloud of points; it's beneath it.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
@ -1063,9 +1243,14 @@ The grammar of graphics is based on the insight that you can uniquely describe *
To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat).
```{r}
#| echo: FALSE
#| echo: false
#| out-width: "100%"
#| fig-alt: "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Steps 1 and 2 are annotated: 1. Begin with the diamonds dataset. 2. Compute counts for each cut value with stat_count()."
#| fig-alt: >
#| A figure demonstrating the steps for going from raw data to table of counts
#| where each row represents one level of cut and a count column shows how many
#| diamonds are in that cut level. Steps 1 and 2 are annotated. Step 1. Begin
#| with the diamonds dataset. Step 2. Compute counts for each cut value
#| with stat_count().
knitr::include_graphics("images/visualization-grammar-1.png")
```
@ -1075,9 +1260,14 @@ You could then use the aesthetic properties of the geoms to represent variables
You would map the values of each variable to the levels of an aesthetic.
```{r}
#| echo: FALSE
#| echo: false
#| out-width: "100%"
#| fig-alt: "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Each level is also mapped to a color. Steps 3 and 4 are annotated: 3. Represent each observation with a bar. 4. Map the fill of each bar to the ..count.. variable."
#| fig-alt: >
#| A figure demonstrating the steps for going from raw data to table of counts
#| where each row represents one level of cut and a count column shows how
#| many diamonds are in that cut level. Each level is also mapped to a color.
#| Steps 3 and 4 are annotated. Step 3. Represent each observation with a bar.
#| Step 4. Map the fill of each bar to the ..count.. variable.
knitr::include_graphics("images/visualization-grammar-2.png")
```
@ -1087,9 +1277,13 @@ At that point, you would have a complete graph, but you could further adjust the
You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.
```{r}
#| echo: FALSE
#| echo: false
#| out-width: "100%"
#| fig-alt: "A figure demonstrating the steps for going from raw data (ggplot2::diamonds) to bar chart where each bar represents one level of cut and filled in with a different color. Steps 5 and 6 are annotated: 5. Place geoms in a Cartesian coordinate system. 6. Map the y values to ..count.. and the x values to cut."
#| fig-alt: >
#| A figure demonstrating the steps for going from raw data to bar chart where
#| each bar represents one level of cut and filled in with a different color.
#| Steps 5 and 6 are annotated. Step 5. Place geoms in a Cartesian coordinate
#| system. Step 6. Map the y values to ..count.. and the x values to cut.
knitr::include_graphics("images/visualization-grammar-3.png")
```
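The steps in these figures can be written out as a single fully explicit layer. This is only a sketch, assuming ggplot2 3.3.0 or later (for `after_stat()`); the names `p` and `computed` are illustrative. In everyday code `geom_bar()` supplies the stat and position shown here as defaults, and the Cartesian coordinate system is added automatically.

```r
library(ggplot2)

# Every component named in the steps, spelled out explicitly:
# dataset, mappings, geom, stat, position adjustment, coordinate system.
p <- ggplot(data = diamonds) +
  layer(
    mapping = aes(x = cut, fill = after_stat(count)),
    geom = "bar",
    stat = "count",
    position = "stack"
  ) +
  coord_cartesian()

# The table of counts that the stat produced for this layer.
computed <- layer_data(p)
```

Printing `p` draws the bar chart; `computed` holds one row per level of `cut`.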


@ -29,7 +29,11 @@ By contributing to this book, you agree to abide by its terms.
## Acknowledgements {.unnumbered}
```{r, results = "asis", echo = FALSE, message = FALSE}
```{r}
#| results: "asis"
#| echo: false
#| message: false
library(dplyr)
contributors <- readr::read_csv("contributors.csv", col_types = list())
contributors <- contributors |>


@ -10,7 +10,15 @@ Data science is a huge field, and there's no way you can master it all by readin
The goal of this book is to give you a solid foundation in the most important tools, and enough knowledge to find the resources to learn more when necessary.
Our model of the tools needed in a typical data science project looks something like this:
```{r echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out.width: "75%"
#| fig.align: "center"
#| fig.alt: >
#| A diagram displaying the data science cycle: Import -> Tidy -> Understand
#| (which has the phases Transform -> Visualize -> Model in a cycle) ->
#| Communicate. Surrounding all of these is Communicate.
knitr::include_graphics("diagrams/data-science.png")
```
@ -139,7 +147,13 @@ For this book, make sure you have at least RStudio 1.6.0.
When you start RStudio, you'll see two key regions in the interface: the console pane, and the output pane.
```{r echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out.width: "75%"
#| fig.align: "center"
#| fig.alt: >
#| The RStudio IDE with the panes Console and Output highlighted.
knitr::include_graphics("diagrams/rstudio-console.png")
```
@ -156,7 +170,9 @@ All packages in the tidyverse share a common philosophy of data and R programmin
You can install the complete tidyverse with a single line of code:
```{r, eval = FALSE}
```{r}
#| eval: false
install.packages("tidyverse")
```
@ -186,7 +202,9 @@ As you tackle more data science projects with R, you'll learn new packages and n
In this book we'll use three data packages from outside the tidyverse:
```{r, eval = FALSE}
```{r}
#| eval: false
install.packages(c("nycflights13", "gapminder", "Lahman"))
```
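
Before installing, it can be handy to see which of these packages are already available. The following is a minimal sketch using only base R; the variable names `pkgs`, `installed`, and `not_installed` are illustrative and not part of the book's code.

```r
# The three data packages named above.
pkgs <- c("nycflights13", "gapminder", "Lahman")

# requireNamespace() returns TRUE if a package can be loaded and FALSE
# otherwise; it loads the namespace but does not attach the package.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Still to install: ", paste(not_installed, collapse = ", "))
}
```

You could then pass `not_installed` to `install.packages()` so that only the missing packages are downloaded.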
@ -197,7 +215,8 @@ These packages provide data on airline flights, world development, and baseball
The previous section showed you several examples of running R code.
Code in the book looks like this:
```{r, eval = TRUE}
```{r}
#| eval: true
1 + 2
```
@ -239,14 +258,18 @@ There are a few people we'd like to thank in particular, because they have spent
This book was written in the open, and many people contributed pull requests to fix minor problems.
Special thanks go to everyone who contributed via GitHub:
```{r, results = "asis", echo = FALSE, message = FALSE}
```{r}
#| results: "asis"
#| echo: false
#| message: false
library(dplyr)
# git --no-pager shortlog -ns > contribs.txt
contribs <- readr::read_tsv("contribs.txt", col_names = c("n", "name"))
contribs <- contribs |>
filter(!name %in% c("hadley", "Garrett", "Hadley Wickham",
"Garrett Grolemund")) |>
"Garrett Grolemund", "Mine Cetinkaya-Rundel")) |>
arrange(name) |>
mutate(uname = ifelse(!grepl(" ", name), paste0("@", name), name))


@ -7,7 +7,7 @@ status("restructuring")
## Introduction
You've already learned the basics of missing values earlier in the book.
You first saw them in Section \@ref(summarise) where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section \@ref(na-comparison).
You first saw them in Section \@ref(summarize) where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section \@ref(na-comparison).
Now we'll come back to them in more depth, so you can learn more of the details.
We'll start by discussing some general tools for working with missing values recorded as `NA`s.


@ -1,4 +1,4 @@
# Two-table verbs
# Two-table verbs {#relational-data}
```{r, results = "asis", echo = FALSE}
status("restructuring")


@ -9,6 +9,11 @@ The goal of data exploration is to generate many promising leads that you can la
```{r}
#| echo: false
#| out.width: "75%"
#| fig.alt: >
#| A diagram displaying the data science cycle: Import -> Tidy -> Explore
#| (which has the phases Transform -> Visualize -> Model in a cycle) ->
#| Communicate. Surrounding all of these is Communicate. Explore is highlighted.
knitr::include_graphics("diagrams/data-science-explore.png")
```


@ -41,6 +41,7 @@ All R statements where you create objects, **assignment** statements, have the s
```{r}
#| eval: false
object_name <- value
```
@ -85,7 +86,9 @@ Object names must start with a letter, and can only contain letters, numbers, `_
You want your object names to be descriptive, so you'll need to adopt a convention for multiple words.
We recommend **snake_case** where you separate lowercase words with `_`.
```{r, eval = FALSE}
```{r}
#| eval: false
i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
@ -124,7 +127,9 @@ r_rocks <- 2 ^ 3
Let's try to inspect it:
```{r, eval = FALSE}
```{r}
#| eval: false
r_rock
#> Error: object 'r_rock' not found
R_rocks
@ -139,7 +144,9 @@ Case matters.
R has a large collection of built-in functions that are called like this:
```{r eval = FALSE}
```{r}
#| eval: false
function_name(arg1 = val1, arg2 = val2, ...)
```
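
As a concrete instance of this pattern, here is a small sketch using `seq()` from base R. The object names `evens` and `same_evens` are made up for illustration.

```r
# seq() is a base R function; from, to, and by are its documented arguments.
evens <- seq(from = 2, to = 10, by = 2)
evens

# Arguments can also be matched by position, which is shorter but less explicit.
same_evens <- seq(2, 10, 2)
identical(evens, same_evens)
```

Naming the arguments costs a few keystrokes but makes the call easier to read later.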
@ -176,7 +183,12 @@ Usually, this means you've forgotten either a `"` or a `)`. Either add the missi
Note that the environment tab in the upper right pane displays all of the objects that you've created:
```{r, echo = FALSE, out.width = NULL}
```{r}
#| echo: false
#| fig-alt: >
#| Environment tab of RStudio which shows r_rocks, this_is_a_really_long_name,
#| x, and y in the Global Environment.
knitr::include_graphics("screenshots/rstudio-env.png")
```
@ -184,7 +196,9 @@ knitr::include_graphics("screenshots/rstudio-env.png")
1. Why does this code not work?
```{r, error = TRUE}
```{r}
#| error: true
my_variable <- 10
my_varıable
```
@ -194,7 +208,9 @@ knitr::include_graphics("screenshots/rstudio-env.png")
2. Tweak each of the following R commands so that they run correctly:
```{r, eval = FALSE}
```{r}
#| eval: false
libary(tidyverse)
ggplot(dota = mpg) +


@ -10,12 +10,15 @@ We briefly introduced pipes in the previous chapter, but before going too much f
To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M.
You'll need to make one change to your RStudio options to use `|>` instead of `%>%` as shown in Figure \@ref(fig:pipe-options); more on `%>%` shortly.
```{r pipe-options, out.width = NULL, echo = FALSE}
```{r}
#| label: pipe-options
#| echo: false
#| fig.cap: >
#| To insert `|>`, make sure the "Use native pipe" option is checked.
#| fig.alt: >
#| Screenshot showing the "Use native pipe operator" option which can
#| be found on the "Editing" panel of the "Code" options.
knitr::include_graphics("screenshots/rstudio-pipe-options.png")
```
@ -24,7 +27,9 @@ knitr::include_graphics("screenshots/rstudio-pipe-options.png")
Each individual dplyr verb is quite simple, so solving complex problems typically requires combining multiple verbs.
For example, the last chapter finished with a moderately complex pipe:
```{r, eval = FALSE}
```{r}
#| eval: false
flights |>
filter(!is.na(arr_delay), !is.na(tailnum)) |>
group_by(tailnum) |>
@ -39,7 +44,9 @@ Even though this pipe has four steps, it's easy to skim because the verbs come a
What would happen if we didn't have the pipe?
We could nest each function call inside the previous call:
```{r, eval = FALSE}
```{r}
#| eval: false
summarise(
group_by(
filter(
@ -56,7 +63,9 @@ summarise(
Or we could use a bunch of intermediate variables:
```{r, eval = FALSE}
```{r}
#| eval: false
flights1 <- filter(flights, !is.na(arr_delay), !is.na(tailnum))
flights2 <- group_by(flights1, tailnum)
flights3 <- summarise(flights2,
@ -72,7 +81,9 @@ While both of these forms have their time and place, the pipe generally produces
If you've been using the tidyverse for a while, you might be familiar with the `%>%` pipe provided by the **magrittr** package.
The magrittr package is included in the core tidyverse, so you can use `%>%` whenever you load the tidyverse:
```{r, message = FALSE}
```{r}
#| message: false
library(tidyverse)
mtcars %>%


@ -10,7 +10,12 @@ To give yourself more room to work, it's a great idea to use the script editor.
Open it up either by clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N.
Now you'll see four panes:
```{r echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out.width: "75%"
#| fig-alt: >
#| RStudio IDE with Editor, Console, and Output highlighted.
knitr::include_graphics("diagrams/rstudio-editor.png")
```
@ -68,7 +73,9 @@ If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete comman
It will also move the cursor to the next statement (beginning with `not_cancelled |>`).
That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.
```{r, eval = FALSE}
```{r}
#| eval: false
library(dplyr)
library(nycflights13)
@ -77,7 +84,7 @@ not_cancelled <- flights |>
not_cancelled |>
group_by(year, month, day) |>
summarise(mean = mean(dep_delay))
summarize(mean = mean(dep_delay))
```
Instead of running your code expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S.
@ -95,19 +102,41 @@ Over time, sending code to the console in this way will become so natural that y
The script editor will also highlight syntax errors with a red squiggly line and a cross in the sidebar:
```{r echo = FALSE, out.width = NULL}
```{r}
#| echo: false
#| out.width: NULL
#| fig-alt: >
#| Script editor with the script `x y <- 10`. A red X indicates that there is
#| a syntax error. The syntax error is also highlighted with a red squiggly line.
knitr::include_graphics("screenshots/rstudio-diagnostic.png")
```
Hover over the cross to see what the problem is:
```{r echo = FALSE, out.width = NULL}
```{r}
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| Script editor with the script `x y <- 10`. A red X indicates that there is
#| a syntax error. The syntax error is also highlighted with a red squiggly line.
#| Hovering over the X shows a text box with the text 'unexpected token y' and
#| 'unexpected token <-'.
knitr::include_graphics("screenshots/rstudio-diagnostic-tip.png")
```
RStudio will also let you know about potential problems:
```{r echo = FALSE, out.width = NULL}
```{r}
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| Script editor with the script `3 == NA`. A yellow exclamation mark
#| indicates that there may be a potential problem. Hovering over the
#| exclamation mark shows a text box with the text 'use is.na to check
#| whether expression evaluates to NA'.
knitr::include_graphics("screenshots/rstudio-diagnostic-warn.png")
```
@ -133,7 +162,14 @@ You'll either have to retype a lot of code from memory (inevitably, making mista
To encourage this behavior, I highly recommend that you instruct RStudio not to preserve your workspace between sessions:
```{r, echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out.width: "75%"
#| fig.alt: >
#| RStudio preferences window where the option 'Restore .RData into workspace
#| at startup' is not checked. Also, the option 'Save workspace to .RData
#| on exit' is set to 'Never'.
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
@ -154,13 +190,21 @@ R has a powerful notion of the **working directory**.
This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.
RStudio shows your current working directory at the top of the console:
```{r, echo = FALSE, out.width = "50%"}
```{r}
#| echo: false
#| out.width: "50%"
#| fig-alt: >
#| The Console tab shows the current working directory as
#| '~/Documents/r4ds/r4ds'.
knitr::include_graphics("screenshots/rstudio-wd.png")
```
And you can print this out in R code by running `getwd()`:
```{r eval = FALSE}
```{r}
#| eval: false
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
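
To see how the working directory relates to file paths, here is a minimal sketch in base R. The path `data/flights.csv` is a hypothetical example, not a file the book provides.

```r
# Relative paths are resolved against the working directory.
wd <- getwd()

# "data/flights.csv" is a hypothetical file used only for illustration;
# it does not need to exist for file.path() to build the path.
relative_path <- file.path("data", "flights.csv")
absolute_path <- file.path(wd, relative_path)

absolute_path
```

Reading `relative_path` from this session would look for the file at `absolute_path`, which is why the same script can behave differently when run from another directory.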
@ -171,7 +215,9 @@ Very soon now you should evolve to organizing your analytical projects into dire
**I do not recommend it**, but you can also set the working directory from within R:
```{r eval = FALSE}
```{r}
#| eval: false
setwd("/path/to/my/CoolProject")
```
@ -203,7 +249,17 @@ This is such a wise and common practice that RStudio has built-in support for th
Let's make a project for you to use while you're working through the rest of this book.
Click File \> New Project, then:
```{r, echo = FALSE, out.width = "50%"}
```{r}
#| echo: false
#| out.width: "50%"
#| fig-alt: >
#| There are three screenshots of the New Project menu. In the first screenshot,
#| the `Create Project` window is shown and 'New Directory' is selected.
#| In the second screenshot, the `Project Type` window is shown and
#| 'Empty Project' is selected. In the third screenshot, the 'Create New Project'
#| window is shown and the directory name is given as 'r4ds' and the project
#| is being created as a subdirectory of the Desktop.
knitr::include_graphics("screenshots/rstudio-project-1.png")
knitr::include_graphics("screenshots/rstudio-project-2.png")
knitr::include_graphics("screenshots/rstudio-project-3.png")
@ -215,7 +271,9 @@ If you don't store it somewhere sensible, it will be hard to find it in the futu
Once this process is complete, you'll get a new RStudio project just for this book.
Check that the "home" directory of your project is the current working directory:
```{r eval = FALSE}
```{r}
#| eval: false
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
@ -226,7 +284,10 @@ Now enter the following commands in the script editor, and save the file, callin
Next, run the complete script which will save a PDF and CSV file into your project directory.
Don't worry about the details, you'll learn them later in the book.
```{r toy-line, eval = FALSE}
```{r}
#| label: toy-line
#| eval: false
library(tidyverse)
ggplot(diamonds, aes(carat, price)) +


@ -16,7 +16,8 @@ The command palette lets you use any built-in RStudio command, as well as many a
Open the palette by pressing Cmd/Ctrl + Shift + P, then type "styler" to see all the shortcuts provided by styler.
Figure \@ref(fig:styler) shows the results.
```{r styler}
```{r}
#| label: styler
#| echo: false
#| out.width: NULL
#| fig.cap: >
@ -25,10 +26,13 @@ Figure \@ref(fig:styler) shows the results.
#| fig.alt: >
#| A screenshot showing the command palette after typing "styler", showing
#| the four styling tools provided by the package.
knitr::include_graphics("screenshots/rstudio-palette.png")
```
```{r setup}
```{r}
#| label: setup
library(tidyverse)
library(nycflights13)
```
@ -39,7 +43,9 @@ We talked briefly about names in Section \@ref(whats-in-a-name).
Remember that variable names (those created by `<-` and those created by `mutate()`) should use only lowercase letters, numbers, and `_`.
Use `_` to separate words within a name.
```{r, eval = FALSE}
```{r}
#| eval: false
# Strive for:
short_flights <- flights |> filter(air_time < 60)
@ -58,7 +64,9 @@ In general, if you have a bunch of variables that are a variation on a theme you
Put spaces on either side of mathematical operators apart from `^` (i.e., `+`, `-`, `==`, `<`, ...), and around the assignment operator (`<-`).
```{r, eval = FALSE}
```{r}
#| eval: false
# Strive for
z <- (a + b)^2 / d
@ -69,7 +77,9 @@ z<-( a + b ) ^ 2/d
Don't put spaces inside or outside parentheses for regular function calls.
Always put a space after a comma, just like in regular English.
```{r, eval = FALSE}
```{r}
#| eval: false
# Strive for
mean(x, na.rm = TRUE)
@ -81,7 +91,9 @@ It's OK to add extra spaces if it improves alignment.
For example, if you're creating multiple variables in `mutate()`, you might want to add spaces so that all the `=` line up.
This makes it easier to skim the code.
```{r, eval = FALSE}
```{r}
#| eval: false
flights |>
mutate(
speed = air_time / distance,
@ -95,7 +107,9 @@ flights |>
`|>` should always have a space before it and should typically be the last thing on a line.
This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and to get a 50,000 ft view by skimming the verbs on the left-hand side.
```{r, eval = FALSE}
```{r}
#| eval: false
# Strive for
flights |>
filter(!is.na(arr_delay), !is.na(tailnum)) |>
@ -105,14 +119,16 @@ flights |>
flights|>filter(!is.na(arr_delay), !is.na(tailnum))|>count(dest)
```
If the function you're piping into has named arguments (like `mutate()` or `summarise()`), put each argument on a new line.
If the function you're piping into has named arguments (like `mutate()` or `summarize()`), put each argument on a new line.
If the function doesn't have named arguments (like `select()` or `filter()`) keep everything on one line unless it doesn't fit, in which case you should put each argument on its own line.
```{r, eval = FALSE}
```{r}
#| eval: false
# Strive for
flights |>
group_by(tailnum) |>
summarise(
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
@ -122,18 +138,20 @@ flights |>
group_by(
tailnum
) |>
summarise(delay = mean(arr_delay, na.rm = TRUE), n = n())
summarize(delay = mean(arr_delay, na.rm = TRUE), n = n())
```
After the first step of the pipeline, indent each line by two spaces.
If you're putting each argument on its own line, indent by an extra two spaces.
Make sure `)` is on its own line, and un-indented to match the horizontal position of the function name.
```{r, eval = FALSE}
```{r}
#| eval: false
# Strive for
flights |>
group_by(tailnum) |>
summarise(
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
@ -141,14 +159,14 @@ flights |>
# Avoid
flights|>
group_by(tailnum) |>
summarise(
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
flights|>
group_by(tailnum) |>
summarise(
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
@ -157,7 +175,9 @@ flights|>
It's OK to shirk some of these rules if your pipeline fits easily on one line.
But in our collective experience, it's common for short snippets to grow longer, so you'll usually save time in the long run by starting with all the vertical space you need.
```{r, eval = FALSE}
```{r}
#| eval: false
# This fits compactly on one line
df |> mutate(y = x + 1)
@ -180,10 +200,12 @@ This means breaking up long pipelines if there are intermediate states that can
The same basic rules that apply to the pipe also apply to ggplot2; just treat `+` the same way as `|>`.
```{r, eval = FALSE}
```{r}
#| eval: false
flights |>
group_by(month) |>
summarise(
summarize(
delay = mean(arr_delay, na.rm = TRUE)
) |>
ggplot(aes(month, delay)) +
@ -193,10 +215,12 @@ flights |>
Again, if you can't fit all of the arguments to a function onto a single line, put each argument on its own line:
```{r, eval = FALSE}
```{r}
#| eval: false
flights |>
group_by(dest) |>
summarise(
summarize(
distance = mean(distance),
speed = mean(air_time / distance, na.rm = TRUE)
) |>
@ -205,13 +229,13 @@ flights |>
method = "loess",
span = 0.5,
se = FALSE,
colour = "white",
color = "white",
size = 4
) +
geom_point()
```
## Organisation
## Organization
Use comments to explain the "why" of your code, not the "how" or the "what".
If you simply describe what your code is doing in prose, you'll have to be careful to update the comment and code in tandem: if you change the code and forget to update the comment, they'll be inconsistent which will lead to confusion when you come back to your code in the future.
@ -220,7 +244,9 @@ There's no way to re-capture this knowledge from the code itself.
As your scripts get longer, use **sectioning** comments to break up your file into manageable pieces:
```{r, eval = FALSE}
```{r}
#| eval: false
# Load data --------------------------------------
# Plot data --------------------------------------
@ -228,13 +254,15 @@ As your scripts get longer, use **sectioning** comments to break up your file in
RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in Figure \@ref(fig:rstudio-sections).
```{r rstudio-sections, echo = FALSE, out.width = NULL}
```{r}
#| label: rstudio-sections
#| echo: false
#| out.width: NULL
#| fig.cap: >
#| After adding sectioning comments to your script, you can
#| easily navigate to them using the code navigation tool in the
#| bottom-left of the script editor.
knitr::include_graphics("screenshots/rstudio-nav.png")
```
@ -242,8 +270,10 @@ knitr::include_graphics("screenshots/rstudio-nav.png")
1. Restyle the following pipelines following the guidelines above.
```{r, eval = FALSE}
flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarise(n=n(),delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)
```{r}
#| eval: false
flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>0900,sched_arr_time<2000)|>group_by(flight)|>summarise(delay=mean(arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)
flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)
flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)
```