Eliminate pipes chapter (#1332)

More pipes into transform chapter, and reflow chapter summaries.
This commit is contained in:
Hadley Wickham 2023-03-01 13:34:26 -06:00 committed by GitHub
parent 844879979b
commit 0c2971b9d1
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
7 changed files with 146 additions and 207 deletions


@@ -29,11 +29,10 @@ book:
- data-visualize.qmd
- workflow-basics.qmd
- data-transform.qmd
- workflow-pipes.qmd
- data-tidy.qmd
- workflow-style.qmd
- data-import.qmd
- data-tidy.qmd
- workflow-scripts.qmd
- data-import.qmd
- workflow-help.qmd
- part: visualize.qmd


@@ -504,5 +504,5 @@ In this chapter, you've learned how to load CSV files with `read_csv()` and to d
You've learned how CSV files work, some of the problems you might encounter, and how to overcome them.
We'll come to data import a few times in this book: @sec-import-spreadsheets from Excel and Google Sheets, @sec-import-databases will show you how to load data from databases, @sec-arrow from parquet files, @sec-rectangling from JSON, and @sec-scraping from websites.
Now that you're writing a substantial amount of R code, it's time to learn more about organizing your code into files and directories.
In the next chapter, you'll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.
We're just about at the end of this section of the book, but there's one important last topic to cover: how to get help.
So in the next chapter, you'll learn some good places to look for help, how to create a reprex to maximize your chances of getting good help, and some general advice on keeping up with the world of R.


@@ -586,4 +586,5 @@ The examples we used here are just a selection of those from `vignette("pivot",
If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software.
In the next chapter, we'll pivot back to workflow to discuss the importance of code style, keeping your code "tidy" (ha!) in order to make it easy for you and others to read and understand your code.
Now that you're writing a substantial amount of R code, it's time to learn more about organizing your code into files and directories.
In the next chapter, you'll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.


@@ -14,7 +14,7 @@ Often you'll need to create some new variables or summaries to see the most impo
You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the **dplyr** package and a new dataset on flights that departed New York City in 2013.
The goal of this chapter is to give you an overview of all the key tools for transforming a data frame.
We'll start with functions that operate on rows and then columns of a data frame.
We'll start with functions that operate on rows and then columns of a data frame, then circle back to talk more about the pipe, an important tool that you use to combine verbs.
We will then introduce the ability to work with groups.
We will end the chapter with a case study that showcases these functions in action and we'll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).
@@ -71,8 +71,8 @@ But before we discuss their individual differences, it's worth stating what they
3. The result is always a new data frame.
Because the first argument is a data frame and the output is a data frame, dplyr verbs work well with the pipe, `|>`.
The pipe takes the thing on its left and passes it along to the function on its right so that `x |> f(y)` is equivalent to `f(x, y)`, and `x |> f(y) |> g(z)` is equivalent to `g(f(x, y), z)`.
Since each verb is quite simple, solving complex problems will usually require combining multiple verbs, and we'll do so with the pipe, `|>`.
We'll discuss the pipe more in @sec-the-pipe, but in brief, the pipe takes the thing on its left and passes it along to the function on its right so that `x |> f(y)` is equivalent to `f(x, y)`, and `x |> f(y) |> g(z)` is equivalent to `g(f(x, y), z)`.
The easiest way to pronounce the pipe is "then".
That makes it possible to get a sense of the following code even though you haven't yet learned the details:
@@ -87,11 +87,8 @@ flights |>
)
```
The code starts with the `flights` dataset, then filters it, then groups it, then summarizes it.
We'll come back to the pipe and its alternatives in @sec-pipes.
dplyr's verbs are organized into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**.
In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to verbs that work on tables in @sec-joins.
In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to the join verbs that work on tables in @sec-joins.
Let's dive in!
## Rows
@@ -191,15 +188,6 @@ flights |>
arrange(desc(dep_delay))
```
You can combine `arrange()` and `filter()` to solve more complex problems.
For example, we could filter for the flights that left roughly on time, then arrange the results to see which flights were most delayed on arrival:
```{r}
flights |>
filter(dep_delay <= 10 & dep_delay >= -10) |>
arrange(desc(arr_delay))
```
### `distinct()`
`distinct()` finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows.
@@ -281,6 +269,7 @@ You can also use `.after` to add after a variable, and in both `.before` and `.a
For example, we could add the new variables after `day`:
```{r}
#| results: false
flights |>
mutate(
gain = dep_delay - arr_delay,
@@ -290,9 +279,11 @@ flights |>
```
Alternatively, you can control which variables are kept with the `.keep` argument.
A particularly useful argument is `"used"` which allows you to see the inputs and outputs from your calculations:
A particularly useful argument is `"used"` which allows you to see the inputs and outputs from your calculations.
For example, the following output will contain only the variables `dep_delay`, `arr_delay`, `air_time`, `gain`, `hours`, and `gain_per_hour`.
```{r}
#| results: false
flights |>
mutate(
gain = dep_delay - arr_delay,
@@ -309,23 +300,37 @@ In this situation, the first challenge is often just focusing on the variables y
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
`select()` is not terribly useful with the `flights` data because we only have 19 variables, but you can still get the general idea of how it works:
```{r}
# Select columns by name
flights |>
select(year, month, day)
- Select columns by name:
# Select all columns between year and day (inclusive)
flights |>
select(year:day)
```{r}
#| results: false
flights |>
select(year, month, day)
```
# Select all columns except those from year to day (inclusive)
flights |>
select(!year:day)
- Select all columns between year and day (inclusive):
# Select all columns that are characters
flights |>
select(where(is.character))
```
```{r}
#| results: false
flights |>
select(year:day)
```
- Select all columns except those from year to day (inclusive):
```{r}
#| results: false
flights |>
select(!year:day)
```
- Select all columns that are characters:
```{r}
#| results: false
flights |>
select(where(is.character))
```
There are a number of helper functions you can use within `select()`:
@@ -372,6 +377,8 @@ flights |>
But you can use the same `.before` and `.after` arguments as `mutate()` to choose where to put them:
```{r}
#| results: false
flights |>
relocate(year:dep_time, .after = time_hour)
flights |>
@@ -423,6 +430,93 @@ ggplot(flights, aes(x = air_time - airtime2)) + geom_histogram()
6. Rename `air_time` to `air_time_min` to indicate units of measurement and move it to the beginning of the data frame.
## The pipe {#sec-the-pipe}
We've shown you simple examples of the pipe above, but its real power arises when you start to combine multiple verbs.
For example, imagine that you wanted to find the fastest flights to Houston's IAH airport: you need to combine `filter()`, `mutate()`, `select()`, and `arrange()`:
```{r}
flights |>
filter(dest == "IAH") |>
mutate(speed = distance / air_time * 60) |>
select(year:day, dep_time, carrier, flight, speed) |>
arrange(desc(speed))
```
Even though this pipe has four steps, it's easy to skim because the verbs come at the start of each line: start with the `flights` data, then filter, then mutate, then select, then arrange.
What would happen if we didn't have the pipe?
We could nest each function call inside the previous call:
```{r}
#| results: false
arrange(
select(
mutate(
filter(
flights,
dest == "IAH"
),
speed = distance / air_time * 60
),
year:day, dep_time, carrier, flight, speed
),
desc(speed)
)
```
Or we could use a bunch of intermediate variables:
```{r}
#| results: false
flights1 <- filter(flights, dest == "IAH")
flights2 <- mutate(flights1, speed = distance / air_time * 60)
flights3 <- select(flights2, year:day, dep_time, carrier, flight, speed)
arrange(flights3, desc(speed))
```
While both forms have their time and place, the pipe generally produces data analysis code that is easier to write and read.
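The three forms really do compute the same thing. Here's a minimal base-R sketch with a toy vector (the data and names are illustrative, not from the flights example):

```r
x <- c(5, 1, 4, 2, 3)

piped <- x |> sort() |> head(3)  # reads left to right: take x, then sort, then head
nested <- head(sort(x), 3)       # same computation, read inside out
tmp <- sort(x)                   # same again, via an intermediate variable
step_by_step <- head(tmp, 3)

identical(piped, nested) && identical(piped, step_by_step)
#> [1] TRUE
```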
To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M.
You'll need to make one change to your RStudio options to use `|>` instead of `%>%` as shown in @fig-pipe-options; more on `%>%` shortly.
```{r}
#| label: fig-pipe-options
#| echo: false
#| fig-cap: >
#| To insert `|>`, make sure the "Use native pipe operator" option is checked.
#| fig-alt: >
#| Screenshot showing the "Use native pipe operator" option which can
#| be found on the "Editing" panel of the "Code" options.
knitr::include_graphics("screenshots/rstudio-pipe-options.png")
```
::: callout-note
## magrittr
If you've been using the tidyverse for a while, you might be familiar with the `%>%` pipe provided by the **magrittr** package.
The magrittr package is included in the core tidyverse, so you can use `%>%` whenever you load the tidyverse:
```{r}
#| eval: false
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarize(n = n())
```
For simple cases, `|>` and `%>%` behave identically.
So why do we recommend the base pipe?
Firstly, because it's part of base R, it's always available for you to use, even when you're not using the tidyverse.
Secondly, `|>` is quite a bit simpler than `%>%`: in the time between the invention of `%>%` in 2014 and the inclusion of `|>` in R 4.1.0 in 2021, we gained a better understanding of the pipe.
This allowed the base implementation to jettison infrequently used and less important features.
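For example, `|>` works in a completely fresh R session (4.1.0 or later) with no packages attached:

```r
# No library() call needed: |> is part of the language itself
c(3, 1, 2) |> sort() |> rev()
#> [1] 3 2 1
```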
:::
## Groups
So far you've learned about functions that work with rows and columns.
@@ -459,7 +553,7 @@ flights |>
Uhoh!
Something has gone wrong and all of our results are `NA` (pronounced "N-A"), R's symbol for a missing value.
We'll come back to discuss missing values in @sec-missing-values, but for now we'll remove them by using `na.rm = TRUE`:
We'll come back to discuss missing values in detail in @sec-missing-values, but for now we'll remove them by using `na.rm = TRUE`:
```{r}
flights |>
@@ -502,13 +596,7 @@ flights |>
slice_max(arr_delay, n = 1)
```
This is similar to computing the max delay with `summarize()`, but you get the whole row instead of the single summary:
```{r}
flights |>
group_by(dest) |>
summarize(max_delay = max(arr_delay, na.rm = TRUE))
```
This is similar to computing the max delay with `summarize()`, but you get the whole row instead of the single summary.
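The whole-row vs. single-summary distinction can be sketched in base R with a toy data frame (the column names here are illustrative):

```r
df <- data.frame(dest = c("A", "A", "B"), delay = c(10, 30, 20))

# Whole row with the max delay per dest (what slice_max() gives you):
do.call(rbind, lapply(split(df, df$dest), function(d) d[which.max(d$delay), ]))

# Single summary value per dest (what summarize() gives you):
aggregate(delay ~ dest, df, max)
```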
### Grouping by multiple variables
@@ -777,4 +865,4 @@ The tools are roughly grouped into three categories: those that manipulate the r
In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with the individual variable.
We'll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.
For now, we'll pivot back to workflow, and in the next chapter you'll learn more about the pipe, `|>`, why we recommend it, and a little of the history that led from magrittr's `%>%` to base R's `|>`.
In the next chapter, we'll pivot back to workflow to discuss the importance of code style, keeping your code well organized in order to make it easy for you and others to read and understand your code.


@@ -1,157 +0,0 @@
# Workflow: pipes {#sec-workflow-pipes}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
The pipe, `|>`, is a powerful tool for clearly expressing a sequence of operations that transform an object.
We briefly introduced pipes in the previous chapter, but before going further, we want to give a few more details and discuss `%>%`, a predecessor to `|>`.
To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M.
You'll need to make one change to your RStudio options to use `|>` instead of `%>%` as shown in @fig-pipe-options; more on `%>%` shortly.
```{r}
#| label: fig-pipe-options
#| echo: false
#| fig-cap: >
#| To insert `|>`, make sure the "Use native pipe operator" option is checked.
#| fig-alt: >
#| Screenshot showing the "Use native pipe operator" option which can
#| be found on the "Editing" panel of the "Code" options.
knitr::include_graphics("screenshots/rstudio-pipe-options.png")
```
## Why use a pipe?
Each individual dplyr verb is quite simple, so solving complex problems typically requires combining multiple verbs.
For example, the last chapter finished with a moderately complex pipe:
```{r}
#| eval: false
flights |>
filter(!is.na(arr_delay), !is.na(tailnum)) |>
group_by(tailnum) |>
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
```
Even though this pipe has four steps, it's easy to skim because the verbs come at the start of each line: start with the `flights` data, then filter, then group, then summarize.
What would happen if we didn't have the pipe?
We could nest each function call inside the previous call:
```{r}
#| eval: false
summarize(
group_by(
filter(
flights,
!is.na(arr_delay), !is.na(tailnum)
),
tailnum
),
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
```
Or we could use a bunch of intermediate variables:
```{r}
#| eval: false
flights1 <- filter(flights, !is.na(arr_delay), !is.na(tailnum))
flights2 <- group_by(flights1, tailnum)
flights3 <- summarize(flights2,
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
```
While both forms have their time and place, the pipe generally produces data analysis code that is easier to write and read.
## magrittr and the `%>%` pipe
If you've been using the tidyverse for a while, you might be familiar with the `%>%` pipe provided by the **magrittr** package.
The magrittr package is included in the core tidyverse, so you can use `%>%` whenever you load the tidyverse:
```{r}
#| message: false
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarize(n = n())
```
For simple cases, `|>` and `%>%` behave identically.
So why do we recommend the base pipe?
Firstly, because it's part of base R, it's always available for you to use, even when you're not using the tidyverse.
Secondly, `|>` is quite a bit simpler than `%>%`: in the time between the invention of `%>%` in 2014 and the inclusion of `|>` in R 4.1.0 in 2021, we gained a better understanding of the pipe.
This allowed the base implementation to jettison infrequently used and less important features.
## `|>` vs. `%>%`
While `|>` and `%>%` behave identically for simple cases, there are a few crucial differences.
These are most likely to affect you if you're a long-term user of `%>%` who has taken advantage of some of the more advanced features.
But they're still good to know about even if you've never used `%>%` because you're likely to encounter some of them when reading wild-caught code.
- By default, the pipe passes the object on its left-hand side to the first argument of the function on the right-hand side.
`%>%` allows you to change the placement with a `.` placeholder.
For example, `x %>% f(1)` is equivalent to `f(x, 1)` but `x %>% f(1, .)` is equivalent to `f(1, x)`.
R 4.2.0 added a `_` placeholder to the base pipe, with one additional restriction: the argument has to be named.
For example, `x |> f(1, y = _)` is equivalent to `f(1, y = x)`.
- The `|>` placeholder is deliberately simple and can't replicate many features of the `%>%` placeholder: you can't pass it to multiple arguments, and it doesn't have any special behavior when the placeholder is used inside another function.
For example, `df %>% split(.$var)` is equivalent to `split(df, df$var)` and `df %>% {split(.$x, .$y)}` is equivalent to `split(df$x, df$y)`.
With `%>%`, you can use `.` on the left-hand side of operators like `$`, `[[`, `[` (which you'll learn about in @sec-subset-many), so you can extract a single column from a data frame with (e.g.) `mtcars %>% .$cyl`.
A future version of R may add similar support for `|>` and `_`.
For the special case of extracting a column out of a data frame, you can also use `dplyr::pull()`:
```{r}
mtcars |> pull(cyl)
```
- `%>%` allows you to drop the parentheses when calling a function with no other arguments; `|>` always requires the parentheses.
- `%>%` allows you to start a pipe with `.` to create a function rather than immediately executing the pipe; this is not supported by the base pipe.
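To make the placeholder difference concrete, here's a small sketch using only base R (R 4.2.0 or later for `_`); the `%>%` equivalent is shown as a comment since it would require magrittr:

```r
# With magrittr: 4 %>% rep("a", times = .)  is  rep("a", times = 4)
# With base R, _ works too, but only as a *named* argument:
4 |> rep("a", times = _)
#> [1] "a" "a" "a" "a"
```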
Luckily there's no need to commit entirely to one pipe or the other --- you can use the base pipe for the majority of cases where it's sufficient and use the magrittr pipe when you really need its special features.
## `|>` vs. `+`
Sometimes we'll turn the end of a data transformation pipeline into a plot.
Watch for the transition from `|>` to `+`.
We wish this transition wasn't necessary, but unfortunately, ggplot2 was created before the pipe was discovered.
```{r}
#| eval: false
diamonds |>
count(cut, clarity) |>
ggplot(aes(x = clarity, y = cut, fill = n)) +
geom_tile()
```
## Summary
In this chapter, you've learned more about the pipe: why we recommend it and some of the history that led to `|>`.
The pipe is important because you'll use it again and again throughout your analysis, but hopefully, it will quickly become invisible, and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it.
In the next chapter, we switch back to data science tools, learning about tidy data.
Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse.
This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions.
Of course, life is never easy, and most datasets you encounter in the wild will not already be tidy.
So we'll also teach you how to use the tidyr package to tidy your untidy data.


@@ -352,4 +352,7 @@ In summary, scripts and projects give you a solid workflow that will serve you w
- Only ever use relative paths, not absolute paths.
Then everything you need is in one place and cleanly separated from all the other projects that you are working on.
Next up, you'll learn about how to get help and how to ask good coding questions.
So far, we've worked with datasets bundled inside of R packages.
This makes it easier to get some practice on pre-prepared data, but obviously your data won't be available in this way.
So in the next chapter, you're going to learn how to load data from disk into your R session using the readr package.


@@ -242,6 +242,9 @@ flights |>
geom_point()
```
Watch for the transition from `|>` to `+`.
We wish this transition wasn't necessary, but unfortunately, ggplot2 was written before the pipe was discovered.
## Sectioning comments
As your scripts get longer, you can use **sectioning** comments to break up your file into manageable pieces:
@@ -289,6 +292,8 @@ In this chapter, you've learned the most important principles of code style.
These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, you'll see how important a consistent style is.
And don't forget about the styler package: it's a great way to quickly improve the quality of poorly styled code.
So far, we've worked with datasets bundled inside of R packages.
This makes it easier to get some practice on pre-prepared data, but obviously your data won't be available in this way.
So in the next chapter, you're going to learn how to load data from disk into your R session using the readr package.
In the next chapter, we switch back to data science tools, learning about tidy data.
Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse.
This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions.
Of course, life is never easy, and most datasets you encounter in the wild will not already be tidy.
So we'll also teach you how to use the tidyr package to tidy your untidy data.