# Workflow: Pipes {#workflow-pipes}
```{r, results = "asis", echo = FALSE}
status("restructuring")
```
## Introduction

The pipe, `|>`, is a powerful tool for clearly expressing a sequence of multiple operations.
We briefly introduced it in the previous chapter, but before going much further I want to give a little more motivation, discuss another important pipe (`%>%`), and discuss one challenge of the pipe.

## Why use a pipe?
Each individual dplyr function is quite simple, so to solve complex problems you'll typically need to combine multiple verbs together.
The end of the last chapter finished with a moderately complex pipe:
```{r, eval = FALSE}
flights |>
  filter(!is.na(arr_delay), !is.na(tailnum)) |>
  group_by(tailnum) |>
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
```
Even though this pipe has four steps, it's quite easy to skim to get the main meaning: we start with flights, then filter, then group, then summarize.

What would happen if we didn't have the pipe?
We could still solve this same problem, but we'd need to nest each function call inside the next:
```{r, eval = FALSE}
summarise(
  group_by(
    filter(
      flights,
      !is.na(arr_delay), !is.na(tailnum)
    ),
    tailnum
  ),
  delay = mean(arr_delay, na.rm = TRUE),
  n = n()
)
```
Or use a bunch of intermediate variables:
```{r, eval = FALSE}
flights1 <- filter(flights, !is.na(arr_delay), !is.na(tailnum))
flights2 <- group_by(flights1, tailnum)
flights3 <- summarise(flights2,
  delay = mean(arr_delay, na.rm = TRUE),
  n = n()
)
```
While both of these forms have their uses, the pipe generally produces code that is easier to read and easier to write.
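
To make the equivalence of the three forms concrete, here is a minimal base R sketch. It assumes R 4.1.0 or later, and the `values` vector and the choice of `sqrt()` and `sum()` are ours, picked only for illustration. It also uses `quote()` to show that R's parser rewrites a piped expression into the nested call before anything is evaluated.

```r
# Assumes R >= 4.1.0, where the base pipe `|>` is available.
# `values` is a made-up vector used only for this illustration.
values <- c(1, 4, 9)

# Piped form: read left to right
piped <- values |> sqrt() |> sum()

# Nested form: read inside out
nested <- sum(sqrt(values))

# Intermediate-variable form: one name per step
roots <- sqrt(values)
stepwise <- sum(roots)

# All three compute the same value
c(piped = piped, nested = nested, stepwise = stepwise)

# The base pipe is a syntax transformation: quoting a piped
# expression shows the nested call that R actually evaluates
rewritten <- quote(values |> sqrt() |> sum())
print(rewritten)
```

Because the rewriting happens at parse time, the piped form should cost nothing extra at run time compared with the nested form.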
## magrittr and the `%>%` pipe

If you've been using the tidyverse for a while, you might be more familiar with the `%>%` pipe provided by the **magrittr** package by Stefan Milton Bache.
The magrittr package is included in the core tidyverse, so you can use `%>%` whenever you use the tidyverse:
```{r, message = FALSE}
library(tidyverse)
mtcars %>%
  group_by(cyl) %>%
  summarise(n = n())
```
For simple cases, `|>` and `%>%` behave identically.
So why do we recommend the base pipe?
Firstly, because it's part of base R, it's always available for you to use, even when you're not using the tidyverse.
Secondly, `|>` is quite a bit simpler than the magrittr pipe.
In the 7 years between the invention of `%>%` in 2014 and the inclusion of `|>` in R 4.1.0 in 2021, we honed in on the core strengths of the pipe, allowing the base implementation to jettison esoteric and relatively unimportant features.
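
As a quick check of the "simple cases" claim, the sketch below runs the same two-step pipeline with each pipe and compares the results. It assumes R 4.1.0 or later with the magrittr package installed. The use of `subset()` and `nrow()` on `mtcars` is our own choice for illustration, not something taken from the rest of this chapter.

```r
# Assumes R >= 4.1.0 and that the magrittr package is installed.
library(magrittr)

# Count the four-cylinder cars with the base pipe
n_base <- mtcars |> subset(cyl == 4) |> nrow()

# The same pipeline with the magrittr pipe
n_magrittr <- mtcars %>% subset(cyl == 4) %>% nrow()

# Both pipes pass the left-hand side as the first argument,
# so the two results should be identical
identical(n_base, n_magrittr)
```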
### Key differences

If you haven't used `%>%`, you can skip this section; if you have, read on to learn about the most important differences.

- `%>%` allows you to use `.` as a placeholder to control how the object on the left is passed to the function on the right.
  R 4.2.0 will bring `_` as a placeholder for the base pipe, with the additional restriction that the argument it is passed to must be named.
- The base pipe `|>` doesn't support any of the more complex uses of `.` such as passing `.` to more than one argument, or the special behavior when used with `.`.
- The base pipe doesn't yet provide a convenient way to use `$` (and similar functions).
  With magrittr, you can write:

  ```{r}
  mtcars %>% .$cyl
  ```

  With the base pipe you instead need the rather cryptic:

  ```{r}
  mtcars |> (`$`)(cyl)
  ```

  Fortunately, you can instead use `dplyr::pull()`:

  ```{r}
  mtcars |> pull(cyl)
  ```
- When calling a function with no arguments, `%>%` lets you drop the parentheses and write (e.g.) `x %>% ungroup`.
  The parentheses are always required with `|>`.
- With `%>%`, starting a pipe with `.`, like `. %>% group_by(x) %>% summarise(x)`, creates a function rather than immediately performing the pipe.
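
The differences listed above can be sketched in a few lines. This is an illustration under stated assumptions, not a complete comparison: it assumes magrittr is installed and R 4.2.0 or later (the `_` placeholder is a syntax error on R 4.1.x), and the model formula, the numeric vector, and the name `root_sum` are ours.

```r
# Assumes the magrittr package is installed and R >= 4.2.0.
library(magrittr)

# magrittr: `.` marks where the left-hand side should go
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)

# Base pipe: `_` plays the same role, but only for a named argument.
# On R 4.1.x this line fails to parse.
fit_base <- mtcars |> lm(mpg ~ wt, data = _)

# Both fits should have the same coefficients
all.equal(coef(fit_magrittr), coef(fit_base))

# magrittr lets you drop the parentheses after a bare function name...
total_magrittr <- c(1, 4, 9) %>% sum

# ...while the base pipe rejects the same code at parse time,
# so we parse it from a string to capture the error
base_without_parens <- try(parse(text = "c(1, 4, 9) |> sum"), silent = TRUE)
inherits(base_without_parens, "try-error")

# Starting a magrittr pipe with `.` builds a reusable function
# (`root_sum` is a hypothetical name) instead of running the pipe
root_sum <- . %>% sqrt() %>% sum()
root_sum(c(1, 4, 9))
```

If you are on R 4.1.x, one workaround for the missing placeholder is an anonymous function, for example `mtcars |> (\(d) lm(mpg ~ wt, data = d))()`.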