Describe pull in Base R chapter

Fixes #1108
Hadley Wickham 2022-11-21 08:36:55 -06:00
parent c89aa20627
commit bdc3555b9a
2 changed files with 47 additions and 31 deletions

@ -219,7 +219,7 @@ In this section, we'll show you how to use `[[` and `$` to pull columns out of a
### Data frames
`[[` and `$` can be used like `pull()` to extract columns out of a data frame.
`[[` and `$` can be used to extract columns out of a data frame.
`[[` can access by position or by name, and `$` is specialized for access by name:
```{r}
@ -255,6 +255,16 @@ max(diamonds$carat)
levels(diamonds$cut)
```
dplyr also provides an equivalent to `[[`/`$` that we didn't mention in @sec-data-transform: `pull()`.
`pull()` takes either a variable name or variable position and returns just that column.
That means we could rewrite the above code to use the pipe:
```{r}
diamonds |> pull(carat) |> mean()
diamonds |> pull(cut) |> levels()
```
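As a minimal sketch of the equivalence described above, the following compares `pull()` with `[[` and `$` on a small made-up tibble. The tibble `df` and its columns `x` and `y` are assumptions for illustration and are not part of the chapter.

```r
library(dplyr)

# A small made-up tibble; the column names are placeholders
df <- tibble(x = c(1, 2, 3), y = c("a", "b", "c"))

by_name     <- df |> pull(x)   # select by variable name
by_position <- df |> pull(1)   # select by position (1 = first column)
from_right  <- df |> pull(-1)  # negative positions count from the right

# Each call returns the same vector as the base R equivalent
identical(by_name, df$x)
identical(by_position, df[["x"]])
identical(from_right, df[[2]])
```

All three comparisons should return `TRUE`, since `pull()` returns the bare column rather than a one-column data frame.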
### Tibbles
There are a couple of important differences between tibbles and base `data.frame`s when it comes to `$`.
@ -537,3 +547,4 @@ This often makes life easier for programming and so becomes more important as yo
This chapter concludes the programming section of the book.
You've made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can *program* in R.
We hope these chapters have sparked your interest in programming and that you're looking forward to learning more outside of this book.

@ -384,14 +384,14 @@ With this theory under your belt, we'll then show you a bunch of examples to ill
### Indirection and tidy evaluation
When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection.
Let's illustrate the problem with a very simple function: `pull_unique()`.
The goal of this function is to `pull()` the unique (distinct) values of a variable:
Let's illustrate the problem with a very simple function: `grouped_mean()`.
The goal of this function is to compute the mean of `mean_var` grouped by `group_var`:
```{r}
pull_unique <- function(df, var) {
grouped_mean <- function(df, group_var, mean_var) {
df |>
distinct(var) |>
pull(var)
group_by(group_var) |>
summarize(mean(mean_var))
}
```
@ -399,38 +399,45 @@ If we try and use it, we get an error:
```{r}
#| error: true
diamonds |> pull_unique(clarity)
diamonds |> grouped_mean(cut, carat)
```
To make the problem a bit clearer, we can use a made-up data frame:
```{r}
df <- tibble(var = "var", x = "x", y = "y")
df |> pull_unique(x)
df |> pull_unique(y)
df <- tibble(
mean_var = 1,
group_var = "g",
group = 1,
x = 10,
y = 100
)
df |> grouped_mean(group, x)
df |> grouped_mean(group, y)
```
Regardless of how we call `pull_unique()` it always does `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
Regardless of how we call `grouped_mean()`, it always does `df |> group_by(group_var) |> summarize(mean(mean_var))`, instead of `df |> group_by(group) |> summarize(mean(x))` or `df |> group_by(group) |> summarize(mean(y))`.
This is a problem of indirection, and it arises because dplyr uses **tidy evaluation** to allow you to refer to the names of variables inside your data frame without any special treatment.
Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
Here we need some way to tell `distinct()` and `pull()` not to treat `var` as the name of a variable, but instead look inside `var` for the variable we actually want to use.
Here we need some way to tell `group_by()` and `summarize()` not to treat `group_var` and `mean_var` as the names of variables, but instead to look inside them for the variables we actually want to use.
Tidy evaluation includes a solution to this problem called **embracing** 🤗.
Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name.
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a dplyr function look inside of `var` rather than looking for a variable called `var`.
So to make `pull_unique()` work we need to replace `var` with `{{ var }}`:
So to make `grouped_mean()` work we need to surround `group_var` and `mean_var` with `{{ }}`:
```{r}
pull_unique <- function(df, var) {
grouped_mean <- function(df, group_var, mean_var) {
df |>
distinct({{ var }}) |>
pull({{ var }})
group_by({{ group_var }}) |>
summarize(mean({{ mean_var }}))
}
diamonds |> pull_unique(clarity)
diamonds |> grouped_mean(cut, carat)
```
Success!
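To check that embracing really resolves the indirection, here is a small sketch that runs the embraced function on a made-up tibble that still contains columns literally named `mean_var` and `group_var`. The data values and the summary column name `mean` are assumptions added for illustration.

```r
library(dplyr)

grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    # Naming the summary column `mean` is an illustrative choice
    summarize(mean = mean({{ mean_var }}))
}

# Decoy columns `mean_var` and `group_var` are present on purpose
df <- tibble(
  mean_var = 1,
  group_var = "g",
  group = c(1, 1, 2),
  x = c(10, 20, 30)
)

result <- df |> grouped_mean(group, x)
result
```

The result has one row per value of `group` and averages `x`, so the decoy columns are ignored: the function now uses the columns the caller supplied.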
@ -511,8 +518,7 @@ unique_where <- function(df, condition, var) {
df |>
filter({{ condition }}) |>
distinct({{ var }}) |>
arrange({{ var }}) |>
pull({{ var }})
arrange({{ var }})
}
# Find all the destinations in December
@ -521,7 +527,7 @@ flights |> unique_where(month == 12, dest)
flights |> unique_where(tailnum == "N14228", month)
```
Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()`, `arrange()`, and `pull()`.
Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.
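Without the final `pull()`, `unique_where()` returns a one-column tibble rather than a vector. If a vector is needed, one option is to call `pull()` on the result, as in this sketch; the `trips` tibble is a made-up stand-in for `flights` and is an assumption for illustration.

```r
library(dplyr)

unique_where <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>
    distinct({{ var }}) |>
    arrange({{ var }})
}

# Made-up data standing in for the flights dataset
trips <- tibble(
  month = c(12, 12, 1, 12),
  dest = c("LAX", "BOS", "SFO", "LAX")
)

dec_dests <- trips |> unique_where(month == 12, dest)  # one-column tibble
dec_vector <- dec_dests |> pull(dest)                  # plain character vector
```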
We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.
@ -890,21 +896,20 @@ This makes it easier to see the hierarchy in your code by skimming the left-hand
```{r}
# Missing extra two spaces
pull_unique <- function(df, var) {
df |>
distinct({{ var }}) |>
pull({{ var }})
density <- function(colour, facets, binwidth = 0.1) {
diamonds |>
ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
geom_freqpoly(binwidth = binwidth) +
facet_wrap(vars({{ facets }}))
}
# Pipe indented incorrectly
pull_unique <- function(df, var) {
df |>
distinct({{ var }}) |>
pull({{ var }})
density <- function(colour, facets, binwidth = 0.1) {
diamonds |>
ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
geom_freqpoly(binwidth = binwidth) +
facet_wrap(vars({{ facets }}))
}
# Missing {} and all one line
pull_unique <- function(df, var) df |> distinct({{ var }}) |> pull({{ var }})
```
As you can see we recommend putting extra spaces inside of `{{ }}`.