Bashing functions into shape

This commit is contained in:
Hadley Wickham 2022-09-20 17:03:26 -05:00
parent f0dfed0163
commit 3e88bddda3
1 changed files with 116 additions and 63 deletions


@ -30,7 +30,10 @@ The chapter concludes with some also gives you some suggestions for how to style
### Prerequisites
We'll wrap up a variety of functions from around the tidyverse.
```{r}
#| message: false
library(tidyverse)
```
@ -292,106 +295,162 @@ n_missing <- function(x) {
## Data frame functions
Tidy evaluation is hard to notice because it's the air that you breathe in this book.
Writing functions with it is hard, because you have to explicitly think about things that you haven't had to before.
These are things that the tidyverse has been designed to help you avoid thinking about so that you can focus on your analysis.
### Introduction to tidy evaluation
The second common form of function takes a data frame as the first argument, some extra arguments that say what to do with it, and returns a data frame.
There are lots of functions of this nature, but we'll focus on wrapping tidyverse functions, principally those from dplyr and tidyr.
### Tidy evaluation
Let's illustrate the problem with a very simple function: `pull_unique()`.
The goal of this function is to `pull()` the unique (distinct) values of a variable:
```{r}
mutate_y <- function(data) {
mutate(data, y = a + x)
}
```
These sorts of functions often wrap up other tidyverse functions, and so inevitably encounter the challenge of what's called tidy evaluation.
Let's illustrate the problem with a function so simple that you'd never bother writing it yourself:
```{r}
my_select <- function(df, var) {
pull_unique <- function(df, var) {
df |>
select(var)
distinct(var) |>
pull(var)
}
```
What's going to happen if I run the following code?
If we try and use it, we get an error:
```{r}
df <- tibble(var = 1, rav = 2)
df |> my_select(rav)
#| error: true
diamonds |> pull_unique(clarity)
```
The problem is one of ambiguity.
Inside the function, should `var` refer directly to the literal variable called `var` inside the data frame you've passed in, or should it refer to the code you've supplied in the `var` argument?
dplyr prefers direct over indirect, so we get an undesirable response.
To resolve this problem, we need a tool: `{{ }}`, called embracing:
To make the problem a bit clearer, we can use a made-up data frame:
```{r}
my_select <- function(df, var) {
df <- tibble(var = "var", x = "x", y = "y")
df |> pull_unique(x)
df |> pull_unique(y)
```
The problem is that regardless of the inputs, our function is always doing literally `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
This is a problem of indirection, and it arises because dplyr allows you to refer to the names of variables inside your data frame without any special treatment, so-called **tidy evaluation**.
Tidy evaluation is great 95% of the time because it makes our data analyses very concise as we never have to say which data frame a variable comes from; it's obvious from the context.
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function: we need some way to tell `distinct()` and `pull()` not to treat `var` as the name of a variable, but instead to look inside `var` for the variable we actually want to use.
The solution to this problem is **embracing**.
When you wrap a variable in `{{ }}` (embracing it), dplyr knows that you want to use the value stored inside that variable.
One way to remember what's happening is to think of `{{ }}` like looking down a tunnel --- it's going to make the function look inside of `var` rather than looking for a variable called `var`.
To make `pull_unique()` work we just need to replace `var` with `{{ var }}`:
```{r}
pull_unique <- function(df, var) {
df |>
select({{ var }})
distinct({{ var }}) |>
pull({{ var }})
}
df |> my_select(rav)
diamonds |> pull_unique(clarity)
```
This tells dplyr you want to select not `var` directly, but use the contents of `var` that the user has provided.
One way to remember what's happening is to think of `{{ }}` like looking down a tunnel --- it's going to look inside of `var`.
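To see the difference embracing makes, here is a minimal side-by-side sketch on the made-up data frame. It assumes only that dplyr is attached; `pull_unique_literal()` is a hypothetical name, used here just to keep the non-embraced version around for comparison.

```{r}
library(dplyr)

df <- tibble(var = "var", x = "x", y = "y")

# Without embracing: `var` is taken literally, so the column named `var`
# is used no matter what the caller supplies (hypothetical helper name)
pull_unique_literal <- function(df, var) {
  df |>
    distinct(var) |>
    pull(var)
}

# With embracing: the function looks inside `var` for the caller's column
pull_unique <- function(df, var) {
  df |>
    distinct({{ var }}) |>
    pull({{ var }})
}

df |> pull_unique_literal(x) # uses the `var` column, not `x`
df |> pull_unique(x)
df |> pull_unique(y)
```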
### When to embrace?
There's much more to learn about tidy evaluation, but this should be enough to get you started writing functions.
So the art of wrapping tidyverse functions is basically figuring out which arguments need to be embraced.
Fortunately this is pretty easy because you can look it up from the documentation 😄.
There are two terms to look for in the docs:
### Which arguments need embracing?
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` which do computation with variables.
- **Tidy-selections**: this is used for functions like `select()`, `relocate()`, and `rename()` that work with groups of variables.
Not every argument needs to be embraced --- only those arguments that are evaluated in the context of the data.
These fall into two main groups:
TODO: something about ...
- Arguments that select variables, like those to `select()`, `relocate()`, and `rename()`.
The technical name for these arguments is "tidy-select" arguments, and if you look at the documentation you'll see these arguments thus labelled.
Your intuition for many common functions should be pretty good --- think about whether it's ok to compute `x + 1` or select multiple variables with `a:x`.
There are some that are harder to tell because you usually use them with a single variable, so it's hard to tell whether they're data-masking or tidy-select:
- Arguments that compute with variables, like those to `arrange()`, `filter()`, and `summarise()`.
The technical name for these arguments is "data-masking".
It's usually easier to tell which is which, but there are some that are harder because you usually supply just a single variable name.
- All the arguments to `aes()` are computing arguments because you can write `aes(x * 2, y / 10)`, etc.
- The arguments to `group_by()`, `count()`, and `distinct()` are computing arguments because they can all create new variables.
- The `names_from` argument to `pivot_wider()` is a selecting argument because you can take the names from multiple variables with `names_from = c(x, y, z)`.
- It's not a data frame function, but ggplot2's `aes()` uses data-masking because you can write `aes(x * 2, y / 10)`, etc.
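One rough way to build this intuition is to try both kinds of input on a small example. The sketch below uses a hypothetical three-column tibble: data-masking arguments accept computations on columns, while tidy-select arguments accept selections such as ranges.

```{r}
library(dplyr)

# Hypothetical toy data, for illustration only
df <- tibble(a = 1:3, b = 4:6, x = 7:9)

# Data-masking: you can compute with the columns
df |> filter(x + 1 > 8)
df |> count(big = x > 7) # count() can create a new variable on the fly

# Tidy-select: you can pick several columns at once
df |> select(a:x)
df |> relocate(x, .before = a)
```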
### Selection arguments
In the next two sections we'll explore the sorts of handy functions you might write for data-masking and tidy-select arguments.
In @sec-across you'll learn more about `across()` which is a really powerful selecting function that you can use inside of computing arguments.
### Data-masking examples
### Computing arguments
If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:
```{r}
my_summarise2 <- function(data, expr) {
summary6 <- function(data, var) {
data %>% summarise(
mean = mean({{ expr }}),
sum = sum({{ expr }}),
n = n()
min = min({{ var }}, na.rm = TRUE),
mean = mean({{ var }}, na.rm = TRUE),
median = median({{ var }}, na.rm = TRUE),
max = max({{ var }}, na.rm = TRUE),
n = n(),
n_miss = sum(is.na({{ var }}))
)
}
diamonds |> summary6(carat)
```
A common use case is to modify `count()`, for example to compute percents:
The nice thing about this function is that, because it wraps `summarise()`, you can use it on grouped data:
```{r}
diamonds |>
group_by(cut) |>
summary6(carat)
```
Because the arguments to `summarise()` are data-masking, the `var` argument to `summary6()` is also data-masking.
That means you can also summarize computed variables:
```{r}
diamonds |>
group_by(cut) |>
summary6(log10(carat))
```
To summarize multiple variables, you'll need to wait until @sec-across, where you'll learn about `across()`, which lets you repeat the same computations with multiple variables.
Another common helper function is to write a version of `count()` that also computes proportions:
```{r}
# https://twitter.com/Diabb6/status/1571635146658402309
count_pct <- function(df, var) {
count_prop <- function(df, var, sort = FALSE) {
df |>
count({{ var }}, sort = TRUE) |>
mutate(pct = n / sum(n))
count({{ var }}, sort = sort) |>
mutate(prop = n / sum(n))
}
mtcars |> count_pct(cyl)
diamonds |> count_prop(clarity)
```
Or to pivot the output:
Note that this function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()`, which uses data-masking for all variables in `...`.
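As a small sketch of what that means in practice, the block below repeats the `count_prop()` definition so it runs on its own, and uses the built-in `mtcars` data purely as an assumed stand-in for your own data frame. The embraced `var` can be a column or a computation, while `sort` is passed along as an ordinary value.

```{r}
library(dplyr)

count_prop <- function(df, var, sort = FALSE) {
  df |>
    count({{ var }}, sort = sort) |>
    mutate(prop = n / sum(n))
}

# `cyl` is embraced, so it is looked up in the data;
# `sort` is a regular argument and needs no special treatment
mtcars |> count_prop(cyl, sort = TRUE)

# Because `var` is data-masking, a computation works too
mtcars |> count_prop(wt > 3)
```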
Or maybe you want to find the unique values for a variable for a subset of the data:
```{r}
#| eval: false
unique_where <- function(df, condition, var) {
df |>
filter({{ condition }}) |>
distinct({{ var }}) |>
arrange({{ var }}) |>
pull()
}
nycflights13::flights |> unique_where(month == 12, dest)
```
### Tidy-select arguments
```{r}
#| include: false
pick <- function(cols) {
across({{ cols }})
}
```
```{r}
# https://twitter.com/drob/status/1571879373053259776
enrich_join <- function(x, y, y_vars = everything(), by = NULL) {
left_join(x, y |> select({{ y_vars }}), by = by)
}
```
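A possible use of `enrich_join()`, sketched with two hypothetical toy tables (`orders` and `customers` are made up for illustration). Because `y_vars` is passed to `select()`, it is a tidy-select argument, so you can supply several columns with `c()`. Note that the selection has to include the join key, otherwise `left_join()` has nothing to match on.

```{r}
library(dplyr)

enrich_join <- function(x, y, y_vars = everything(), by = NULL) {
  left_join(x, y |> select({{ y_vars }}), by = by)
}

# Hypothetical toy tables, for illustration only
orders <- tibble(id = c(1, 2, 3), qty = c(5, 1, 2))
customers <- tibble(
  id = c(1, 2, 3),
  name = c("Ana", "Bo", "Cy"),
  phone = c("555-0100", "555-0101", "555-0102")
)

# Bring over only `name` (plus the join key `id`) from `customers`
orders |> enrich_join(customers, c(id, name), by = "id")

# With the default `y_vars = everything()`, all columns of `customers` come along
orders |> enrich_join(customers, by = "id")
```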
Another useful helper is to make a "wide" count, where you make a 2d table of counts.
```{r}
# Inspired by https://twitter.com/pollicipes/status/1571606508944719876
count_wide <- function(data, rows, cols) {
data |>
@ -404,15 +463,9 @@ mtcars |> count_wide(c(vs, am), cyl)
This requires using `pick()` so that you can use tidy-select inside a data-masking function (`count()`).
```{r}
# https://twitter.com/JustinTPriest/status/1571614088329048064
# https://twitter.com/FBpsy/status/1571909992139362304
# https://twitter.com/ekholm_e/status/1571900197894078465
### Learning more
enrich_join <- function(x, y, ..., by = NULL) {
left_join(x, y %>% select(...), by = by)
}
```
Once you have the basics under your belt, you can learn more about the full range of tidy evaluation possibilities by reading `vignette("programming", package = "dplyr")`.
## Style