Bashing functions into shape
parent
f0dfed0163
commit
3e88bddda3
179
functions.qmd
|
@ -30,7 +30,10 @@ The chapter concludes with some also gives you some suggestions for how to style
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
|
We'll wrap up a variety of functions from around the tidyverse.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
#| message: false
|
||||||
library(tidyverse)
|
library(tidyverse)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
@ -292,106 +295,162 @@ n_missing <- function(x) {
|
||||||
|
|
||||||
## Data frame functions
|
## Data frame functions
|
||||||
|
|
||||||
Tidy evaluation is hard to notice because it's the air that you breathe in this book.
|
|
||||||
Writing functions with it is hard, because you have to explicitly think about things that you haven't had to before.
|
|
||||||
Things that the tidyverse has been designed to help you avoid thinking about so that you can focus on your analysis.
|
|
||||||
|
|
||||||
### Introduction to tidy evaluation
|
|
||||||
|
|
||||||
The second common form of function takes a data frame as the first argument, some extra arguments that say what to do with it, and returns a data frame.
|
The second common form of function takes a data frame as the first argument, some extra arguments that say what to do with it, and returns a data frame.
|
||||||
|
There are lots of functions of this nature, but we'll focus on wrapping tidyverse functions, principally those from dplyr and tidyr.
|
||||||
|
|
||||||
|
### Tidy evaluation
|
||||||
|
|
||||||
|
Let's illustrate the problem with a very simple function: `pull_unique()`.
|
||||||
|
The goal of this function is to `pull()` the unique (distinct) values of a variable:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
mutate_y <- function(data) {
|
pull_unique <- function(df, var) {
|
||||||
mutate(data, y = a + x)
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
These sorts of functions often wrap up other tidyverse functions, and so inevitably encounter the challenge of what's called tidy evaluation.
|
|
||||||
Let's illustrate the problem with a function so simple that you'd never bother writing it yourself:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
my_select <- function(df, var) {
|
|
||||||
df |>
|
df |>
|
||||||
select(var)
|
distinct(var) |>
|
||||||
|
pull(var)
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
What's going to happen if I run the following code?
|
If we try and use it, we get an error:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df <- tibble(var = 1, rav = 2)
|
#| error: true
|
||||||
df |> my_select(rav)
|
diamonds |> pull_unique(clarity)
|
||||||
```
|
```
|
||||||
|
|
||||||
The problem is one of ambiguity.
|
To make the problem a bit more clear we can use a made up data frame:
|
||||||
Inside the function, should `var` refer directly to the literal variable called `var` inside the data frame you've passed in, or should it refer to the code you've supplied in the `var` argument?
|
|
||||||
dplyr prefers direct over indirect, so we get an undesirable response.
|
|
||||||
To resolve this problem, we need a tool: `{{ }}`, called embracing:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
my_select <- function(df, var) {
|
df <- tibble(var = "var", x = "x", y = "y")
|
||||||
|
df |> pull_unique(x)
|
||||||
|
df |> pull_unique(y)
|
||||||
|
```
|
||||||
|
|
||||||
|
The problem is that regardless of the inputs, our function is always doing literally `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
|
||||||
|
This is a problem of indirection, and it arises because dplyr allows you to refer to the names of variables inside your data frame without any special treatment, which is known as **tidy evaluation**.
|
||||||
|
|
||||||
|
Tidy evaluation is great 95% of the time because it makes our data analyses very concise as we never have to say which data frame a variable comes from; it's obvious from the context.
|
||||||
|
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function: we need some way to tell `distinct()` and `pull()` not to treat `var` as the name of a variable, but instead to look inside `var` for the variable we actually want to use.
|
||||||
|
|
||||||
|
The solution to this problem is **embracing**.
|
||||||
|
By wrapping a variable in `{{ }}` (embracing it) dplyr knows that we want to use the value stored inside that variable.
|
||||||
|
One way to remember what's happening is to think of `{{ }}` like looking down a tunnel --- it's going to make the function look inside of `var` rather than looking for a variable called `var`.
|
||||||
|
|
||||||
|
To make `pull_unique()` work we just need to replace `var` with `{{ var }}`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
pull_unique <- function(df, var) {
|
||||||
df |>
|
df |>
|
||||||
select({{ var }})
|
distinct({{ var }}) |>
|
||||||
|
pull({{ var }})
|
||||||
}
|
}
|
||||||
df |> my_select(rav)
|
diamonds |> pull_unique(clarity)
|
||||||
```
|
```
|
||||||
|
|
||||||
This tells dplyr you want to select not `var` directly, but use the contents of `var` that the user has provided.
|
### When to embrace?
|
||||||
One way to remember what's happening is to think of `{{ }}` like looking down a tunnel --- it's going to look inside of `var`.
|
|
||||||
|
|
||||||
There's much more to learn about tidy evaluation, but this should be enough to get you started writing functions.
|
So the art of wrapping tidyverse functions is basically figuring out which arguments need to be embraced.
|
||||||
|
Fortunately this is pretty easy because you can look it up from the documentation 😄.
|
||||||
|
There are two terms to look for in the docs:
|
||||||
|
|
||||||
### Which arguments need embracing?
|
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` which do computation with variables.
|
||||||
|
- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that work with groups of variables.
|
||||||
|
|
||||||
Not every argument needs to be embraced, only those arguments that are evaluated in the context of the data.
|
TODO: something about ...
|
||||||
These fall into two main groups:
|
|
||||||
|
|
||||||
- Arguments that select variables, like `select()`, `relocate()`, and `rename()`.
|
Your intuition for many common functions should be pretty good --- think about whether it's ok to compute `x + 1` or select multiple variables with `a:x`.
|
||||||
The technical name for these arguments is "tidy-select" arguments, and if you look at the documentation you'll see these arguments thus labelled.
|
There are some that are harder to classify as data-masking or tidy-select because you usually use them with a single variable:
|
||||||
|
|
||||||
- Arguments that compute with variables: `arrange()`, `filter()`, and `summarise()`.
|
|
||||||
The technical name for these arguments is "data-masking".
|
|
||||||
|
|
||||||
It's usually easier to tell which is which, but there are some that are harder because you usually supply just a single variable name.
|
|
||||||
|
|
||||||
- All the arguments to `aes()` are computing arguments because you can write `aes(x * 2, y / 10)`, etc.
|
|
||||||
- The arguments to `group_by()`, `count()`, and `distinct()` are computing arguments because they can all create new variables.
|
- The arguments to `group_by()`, `count()`, and `distinct()` are computing arguments because they can all create new variables.
|
||||||
- The `names_from` argument to `pivot_wider()` is a selecting argument because you can take the names from multiple variables with `names_from = c(x, y, z)`.
|
- The `names_from` argument to `pivot_wider()` is a selecting argument because you can take the names from multiple variables with `names_from = c(x, y, z)`.
|
||||||
|
- It's not a data frame function, but ggplot2's `aes()` uses data-masking because you can write `aes(x * 2, y / 10)`, etc.
|
||||||
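As a rough illustration of the two categories, the sketch below wraps one data-masking function and one tidy-select function. The helper names `filter_above()` and `keep_cols()` and the toy tibble are made up for this example and are not part of the chapter.

```r
library(dplyr)

# Hypothetical helper: filter() uses data-masking, so `var` can be any
# expression computed from the columns (e.g. a + b).
filter_above <- function(df, var, threshold) {
  df |> filter({{ var }} > threshold)
}

# Hypothetical helper: select() uses tidy-select, so `cols` can be a
# selection of several columns (e.g. a:b).
keep_cols <- function(df, cols) {
  df |> select({{ cols }})
}

# Made-up data for the example
toy <- tibble(a = 1:3, b = 4:6, c = 7:9)

masked <- toy |> filter_above(a + b, 6)  # keeps rows where a + b > 6
selected <- toy |> keep_cols(a:b)        # keeps columns a and b
```

In both helpers the user-supplied argument is embraced; what differs is what the caller may pass: a computation such as `a + b` for the data-masking argument, and a selection such as `a:b` for the tidy-select argument.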
|
|
||||||
### Selection arguments
|
In the next two sections we'll explore the sorts of handy functions you might write for data-masking and tidy-select arguments.
|
||||||
|
|
||||||
In @sec-across you'll learn more about `across()` which is a really powerful selecting function that you can use inside of computing arguments.
|
### Data-masking examples
|
||||||
|
|
||||||
### Computing arguments
|
If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
my_summarise2 <- function(data, expr) {
|
summary6 <- function(data, var) {
|
||||||
data %>% summarise(
|
data %>% summarise(
|
||||||
mean = mean({{ expr }}),
|
min = min({{ var }}, na.rm = TRUE),
|
||||||
sum = sum({{ expr }}),
|
mean = mean({{ var }}, na.rm = TRUE),
|
||||||
n = n()
|
median = median({{ var }}, na.rm = TRUE),
|
||||||
|
max = max({{ var }}, na.rm = TRUE),
|
||||||
|
n = n(),
|
||||||
|
n_miss = sum(is.na({{ var }}))
|
||||||
)
|
)
|
||||||
}
|
}
|
||||||
|
diamonds |> summary6(carat)
|
||||||
```
|
```
|
||||||
|
|
||||||
A common use case is to modify `count()`, for example to compute percents:
|
The nice thing about this function is that, because it wraps `summarise()`, you can use it on grouped data:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
diamonds |>
|
||||||
|
group_by(cut) |>
|
||||||
|
summary6(carat)
|
||||||
|
```
|
||||||
|
|
||||||
|
Because the arguments to `summarise()` are data-masking, the `var` argument to `summary6()` is data-masking too.
|
||||||
|
That means you can also summarize computed variables:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
diamonds |>
|
||||||
|
group_by(cut) |>
|
||||||
|
summary6(log10(carat))
|
||||||
|
```
|
||||||
|
|
||||||
|
To summarize multiple variables you'll need to wait until @sec-across, where you'll learn about `across()`, which lets you repeat the same computations with multiple variables.
|
||||||
|
|
||||||
|
Another common helper function is to write a version of `count()` that also computes proportions:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
# https://twitter.com/Diabb6/status/1571635146658402309
|
# https://twitter.com/Diabb6/status/1571635146658402309
|
||||||
count_pct <- function(df, var) {
|
count_prop <- function(df, var, sort = FALSE) {
|
||||||
df |>
|
df |>
|
||||||
count({{ var }}, sort = TRUE) |>
|
count({{ var }}, sort = sort) |>
|
||||||
mutate(pct = n / sum(n))
|
mutate(prop = n / sum(n))
|
||||||
}
|
}
|
||||||
|
|
||||||
mtcars |> count_pct(cyl)
|
diamonds |> count_prop(clarity)
|
||||||
```
|
```
|
||||||
|
|
||||||
Or to pivot the output:
|
Note that this function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables in `…`.
|
||||||
|
|
||||||
|
Or maybe you want to find the unique values for a variable for a subset of the data:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| eval: false
|
unique_where <- function(df, condition, var) {
|
||||||
|
df |>
|
||||||
|
filter({{ condition }}) |>
|
||||||
|
distinct({{ var }}) |>
|
||||||
|
arrange({{ var }}) |>
|
||||||
|
pull()
|
||||||
|
}
|
||||||
|
nycflights13::flights |> unique_where(month == 12, dest)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Tidy-select arguments
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
#| include: false
|
||||||
|
pick <- function(cols) {
|
||||||
|
across({{ cols }})
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
# https://twitter.com/drob/status/1571879373053259776
|
||||||
|
enrich_join <- function(x, y, y_vars = everything(), by = NULL) {
|
||||||
|
left_join(x, y |> select({{ y_vars }}), by = by)
|
||||||
|
}
|
||||||
|
```
|
||||||
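A small usage sketch for `enrich_join()` may help show how the tidy-select argument behaves. The function is repeated from the chunk above so the example is self-contained; the `orders` and `customers` tibbles are made up for illustration.

```r
library(dplyr)

# Repeated from the chunk above so this example runs on its own
enrich_join <- function(x, y, y_vars = everything(), by = NULL) {
  left_join(x, y |> select({{ y_vars }}), by = by)
}

# Made-up data for the example
orders <- tibble(id = c(1, 2, 3), item = c("a", "b", "c"))
customers <- tibble(id = c(1, 2), name = c("Ann", "Bob"), city = c("X", "Y"))

# Only bring across `name` (plus the join key) from `customers`
enriched <- orders |> enrich_join(customers, y_vars = c(id, name), by = "id")
```

Because `y_vars` is passed to `select()`, the selection has to include the join key (`id` here); otherwise `left_join()` has no shared column to join by.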
|
|
||||||
|
Another useful helper is to make a "wide" count, where you make a 2d table of counts.
|
||||||
|
|
||||||
|
```{r}
|
||||||
# Inspired by https://twitter.com/pollicipes/status/1571606508944719876
|
# Inspired by https://twitter.com/pollicipes/status/1571606508944719876
|
||||||
count_wide <- function(data, rows, cols) {
|
count_wide <- function(data, rows, cols) {
|
||||||
data |>
|
data |>
|
||||||
|
@ -404,15 +463,9 @@ mtcars |> count_wide(c(vs, am), cyl)
|
||||||
|
|
||||||
This requires using `pick()` to use tidy-select inside a data-masking (`count()`) function.
|
This requires using `pick()` to use tidy-select inside a data-masking (`count()`) function.
|
||||||
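The body of `count_wide()` is elided by the diff, so the following is only a sketch of how such a function might combine `pick()` with `count()`, not necessarily the code in the commit. It assumes dplyr 1.1.0 or later, which exports `pick()` (the chapter instead defines its own `pick()` helper on top of `across()`), and it assumes the counts are spread with `pivot_wider()`.

```r
library(dplyr)
library(tidyr)

# Sketch only: one plausible body for count_wide(); the commit's actual
# body is not shown in this diff.
count_wide <- function(data, rows, cols) {
  data |>
    # pick() lets tidy-select specifications be used inside the
    # data-masking count()
    count(pick(c({{ rows }}, {{ cols }}))) |>
    pivot_wider(
      names_from = {{ cols }},  # names_from is a tidy-select argument
      values_from = n,
      values_fill = 0
    )
}

wide <- mtcars |> count_wide(c(vs, am), cyl)
```

Here `rows` and `cols` are both tidy-select specifications, which is why they cannot be passed to `count()` directly and go through `pick()` first.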
|
|
||||||
```{r}
|
### Learning more
|
||||||
# https://twitter.com/JustinTPriest/status/1571614088329048064
|
|
||||||
# https://twitter.com/FBpsy/status/1571909992139362304
|
|
||||||
# https://twitter.com/ekholm_e/status/1571900197894078465
|
|
||||||
|
|
||||||
enrich_join <- function(x, y, ..., by = NULL) {
|
Once you have the basics under your belt, you can learn more about the full range of tidy evaluation possibilities by reading `vignette("programming", package = "dplyr")`.
|
||||||
left_join(x, y %>% select(...), by = by)
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Style
|
## Style
|
||||||
|
|
||||||
|
|