# Functions {#sec-functions}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```

## Introduction

One of the best ways to improve your reach as a data scientist is to write functions.
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
Writing a function has three big advantages over using copy-and-paste:

1. You can give a function an evocative name that makes your code easier to understand.

2. As requirements change, you only need to update code in one place, instead of many.

3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

A good rule of thumb is to consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
In this chapter, you'll learn about three useful types of functions:

- Vector functions take one or more vectors as input and return a vector as output.
- Data frame functions take a data frame as input and return a data frame as output.
- Plot functions take a data frame as input and return a plot as output.

Each of these sections includes many examples to help you generalize the patterns that you see.
These examples wouldn't be possible without the help of folks on Twitter, and we encourage you to follow the links in the comments to see the original inspirations.
You might also want to read the original motivating tweets for [general functions](https://twitter.com/hadleywickham/status/1571603361350164486) and [plotting functions](https://twitter.com/hadleywickham/status/1574373127349575680) to see even more functions.

### Prerequisites

We'll wrap up a variety of functions from around the tidyverse.
We'll also use nycflights13 as a source of familiar data to use our functions with.

```{r}
#| message: false
library(tidyverse)
library(nycflights13)
```

## Vector functions

We'll begin with vector functions: functions that take one or more vectors and return a vector result.
For example, take a look at this code.
What does it do?

```{r}
df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) /
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) /
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) /
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) /
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)
```

You might be able to puzzle out that this rescales each column to have a range from 0 to 1.
But did you spot the mistake?
When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an `a` to a `b`.
Preventing this type of mistake is one very good reason to learn how to write functions.

### Writing a function

To write a function you need to first analyse your repeated code to figure out what parts are constant and what parts vary.
If we take the code above and pull it outside of `mutate()`, it's a little easier to see the pattern because each repetition is now one line:

```{r}
#| eval: false

(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
```

To make this a bit clearer we can replace the bit that varies with `█`:

```{r}
#| eval: false
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
```

To turn this into a function you need three things:

1. A **name**.
   Here we'll use `rescale01` because this function rescales a vector to lie between 0 and 1.

2. The **arguments**.
   The arguments are things that vary across calls and our analysis above tells us that we have just one.
   We'll call it `x` because this is the conventional name for a numeric vector.

3. The **body**.
   The body is the code that's repeated across all the calls.

Then you create a function by following the template:

```{r}
name <- function(arguments) {
  body
}
```

For this case that leads to:

```{r}
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```

At this point you might test with a few simple inputs to make sure you've captured the logic correctly:

```{r}
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```

Then you can rewrite the call to `mutate()` as:

```{r}
df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)
```

(In @sec-iteration, you'll learn how to use `across()` to reduce the duplication even further so all you need is `df |> mutate(across(a:d, rescale01))`).

### Improving our function

You might notice that the `rescale01()` function does some unnecessary work --- instead of computing `min()` twice and `max()` once we could instead compute both the minimum and maximum in one step with `range()`:

```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
```

Or you might try this function on a vector that includes an infinite value:

```{r}
x <- c(1:10, Inf)
rescale01(x)
```

That result is not particularly useful so we could ask `range()` to ignore infinite values:

```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(x)
```

These changes illustrate an important benefit of functions: because we've moved the repeated code into a function, we only need to make the change in one place.

### Mutate functions

Now you've got the basic idea of functions, let's take a look at a whole bunch of examples.
We'll start by looking at "mutate" functions, i.e. functions that work well inside of `mutate()` and `filter()` because they return an output of the same length as the input.

Let's start with a simple variation of `rescale01()`.
Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:

```{r}
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
```
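
We can check it with a simple input (the values here are arbitrary): for `c(0, 5, 10)` the mean is 5 and the standard deviation is 5, so we expect -1, 0, and 1:

```{r}
z_score(c(0, 5, 10))
```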

Or maybe you want to wrap up a straightforward `case_when()` and give it a useful name.
For example, this `clamp()` function ensures all values of a vector lie between a minimum and a maximum:

```{r}
clamp <- function(x, min, max) {
  case_when(
    x < min ~ min,
    x > max ~ max,
    .default = x
  )
}

clamp(1:10, min = 3, max = 7)
```

Or maybe you'd rather mark those values as `NA`s:

```{r}
na_outside <- function(x, min, max) {
  case_when(
    x < min ~ NA,
    x > max ~ NA,
    .default = x
  )
}

na_outside(1:10, min = 3, max = 7)
```

Of course functions don't just need to work with numeric variables.
You might want to do some repeated string manipulation.
Maybe you need to make the first character upper case:

```{r}
first_upper <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

first_upper("hello")
```

Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:

```{r}
# https://twitter.com/NVlabormarket/status/1571939851922198530
clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |>
    str_remove_all("%") |>
    str_remove_all(",") |>
    str_remove_all(fixed("$")) |>
    as.numeric()
  if_else(is_pct, num / 100, num)
}

clean_number("$12,300")
clean_number("45%")
```

Sometimes your functions will be highly specialized for one data analysis step.
For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with `NA`:

```{r}
fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}
```
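
Trying it on a few made-up values, the two sentinel codes become `NA` while the real values pass through unchanged:

```{r}
fix_na(c(105, 997, 23, 999))
```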

We've focused on examples that take a single vector because we think they're the most common.
But there's no reason that your function can't take multiple vector inputs.
For example, you might want to compute the distance between two locations on the globe using the haversine formula.
This requires four vectors:

```{r}
# https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
haversine <- function(long1, lat1, long2, lat2, round = 3) {
  # convert to radians
  long1 <- long1 * pi / 180
  lat1 <- lat1 * pi / 180
  long2 <- long2 * pi / 180
  lat2 <- lat2 * pi / 180

  R <- 6371 # Earth mean radius in km
  a <- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
  d <- R * 2 * asin(sqrt(a))

  round(d, round)
}
```
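
For example, using rough coordinates for New York (longitude -74.0, latitude 40.7) and London (longitude -0.1, latitude 51.5), the function returns the great-circle distance in kilometers, a little over 5,500 km:

```{r}
haversine(-74.0, 40.7, -0.1, 51.5)
```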

### Summary functions

Another important family of vector functions is summary functions, functions that return a single value for use in `summarize()`.
Sometimes this can just be a matter of setting a default argument or two:

```{r}
commas <- function(x) {
  str_flatten(x, collapse = ", ", last = " and ")
}

commas(c("cat", "dog", "pigeon"))
```

Or you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:

```{r}
cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(runif(100, min = 0, max = 50))
cv(runif(100, min = 0, max = 500))
```

Or maybe you just want to make a common pattern easier to remember by giving it a memorable name:

```{r}
# https://twitter.com/gbganalyst/status/1571619641390252033
n_missing <- function(x) {
  sum(is.na(x))
}
```
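
For example, with an arbitrary vector that contains two missing values we expect the answer 2:

```{r}
n_missing(c(1, NA, 3, NA))
```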

You can also write functions with multiple vector inputs.
For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:

```{r}
# https://twitter.com/neilgcurrie/status/1571607727255834625
mape <- function(actual, predicted) {
  sum(abs((actual - predicted) / actual)) / length(actual)
}
```
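
We can check it with a couple of made-up values: for actuals `c(10, 20)` and predictions `c(12, 18)` the absolute percentage errors are 0.2 and 0.1, so we expect their mean, 0.15:

```{r}
mape(actual = c(10, 20), predicted = c(12, 18))
```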

::: callout-note
## RStudio

Once you start writing functions, there are two RStudio shortcuts that are super useful:

- To find the definition of a function that you've written, place the cursor on the name of the function and press `F2`.

- To quickly jump to a function, press `Ctrl + .` to open the fuzzy file and function finder and type the first few letters of your function name.
  You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.
:::

### Exercises

1. Practice turning the following code snippets into functions.
   Think about what each function does.
   What would you call it?
   How many arguments does it need?

   ```{r}
   #| eval: false

   mean(is.na(x))
   mean(is.na(y))
   mean(is.na(z))

   x / sum(x, na.rm = TRUE)
   y / sum(y, na.rm = TRUE)
   z / sum(z, na.rm = TRUE)

   round(x / sum(x, na.rm = TRUE) * 100, 1)
   round(y / sum(y, na.rm = TRUE) * 100, 1)
   round(z / sum(z, na.rm = TRUE) * 100, 1)
   ```

2. In the second variant of `rescale01()`, infinite values are left unchanged.
   Can you rewrite `rescale01()` so that `-Inf` is mapped to 0, and `Inf` is mapped to 1?

3. Given a vector of birthdates, write a function to compute the age in years.

4. Write your own functions to compute the variance and skewness of a numeric vector.
   Variance is defined as $$
   \mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
   $$ where $\bar{x} = (\sum_i^n x_i) / n$ is the sample mean.
   Skewness is defined as $$
   \mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
   $$

5. Write `both_na()`, a summary function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.

6. Read the documentation to figure out what the following functions do.
   Why are they useful even though they are so short?

   ```{r}
   is_directory <- function(x) file.info(x)$isdir
   is_readable <- function(x) file.access(x, 4) == 0
   ```

## Data frame functions

Vector functions are useful for pulling out code that's repeated within a dplyr verb.
But you'll often also repeat the verbs themselves, particularly within a large pipeline.
When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function.
Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or vector.

To let you write a function that uses dplyr verbs, we'll first introduce you to the challenge of indirection and how you can overcome it with embracing, `{{ }}`.
With this theory under your belt, we'll then show you a bunch of examples to illustrate what you might do with it.

### Indirection and tidy evaluation

When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection.
Let's illustrate the problem with a very simple function: `grouped_mean()`.
The goal of this function is to compute the mean of `mean_var` grouped by `group_var`:

```{r}
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by(group_var) |>
    summarize(mean(mean_var))
}
```

If we try and use it, we get an error:

```{r}
#| error: true
diamonds |> grouped_mean(cut, carat)
```

To make the problem a bit more clear, we can use a made up data frame:

```{r}
df <- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)

df |> grouped_mean(group, x)
df |> grouped_mean(group, y)
```

Regardless of how we call `grouped_mean()` it always does `df |> group_by(group_var) |> summarize(mean(mean_var))`, instead of `df |> group_by(group) |> summarize(mean(x))` or `df |> group_by(group) |> summarize(mean(y))`.
This is a problem of indirection, and it arises because dplyr uses **tidy evaluation** to allow you to refer to the names of variables inside your data frame without any special treatment.

Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
Here we need some way to tell `group_by()` and `summarize()` not to treat `group_var` and `mean_var` as the names of variables, but instead look inside them for the variable we actually want to use.

Tidy evaluation includes a solution to this problem called **embracing** 🤗.
Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name.
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a dplyr function look inside of `var` rather than looking for a variable called `var`.

So to make `grouped_mean()` work, we need to surround `group_var` and `mean_var` with `{{ }}`:

```{r}
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}

diamonds |> grouped_mean(cut, carat)
```

Success!

### When to embrace? {#sec-embracing}

So the key challenge in writing data frame functions is figuring out which arguments need to be embraced.
Fortunately, this is easy because you can look it up from the documentation 😄.
There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:

- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarize()` that compute with variables.

- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that select variables.

Your intuition about which arguments use tidy evaluation should be good for many common functions --- just think about whether you can compute (e.g. `x + 1`) or select (e.g. `a:x`).

In the following sections, we'll explore the sorts of handy functions you might write once you understand embracing.

### Common use cases

If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:

```{r}
summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

diamonds |> summary6(carat)
```

(Whenever you wrap `summarize()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)

The nice thing about this function is, because it wraps `summarize()`, you can use it on grouped data:

```{r}
diamonds |>
  group_by(cut) |>
  summary6(carat)
```

Furthermore, since the arguments to `summarize()` are data-masking, the `var` argument to `summary6()` is data-masking too.
That means you can also summarize computed variables:

```{r}
diamonds |>
  group_by(cut) |>
  summary6(log10(carat))
```

To summarize multiple variables, you'll need to wait until @sec-across, where you'll learn how to use `across()`.

Another popular `summarize()` helper function is a version of `count()` that also computes proportions:

```{r}
# https://twitter.com/Diabb6/status/1571635146658402309
count_prop <- function(df, var, sort = FALSE) {
  df |>
    count({{ var }}, sort = sort) |>
    mutate(prop = n / sum(n))
}

diamonds |> count_prop(clarity)
```

This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables in `...`.

Or maybe you want to find the sorted unique values of a variable for a subset of the data.
Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:

```{r}
unique_where <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>
    distinct({{ var }}) |>
    arrange({{ var }})
}

# Find all the destinations in December
flights |> unique_where(month == 12, dest)
# Which months did plane N14228 fly in?
flights |> unique_where(tailnum == "N14228", month)
```

Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.

We've designed all these examples to take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.

```{r}
flights_sub <- function(rows, cols) {
  flights |>
    filter({{ rows }}) |>
    select(time_hour, carrier, flight, {{ cols }})
}

flights_sub(dest == "IAH", contains("time"))
```

### Data-masking vs. tidy-selection

Sometimes you want to select variables inside a function that uses data-masking.
For example, imagine you want to write a `count_missing()` that counts the number of missing observations in rows.
You might try writing something like:

```{r}
#| error: true
count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by({{ group_vars }}) |>
    summarize(n_miss = sum(is.na({{ x_var }})))
}

flights |>
  count_missing(c(year, month, day), dep_time)
```

This doesn't work because `group_by()` uses data-masking, not tidy-selection.
We can work around that problem by using the handy `pick()` function, which allows you to use tidy-selection inside data-masking functions:

```{r}
count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by(pick({{ group_vars }})) |>
    summarize(n_miss = sum(is.na({{ x_var }})))
}

flights |>
  count_missing(c(year, month, day), dep_time)
```

Another convenient use of `pick()` is to make a 2d table of counts.
Here we count using all the variables in the `rows` and `columns`, then use `pivot_wider()` to rearrange the counts into a grid:

```{r}
# https://twitter.com/pollicipes/status/1571606508944719876
count_wide <- function(data, rows, cols) {
  data |>
    count(pick(c({{ rows }}, {{ cols }}))) |>
    pivot_wider(
      names_from = {{ cols }},
      values_from = n,
      names_sort = TRUE,
      values_fill = 0
    )
}

diamonds |> count_wide(clarity, cut)
diamonds |> count_wide(c(clarity, color), cut)
```

While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the `pivot_wider()` docs you can see that `names_from` uses tidy-selection.

### Exercises

1. Using the datasets from nycflights13, write a function that:

   1. Finds all flights that were cancelled (i.e. `is.na(arr_time)`) or delayed by more than an hour.

      ```{r}
      #| eval: false
      flights |> filter_severe()
      ```

   2. Counts the number of cancelled flights and the number of flights delayed by more than an hour.

      ```{r}
      #| eval: false
      flights |> group_by(dest) |> summarize_severe()
      ```

   3. Finds all flights that were cancelled or delayed by more than a user supplied number of hours:

      ```{r}
      #| eval: false
      flights |> filter_severe(hours = 2)
      ```

   4. Summarizes the weather to compute the minimum, mean, and maximum, of a user supplied variable:

      ```{r}
      #| eval: false
      weather |> summarize_weather(temp)
      ```

   5. Converts the user supplied variable that uses clock time (e.g. `dep_time`, `arr_time`, etc.) into a decimal time (i.e. hours + (minutes / 60)).

      ```{r}
      #| eval: false
      weather |> standardise_time(sched_dep_time)
      ```

2. For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: `distinct()`, `count()`, `group_by()`, `rename_with()`, `slice_min()`, `slice_sample()`.

3. Generalize the following function so that you can supply any number of variables to count.

   ```{r}
   count_prop <- function(df, var, sort = FALSE) {
     df |>
       count({{ var }}, sort = sort) |>
       mutate(prop = n / sum(n))
   }
   ```

## Plot functions

Instead of returning a data frame, you might want to return a plot.
Fortunately, you can use the same techniques with ggplot2, because `aes()` is a data-masking function.
For example, imagine that you're making a lot of histograms:

```{r}
#| fig-show: hide
diamonds |>
  ggplot(aes(carat)) +
  geom_histogram(binwidth = 0.1)

diamonds |>
  ggplot(aes(carat)) +
  geom_histogram(binwidth = 0.05)
```

Wouldn't it be nice if you could wrap this up into a histogram function?
This is easy as pie once you know that `aes()` is a data-masking function and you need to embrace:

```{r}
histogram <- function(df, var, binwidth = NULL) {
  df |>
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth)
}

diamonds |> histogram(carat, 0.1)
```

Note that `histogram()` returns a ggplot2 plot, meaning you can still add on additional components if you want.
Just remember to switch from `|>` to `+`:

```{r}
diamonds |>
  histogram(carat, 0.1) +
  labs(x = "Size (in carats)", y = "Number of diamonds")
```

### More variables

It's straightforward to add more variables to the mix.
For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:

```{r}
# https://twitter.com/tyler_js_smith/status/1574377116988104704
linearity_check <- function(df, x, y) {
  df |>
    ggplot(aes({{ x }}, {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", color = "red", se = FALSE) +
    geom_smooth(method = "lm", color = "blue", se = FALSE)
}

starwars |>
  filter(mass < 1000) |>
  linearity_check(mass, height)
```

Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:

```{r}
# https://twitter.com/ppaxisa/status/1574398423175921665
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
  df |>
    ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
    stat_summary_hex(
      aes(color = after_scale(fill)), # make border same color as fill
      bins = bins,
      fun = fun,
    )
}

diamonds |> hex_plot(carat, price, depth)
```

### Combining with dplyr

Some of the most useful helpers combine a dash of dplyr with ggplot2.
For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top (also note the `:=` operator, which allows you to inject names with glue syntax on the left-hand side of `:=`; type ?\`:=\` for more details):

```{r}
sorted_bars <- function(df, var) {
  df |>
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}

diamonds |> sorted_bars(cut)
```

Or maybe you want to make it easy to draw a bar plot just for a subset of the data:

```{r}
conditional_bars <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>
    ggplot(aes({{ var }})) +
    geom_bar()
}

diamonds |> conditional_bars(cut == "Good", clarity)
```

You can also get creative and display data summaries in other ways.
For example, this code uses the axis labels to display the highest value of each group.
As you learn more about ggplot2, the power of your functions will continue to increase.

```{r}
# https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
fancy_ts <- function(df, val, group) {
  labs <- df |>
    group_by({{ group }}) |>
    summarize(breaks = max({{ val }}))

  df |>
    ggplot(aes(date, {{ val }}, group = {{ group }}, color = {{ group }})) +
    geom_path() +
    scale_y_continuous(
      breaks = labs$breaks,
      labels = scales::label_comma(),
      minor_breaks = NULL,
      guide = guide_axis(position = "right")
    )
}

df <- tibble(
  dist1 = sort(rnorm(50, 5, 2)),
  dist2 = sort(rnorm(50, 8, 3)),
  dist4 = sort(rnorm(50, 15, 1)),
  date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
)

df <- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")

fancy_ts(df, value, dist_name)
```

Next we'll discuss two more complicated cases: faceting and automatic labeling.

### Faceting

Unfortunately, programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work.
So you have to learn a new syntax.
When programming with facets, instead of writing `~ x`, you need to write `vars(x)` and instead of `~ x + y` you need to write `vars(x, y)`.
The only advantage of this syntax is that `vars()` uses tidy evaluation so you can embrace within it:

```{r}
# https://twitter.com/sharoz/status/1574376332821204999
foo <- function(x) {
  ggplot(mtcars, aes(mpg, disp)) +
    geom_point() +
    facet_wrap(vars({{ x }}))
}

foo(cyl)
```

As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable.
For example, the following function makes it particularly easy to interactively explore the conditional distribution of `carat` from the diamonds dataset.

```{r}
# https://twitter.com/yutannihilat_en/status/1574387230025875457
density <- function(color, facets, binwidth = 0.1) {
  diamonds |>
    ggplot(aes(carat, after_stat(density), color = {{ color }})) +
    geom_freqpoly(binwidth = binwidth) +
    facet_wrap(vars({{ facets }}))
}

density()
density(cut)
density(cut, clarity)
```
### Labeling
|
||
|
||
Remember the histogram function we showed you earlier?
|
||
|
||
```{r}
|
||
histogram <- function(df, var, binwidth = NULL) {
|
||
df |>
|
||
ggplot(aes({{ var }})) +
|
||
geom_histogram(binwidth = binwidth)
|
||
}
|
||
```

Wouldn't it be nice if we could label the output with the variable and the bin width that was used?
To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we haven't talked about before: rlang.
rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).

To solve the labeling problem we can use `rlang::englue()`.
This works similarly to `str_glue()`, so any value wrapped in `{ }` will be inserted into the string.
But it also understands `{{ }}`, which automatically inserts the appropriate variable name:

```{r}
histogram <- function(df, var, binwidth) {
  label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")

  df |>
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}

diamonds |> histogram(carat, 0.1)
```

You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.
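
For example, a sketch that reuses `rlang::englue()` to generate an axis label as well as the title; the name `histogram2` is hypothetical, not from the original:

```{r}
# Sketch: englue() can build any string in the plot, here an x-axis
# label. histogram2 is a hypothetical name for this variant.
histogram2 <- function(df, var, binwidth) {
  df |>
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(
      title = rlang::englue("A histogram of {{ var }}"),
      x = rlang::englue("{{ var }} (binwidth = {binwidth})")
    )
}

diamonds |> histogram2(carat, 0.1)
```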

### Exercises

Build up a rich plotting function by incrementally implementing each of the steps below:

1.  Draw a scatterplot given a dataset and `x` and `y` variables.

2.  Add a line of best fit (i.e. a linear model with no standard errors).

3.  Add a title.

## Style

R doesn't care what your function or arguments are called, but names make a big difference for humans.
Ideally, the name of your function will be short, but clearly evoke what the function does.
That's hard!
But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.

Generally, function names should be verbs, and arguments should be nouns.
There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. `mean()` is better than `compute_mean()`), or accesses some property of an object (i.e. `coef()` is better than `get_coefficients()`).
Use your best judgement and don't be afraid to rename a function if you figure out a better name later.

```{r}
#| eval: false

# Too short
f()

# Not a verb, or descriptive
my_awesome_function()

# Long, but clear
impute_missing()
collapse_years()
```

R also doesn't care about how you use white space in your functions, but future readers will.
Continue to follow the rules from @sec-workflow-style.
Additionally, `function()` should always be followed by squiggly brackets (`{}`), and the contents should be indented by an additional two spaces.
This makes it easier to see the hierarchy in your code by skimming the left-hand margin.

```{r}
# Missing extra two spaces
density <- function(color, facets, binwidth = 0.1) {
diamonds |>
  ggplot(aes(carat, after_stat(density), color = {{ color }})) +
  geom_freqpoly(binwidth = binwidth) +
  facet_wrap(vars({{ facets }}))
}

# Pipe indented incorrectly
density <- function(color, facets, binwidth = 0.1) {
  diamonds |>
    ggplot(aes(carat, after_stat(density), color = {{ color }})) +
      geom_freqpoly(binwidth = binwidth) +
      facet_wrap(vars({{ facets }}))
}
```
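
For comparison, here is the same function indented following these rules:

```{r}
# Correctly styled: body indented two spaces, and each pipeline step
# indented consistently under diamonds |>
density <- function(color, facets, binwidth = 0.1) {
  diamonds |>
    ggplot(aes(carat, after_stat(density), color = {{ color }})) +
    geom_freqpoly(binwidth = binwidth) +
    facet_wrap(vars({{ facets }}))
}
```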

As you can see, we recommend putting extra spaces inside of `{{ }}`.
This makes it very obvious that something unusual is happening.

### Exercises

1.  Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.

    ```{r}
    f1 <- function(string, prefix) {
      substr(string, 1, nchar(prefix)) == prefix
    }

    f3 <- function(x, y) {
      rep(y, length.out = length(x))
    }
    ```

2.  Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.

3.  Make a case for why `norm_r()`, `norm_d()` etc. would be better than `rnorm()`, `dnorm()`.
    Make a case for the opposite.

## Summary

In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot.
Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.

We have only shown you the bare minimum to get started with functions and there's much more to learn.
A few places to learn more are:

-   To learn more about programming with tidy evaluation, see useful recipes in [programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html) and [programming with tidyr](https://tidyr.tidyverse.org/articles/programming.html) and learn more about the theory in [What is data-masking and why do I need {{?](https://rlang.r-lib.org/reference/topic-data-mask.html).
-   To learn more about reducing duplication in your ggplot2 code, read the [Programming with ggplot2](https://ggplot2-book.org/programming.html){.uri} chapter of the ggplot2 book.
-   For more advice on function style, see the [tidyverse style guide](https://style.tidyverse.org/functions.html){.uri}.

In the next chapter, we'll dive into some of the details of R's vector data structures that we've omitted so far.
These are not immediately useful by themselves, but are a necessary foundation for the following chapter on iteration, which gives you further tools for reducing code duplication.