# Functions {#sec-functions}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
## Introduction
One of the best ways to improve your reach as a data scientist is to write functions.
Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
Writing a function has three big advantages over using copy-and-paste:

1.  You can give a function an evocative name that makes your code easier to understand.

2.  As requirements change, you only need to update code in one place, instead of many.

3.  You eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another).

Writing good functions is a lifetime journey.
Even after using R for many years we still learn new techniques and better ways of approaching old problems.
The goal of this chapter is to get you started on your journey with functions with two pragmatic and useful types of functions:

-   Vector functions work with individual vectors and reduce duplication within your `summarise()` and `mutate()` calls.
-   Data frame functions work with entire data frames and reduce duplication within your large data analysis pipelines.

The chapter concludes with some suggestions for how to style your code.
Good code style is like correct punctuation.
Youcanmanagewithoutit, but it sure makes things easier to read!
As with styles of punctuation, there are many possible variations.
Here we present the style we use in our code, but the most important thing is to be consistent.
### Prerequisites
```{r}
library(tidyverse)
```
## Vector functions
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
For example, take a look at this code.
What does it do?
```{r}
df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) /
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) /
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) /
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) /
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
)
```
You might be able to puzzle out that this rescales each column to have a range from 0 to 1.
But did you spot the mistake?
Hadley made an error when copying-and-pasting the code for `b`: he forgot to change an `a` to a `b`.
Extracting repeated code out into a function is a good idea because it prevents you from making this type of mistake.
To write a function you need to first analyse the code.
How many inputs does it have?
```{r}
#| eval: false
(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```
This code only has one input: `df$a`.
(If you're surprised that `TRUE` is not an input, you can explore why in the exercise below.)
To make the inputs clearer, it's a good idea to rewrite the code using temporary variables with general names.
Here this code only requires a single numeric vector, so we'll call it `x`:
```{r}
x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```
There is some duplication in this code.
We're computing the range of the data three times, so it makes sense to do it in one step:
```{r}
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
```
Pulling out intermediate calculations into named variables is a good practice because it makes it more clear what the code is doing.
### Creating a new function
Now that we've simplified the code, and checked that it still works, we can turn it into a function:
```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
```
There are three key steps to creating a new function:

1.  You need to pick a **name** for the function.
    Here we used `rescale01` because this function rescales a vector to lie between 0 and 1.

2.  You list the inputs, or **arguments**, to the function inside `function`.
    Here we have just one argument.
    If we had more the call would look like `function(x, y, z)`.

3.  You place the code you have developed in the **body** of the function, a `{` block that immediately follows `function(...)`.

Note the overall process: we only made the function after we'd figured out how to make it work with a simple input.
It's easier to start with working code and turn it into a function; it's harder to create a function and then try to make it work.

At this point it's a good idea to check your function with a few different inputs:
```{r}
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```
As you write more and more functions you'll eventually want to convert these informal, interactive tests into formal, automated tests.
That process is called unit testing.
Unfortunately, it's beyond the scope of this book, but you can learn about it in <https://r-pkgs.org/testing-basics.html>.
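As a stepping stone towards formal tests, the informal checks above can be written down as assertions.
This is only a minimal sketch using base R's `stopifnot()` rather than a testing framework; `rescale01()` is repeated here so that the chunk runs on its own.

```{r}
# Repeated so that this chunk is self-contained
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Each line throws an error if the result is not what we expect
stopifnot(
  isTRUE(all.equal(rescale01(c(0, 5, 10)), c(0, 0.5, 1))),
  isTRUE(all.equal(rescale01(c(-10, 0, 10)), c(0, 0.5, 1))),
  isTRUE(all.equal(rescale01(c(1, 2, 3, NA, 5)), c(0, 0.25, 0.5, NA, 1)))
)
```

If every assertion holds the chunk runs silently, so a later change that breaks `rescale01()` shows up as an error.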
We can simplify the original example now that we have a function:
```{r}
df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d)
)
```
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors.
There is still quite a bit of duplication since we're doing the same thing to multiple columns.
We could reduce that duplication with `across()` which you'll learn more about in @sec-iteration:
```{r}
df |>
  mutate(across(a:d, rescale01))
```
Another advantage of functions is that if our requirements change, we only need to make the change in one place.
For example, we might discover that some of our variables include infinite values, and `rescale01()` fails:
```{r}
x <- c(1:10, Inf)
rescale01(x)
```
Because we've extracted the code into a function, we only need to make the fix in one place:
```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
```
This is an important part of the "do not repeat yourself" (or DRY) principle.
The more repetition you have in your code, the more places you need to remember to update when things change (and they always do!), and the more likely you are to create bugs over time.
```{r}
rescale_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

fix_na <- function(x) {
  if_else(x %in% c(99, 999, 9999), NA, x)
}

squish <- function(x, min, max) {
  case_when(
    x < min ~ min,
    x > max ~ max,
    .default = x
  )
}

first_upper <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}
```
### Summary functions
```{r}
cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Compute confidence interval around the mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
```
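Summary functions of this kind can have one input or several.
As a further illustration, here is a minimal sketch of two more; the names `n_missing()` and `mape()` are hypothetical choices for this example, not functions from a package.

```{r}
# Hypothetical helper: count the missing values in a vector
n_missing <- function(x) {
  sum(is.na(x))
}

# Hypothetical helper: mean absolute percentage error of predictions
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual))
}

n_missing(c(1, NA, 3, NA))
mape(actual = c(100, 200), predicted = c(110, 180))
```

Both collapse their input to a single number, which is what makes them usable inside `summarise()`.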
### Exercises
1. Why is `TRUE` not a parameter to `rescale01()`?
What would happen if `x` contained a single missing value, and `na.rm` was `FALSE`?
2. In the second variant of `rescale01()`, infinite values are left unchanged.
Rewrite `rescale01()` so that `-Inf` is mapped to 0, and `Inf` is mapped to 1.
3. Practice turning the following code snippets into functions.
Think about what each function does.
What would you call it?
How many arguments does it need?
Can you rewrite it to be more expressive or less duplicative?
```{r}
#| eval: false
mean(is.na(x))

x / sum(x, na.rm = TRUE)

sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
```
4. Write your own functions to compute the variance and skewness of a numeric vector.
Variance is defined as $$
\mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
$$ where $\bar{x} = (\sum_i^n x_i) / n$ is the sample mean.
Skewness is defined as $$
\mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
$$
5. Write `both_na()`, a function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.
6. What do the following functions do?
Why are they useful even though they are so short?
```{r}
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
```
7. Read the [complete lyrics](https://en.wikipedia.org/wiki/Little_Bunny_Foo_Foo) to "Little Bunny Foo Foo".
There's a lot of duplication in this song.
Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.
## Tidyeval
```{r}
mutate_y <- function(data) {
mutate(data, y = a + x)
}
```
## Select functions
We'll start with select-style verbs.
The most important example is `dplyr::select()` but it also includes `relocate()`, `rename()`, `pull()`, as well as `pivot_longer()` and `pivot_wider()`.
Technically, select-style describes an argument, not a function, but in most cases the arguments to a function are select-style or mutate-style, not both.
You can recognize a select-style argument by looking in the docs for its technical name, "tidyselect", so called because it's powered by the [tidyselect](https://tidyselect.r-lib.org/) package.
When you have the data-variable in an env-variable that is a function argument, you **embrace** the argument by surrounding it in doubled braces, e.g. `{{ var }}`.
`across()` is a particularly important `select()` function.
We'll come back to it in @sec-across.
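To make embracing concrete, here is a minimal sketch of a function with a select-style argument.
The helper `pick_head()` and its arguments `cols` and `n` are hypothetical names invented for this example; it uses the built-in `mtcars` data.

```{r}
library(dplyr)

# Hypothetical helper: keep the chosen columns, then show the first n rows.
# `cols` is embraced so the caller can use any tidyselect expression.
pick_head <- function(data, cols, n = 3) {
  data |>
    select({{ cols }}) |>
    head(n)
}

pick_head(mtcars, c(mpg, cyl))
pick_head(mtcars, starts_with("d"), n = 2)
```

Without the doubled braces, `select()` would look for a column literally called `cols` instead of the columns the caller supplied.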
## Mutate functions
The section above helps you reduce repeated code inside dplyr verbs.
This section teaches you how to reduce duplication outside of dplyr verbs.
As well as `mutate()` this includes `arrange()`, `count()`, `filter()`, `group_by()`, `distinct()`, and `summarise()`.
You can recognise if an argument is mutate-style by looking for its technical name "data-masking" in the documentation.
Tidy evaluation is hard to notice because it's the air that you breathe in this book.
Writing functions with it is hard, because you have to explicitly think about things that you haven't had to before: things that the tidyverse has been designed to help you avoid thinking about so that you can focus on your analysis.
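The same embracing idea applies to mutate-style (data-masking) arguments.
The following is a minimal sketch under our own assumptions: `grouped_mean()`, `group_var`, and `mean_var` are hypothetical names, and the built-in `mtcars` data stands in for your own.

```{r}
library(dplyr)

# Hypothetical helper: mean of one variable within the groups of another.
# Both arguments are embraced because group_by() and summarise() use data-masking.
grouped_mean <- function(data, group_var, mean_var) {
  data |>
    group_by({{ group_var }}) |>
    summarise(mean = mean({{ mean_var }}, na.rm = TRUE))
}

grouped_mean(mtcars, cyl, mpg)
```

The result has one row per value of `cyl`, with the group mean of `mpg` in a column called `mean`.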
## Style
It's important to remember that functions are not just for the computer, but are also for humans.
R doesn't care what your function is called, or what comments it contains, but these are important for human readers.
This section discusses some things that you should bear in mind when writing functions that humans can understand.
Excerpt from <https://style.tidyverse.org/functions.html>
### Names
The name of a function is important.
Ideally, the name of your function will be short, but clearly evoke what the function does.
That's hard!
But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.
Generally, function names should be verbs, and arguments should be nouns.
There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. `mean()` is better than `compute_mean()`), or accesses some property of an object (e.g. `coef()` is better than `get_coefficients()`).
A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine".
Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
```{r}
#| eval: false
# Too short
f()

# Not a verb, or descriptive
my_awesome_function()

# Long, but clear
impute_missing()
collapse_years()
```
### Indenting
Both `if` and `function` should (almost) always be followed by squiggly brackets (`{}`), and the contents should be indented by two spaces.
This makes it easier to see the hierarchy in your code by skimming the left-hand margin.
An opening curly brace should never go on its own line and should always be followed by a new line.
A closing curly brace should always go on its own line, unless it's followed by `else`.
Always indent the code inside curly braces.
```{r}
#| eval: false
# Good
if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}

# Bad
if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
}
else {
  y ^ x
}
```
### Exercises
1. What's the difference between `if` and `ifelse()`?
Carefully read the help and construct three examples that illustrate the key differences.
2. Write a greeting function that says "good morning", "good afternoon", or "good evening", depending on the time of day.
(Hint: use a time argument that defaults to `lubridate::now()`.
That will make it easier to test your function.)
3. Implement a `fizzbuzz` function.
It takes a single number as input.
If the number is divisible by three, it returns "fizz".
If it's divisible by five it returns "buzz".
If it's divisible by three and five, it returns "fizzbuzz".
Otherwise, it returns the number itself.
Make sure you first write working code before you create the function.
4. How could you use `cut()` to simplify this set of nested if-else statements?
```{r}
#| eval: false
if (temp <= 0) {
  "freezing"
} else if (temp <= 10) {
  "cold"
} else if (temp <= 20) {
  "cool"
} else if (temp <= 30) {
  "warm"
} else {
  "hot"
}
```
How would you change the call to `cut()` if we used `<` instead of `<=`?
What is the other chief advantage of `cut()` for this problem?
(Hint: what happens if you have many values in `temp`?)
5. What happens if you use `switch()` with numeric values?
6. What does this `switch()` call do?
What happens if `x` is "e"?
```{r}
#| eval: false
switch(x,
  a = ,
  b = "ab",
  c = ,
  d = "cd"
)
```
Experiment, then carefully read the documentation.
### Exercises
1. Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.
```{r}
f1 <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

f2 <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}

f3 <- function(x, y) {
  rep(y, length.out = length(x))
}
```
2. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.
3. Compare and contrast `rnorm()` and `MASS::mvrnorm()`.
How could you make them more consistent?
4. Make a case for why `norm_r()`, `norm_d()` etc would be better than `rnorm()`, `dnorm()`.
Make a case for the opposite.
## Learning more
### Conditional execution {#sec-conditional-execution}
An `if` statement allows you to conditionally execute code.
It looks like this:
```{r}
#| eval: false
if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}
```
To get help on `if` you need to surround it in backticks: `` ?`if` ``.
The help isn't particularly helpful if you're not already an experienced programmer, but at least you know how to get to it!
Here's a simple function that uses an `if` statement.
The goal of this function is to return a logical vector describing whether or not each element of a vector is named.
```{r}
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))
  } else {
    !is.na(nms) & nms != ""
  }
}
```
You can use `||` (or) and `&&` (and) to combine multiple logical expressions.
These operators are "short-circuiting": as soon as `||` sees the first `TRUE` it returns `TRUE` without computing anything else.
As soon as `&&` sees the first `FALSE` it returns `FALSE`.
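Short-circuiting is what makes it safe to guard a check that would otherwise misbehave.
A minimal sketch, using a hypothetical `x` that happens to be `NULL`:

```{r}
x <- NULL

# `&&` stops at the first FALSE, so `x > 0` is never evaluated
res_and <- !is.null(x) && x > 0
res_and

# `||` stops at the first TRUE, so `x > 0` is never evaluated here either
res_or <- is.null(x) || x > 0
res_or
```

The first expression is `FALSE` and the second is `TRUE`; in both cases the right-hand side is skipped.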
This function takes advantage of the standard return rule: a function returns the last value that it computed.
Here that is either one of the two branches of the `if` statement.
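The return rule can be seen in isolation with a smaller function.
This is only an illustrative sketch; `describe_sign()` is a hypothetical name and it assumes `x` is a single non-missing number.

```{r}
# The value of whichever branch runs is the last value computed,
# so it becomes the return value without an explicit return()
describe_sign <- function(x) {
  if (x < 0) {
    "negative"
  } else {
    "non-negative"
  }
}

describe_sign(-2)
describe_sign(3)
```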
The `condition` must evaluate to either `TRUE` or `FALSE`.
If it doesn't, you'll get an error.
```{r}
#| error: true
if (c(TRUE, FALSE)) {}

if (NA) {}
```
You should never use `|` or `&` in an `if` statement: these are vectorised operations that apply to multiple values (that's why you use them in `filter()`).
If you do have a logical vector, you can use `any()` or `all()` to collapse it to a single value.
Be careful when testing for equality.
`==` is vectorised, which means that it's easy to get more than one output.
Either check that the length is already 1, or collapse with `all()` or `any()`.
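For instance, comparing two vectors gives one result per element, which has to be collapsed before `if` can use it.
A minimal sketch with two made-up vectors:

```{r}
x <- c(1, 2, 3)
y <- c(1, 2, 4)

# One logical value per element: not usable as an `if` condition
x == y

# Collapse to a single TRUE/FALSE before branching
if (all(x == y)) {
  same <- "all equal"
} else {
  same <- "not all equal"
}
same

if (any(x == y)) {
  overlap <- "some equal"
} else {
  overlap <- "none equal"
}
overlap
```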
You can chain multiple if statements together:
```{r}
#| eval: false
if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  #
}
```
###