1. It's easier to see the intent of your code, because your eyes are
drawn to what is different, not what is the same.
1. It's easier to respond to changes in requirements. As your needs
change, you only need to make changes in one place, rather than
remembering to change every place that you copied-and-pasted the
code.
1. You're likely to have fewer bugs because each line of code is
used in more places.
One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them into independent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.)
In this chapter you'll learn about two important iteration tools: for loops and functional programming. For loops are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose and include a fair amount of book-keeping code that is duplicated in every loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
Some people will tell you to avoid for loops because they are slow. They're wrong! (Well, at least they're rather out of date: for loops haven't been slow for many years.) The chief benefit of using FP functions like `lapply()` or `purrr::map()` is that they are more expressive and make code both easier to write and easier to read.
In later chapters you'll learn how to apply these iterating ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you'll learn in this chapter will be invaluable.
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
1. How can you solve the problem for a single element of the list? Once
you've solved that problem, purrr takes care of generalising your
solution to every element in the list.
1. If you're solving a complex problem, how can you break it down into
bite-sized pieces that allow you to advance one small step towards a
solution? With purrr, you get lots of small pieces that you can
compose together with the pipe.
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
The first iteration will run `output[[1]] <- median(df[[1]])`,
the second will run `output[[2]] <- median(df[[2]])`, and so on.
That's all there is to the for loop! Now is a good time to practice creating some basic (and not so basic) for loops using the exercises below. Then we'll move on to some variations of the for loop that help you solve other problems that will crop up in practice.
### Exercises
1. Write for loops to:
1. Compute the mean of every column in `mtcars`.
1. Determine the type of each column in `nycflights13::flights`.
1. Compute the number of unique values in each column of `iris`.
1. Generate 10 random normals for each of $\mu = -10$, $0$, $10$, and $100$.
Think about output, sequence, and body, __before__ you start writing
the loop.
1. Eliminate the for loop in each of the following examples by taking
advantage of a built-in function that works with vectors:
1. Convert the song "99 bottles of beer on the wall" to a function.
Generalise to any number of any vessel containing any liquid on
any surface.
1. Convert the nursery rhyme "ten in the bed" to a function. Generalise
it to any number of people in any sleeping structure.
1. It's common to see for loops that don't preallocate the output and instead
increase the length of a vector at each step:
```{r, eval = FALSE}
output <- vector("integer", 0)
for (i in seq_along(x)) {
output <- c(output, lengths(x[[i]]))
}
output
```
How does this affect performance?
## For loop variations
Once you have the basic for loop under your belt, there are some variations on a theme that you should be aware of. These variations are important regardless of how you do iteration, so don't forget about them once you've mastered the FP techniques you'll learn about in the next section.
1. Modifying an existing object, instead of creating a new object.
1. Looping over names or values, instead of indices.
1. Handling outputs of unknown length.
1. Handling sequences of unknown length.
### Modifying an existing object
Sometimes you want to use a for loop to modify an existing object. For example, remember our challenge from [functions]. We wanted to rescale every column in a data frame:
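A minimal sketch of such a loop follows. It assumes a small example data frame `df` of random normals and a helper `rescale01()` that maps a vector onto the range 0 to 1; both names are illustrative choices made here.

```{r}
# Illustrative data: four columns of random normals
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

# Assumed helper: rescale a numeric vector so it lies between 0 and 1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# The output is the existing data frame, so nothing needs preallocating
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}
```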
Typically you'll be modifying a list or data frame with this sort of loop, so remember to use `[[`, not `[`. You might have spotted that I used `[[` in all my for loops: I think it's safer to use the subsetting operator that will work in all circumstances (and it makes it clear that I'm working with a single value each time).
There are three basic ways to loop over a vector. So far I've shown you the most general: looping over the numeric indices with `for (i in seq_along(xs))`, and extracting the value with `xs[[i]]`. There are two other forms:
1. Loop over the elements: `for (x in xs)`. This is most useful if you only
care about side-effects, like plotting or saving a file, because it's
Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:
But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. In technical terms you get "quadratic" ($O(n^2)$) behaviour which means that a loop with three times as many elements would take nine times ($3^2$) as long to run.
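One way to avoid that copying, sketched here under the assumption that each iteration produces a vector of unpredictable length, is to store each result in a preallocated list and combine the pieces once at the end. The names `means` and `out` are illustrative.

```{r}
means <- c(0, 1, 2)

# Preallocate a list with one slot per iteration
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)               # random length between 1 and 100
  out[[i]] <- rnorm(n, means[[i]])
}

# Flatten the list of vectors into a single vector in one step
output <- unlist(out)
str(output)
```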
Sometimes you don't even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can't do that sort of iteration with the for loop. Instead, you can use a while loop.
A while loop is more general than a for loop, because you can rewrite any for loop as a while loop, but you can't rewrite every while loop as a for loop:
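As a sketch, here is a for loop over an illustrative vector `x` rewritten as a while loop, followed by a loop that keeps flipping a coin until three heads appear in a row. The helper `flip()` is a name invented for this example.

```{r}
x <- c(10, 20, 30)

# A for loop ...
total_for <- 0
for (i in seq_along(x)) {
  total_for <- total_for + x[[i]]
}

# ... and the equivalent while loop, with manual index book-keeping
total_while <- 0
i <- 1
while (i <= length(x)) {
  total_while <- total_while + x[[i]]
  i <- i + 1
}

# A loop whose length is not known in advance:
# flip until we see three heads in a row
flip <- function() sample(c("T", "H"), 1)

flips <- 0
nheads <- 0
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
```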
I mention while loops only briefly, because I hardly ever use them. They're most often used for simulation, which is outside the scope of this book. However, it is good to know they exist, in case you encounter a problem where the number of iterations is not known in advance.
For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it's possible to wrap up for loops in a function, and call that function instead of using the for loop directly.
But then you think it'd also be helpful to be able to compute the median, and the standard deviation, so you copy and paste your `col_mean()` function and replace the `mean()` with `median()` and `sd()`:
Uh oh! You've copied-and-pasted this code twice, so it's time to think about how to generalise it. Notice that most of this code is for-loop boilerplate and it's hard to see the one thing (`mean()`, `median()`, `sd()`) that is different between the functions.
You've reduced the chance of bugs (because you now have 1/3 of the original code), and made it easy to generalise to new situations.
We can do exactly the same thing with `col_mean()`, `col_median()` and `col_sd()`. We can add an argument that supplies the function to apply to each column:
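A sketch of that generalisation, assuming `df` is a data frame of numeric columns and using `col_summary()` as an illustrative name for the combined function:

```{r}
# Apply `fun` to every column of `df`, returning a double vector
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

# Small illustrative data frame
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

col_summary(df, mean)
col_summary(df, median)
col_summary(df, sd)
```

The for-loop boilerplate now lives in one place, and the only thing that varies between calls is the function passed as `fun`.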
The idea of passing a function to another function is an extremely powerful idea, and it's one of the reasons that R is called a functional programming language. It might take you a while to wrap your head around it, but it's worth the investment. In the rest of the chapter, you'll learn about and use the __purrr__ package which provides a general set of functions that eliminate the need for many common for loops. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc.) solve a similar problem, but purrr is more consistent and thus is easier to learn.
The pattern of looping over a vector and doing something to each element is so common that the purrr package provides a family of functions to do it for you. Each function always returns the same type of output so there are six variations based on what sort of result you want:
it's called exclusively for its side effects, so it's described in more
detail later in [walk](#walk).
Each function takes a vector as input, applies a function to each piece, and then returns a new vector that's the same length (and has the same names) as the input. The type of the vector is determined by the suffix to the map function. Usually you want to use the most specific available, using `map()` only as a fallback when there is no specialised equivalent available.
Once you master these functions, you'll find it takes much less time to solve iteration problems. But never feel bad about using a for loop instead of a function. The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work. The important thing is that you solve the problem that you're working on, not write the most concise and elegant code.
We can use these functions to perform the same computations as the last for loop. Those summary functions returned doubles, so we need to use `map_dbl()`:
Compared to using a for loop, the focus is on the operation being performed (i.e. `mean()`, `median()`, `sd()`), not the book-keeping required to loop over every element and store the output. This is even more apparent if we use the pipe:
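A minimal sketch of both styles, assuming a small illustrative data frame `df` with two numeric columns:

```{r}
library(purrr)

df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# Direct calls: one double per column, named after the column
map_dbl(df, mean)
map_dbl(df, median)

# The same kind of computation written with the pipe
df %>% map_dbl(mean)
df %>% map_dbl(sd)
```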
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits the `mtcars` dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:
```{r}
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
```
The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut: a one-sided formula.
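As a sketch, the model-fitting pipeline could be rewritten with a formula like this, where `.` inside the formula stands for the current list element:

```{r}
library(purrr)

# `~ lm(...)` is shorthand for `function(.) lm(...)`
models <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .))
```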
When you're looking at many models, you might want to extract a summary statistic like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous functions:
```{r}
models %>%
map(summary) %>%
map_dbl(~.$r.squared)
```
But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.
```{r}
models %>%
map(summary) %>%
map_dbl("r.squared")
```
You can also use a numeric vector to select elements by position:
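For instance, with an illustrative nested list `x`, passing `2` extracts the second element of each sub-list:

```{r}
library(purrr)

x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))

# Extract the second element of each sub-list
x %>% map_dbl(2)
```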
1. Rewrite `map(x, function(df) lm(mpg ~ wt, data = df))` to eliminate the
anonymous function.
## Dealing with failure
When you do many operations on a list, sometimes one will fail. When this happens, you'll get an error message, and no output. This is annoying: why does one failure prevent you from accessing all the other successes? How do you ensure that one bad apple doesn't ruin the whole barrel?
In this section you'll learn how to deal with this situation with a new function: `safely()`. `safely()` is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:
1. `result` is the original result. If there was an error, this will be `NULL`.
1. `error` is an error object. If the operation was successful this will be
`NULL`.
(You might be familiar with the `try()` function in base R. It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)
Let's illustrate this with a simple example: `log()`:
```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```
When the function succeeds the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.
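To apply this to several inputs at once, one possible sketch maps the wrapped function over a list and then uses `transpose()` to turn the list of result/error pairs into a pair of lists. The names `x` and `y` are illustrative, chosen to match the code that follows.

```{r}
library(purrr)

x <- list(1, 10, "a")

# Each element of `y` is a list with `result` and `error`
y <- x %>% map(safely(log))

# Rearrange into one list of results and one list of errors
y <- y %>% transpose()
str(y)
```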
It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error or work with the values of `y` that are ok:
```{r}
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
y$result[is_ok] %>% flatten_dbl()
```
Purrr provides two other useful adverbs:
* Like `safely()`, `possibly()` always succeeds. It's simpler than `safely()`,
because you give it a default value to return when there is an error.
```{r}
x <- list(1, 10, "a")
x %>% map_dbl(possibly(log, NA_real_))
```
* `quietly()` performs a similar role to `safely()`, but instead of capturing
errors, it captures printed output, messages, and warnings:
```{r}
x <- list(1, -1)
x %>% map(quietly(log)) %>% str()
```
### Exercises
1. Challenge: read all the csv files in this directory. Which ones failed
So far we've mapped along a single list. But often you have multiple related lists that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
```{r}
mu <- list(5, 10, -3)
mu %>% map(rnorm, n = 10)
```
What if you also want to vary the standard deviation? You need to iterate along a vector of means and a vector of standard deviations in parallel. That's a job for `map2()` which works with two parallel sets of inputs:
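A minimal sketch of that call, assuming an illustrative list of standard deviations `sigma` alongside the means in `mu`:

```{r}
library(purrr)

mu <- list(5, 10, -3)
sigma <- list(1, 5, 10)

# Calls rnorm(mu[[i]], sigma[[i]], n = 10) for each i, so the
# unnamed arguments are matched to `mean` and `sd`
sims <- map2(mu, sigma, rnorm, n = 10)
str(sims)
```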
The arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.
Like `map()`, `map2()` is just a wrapper around a for loop:
```{r}
map2 <- function(x, y, f, ...) {
out <- vector("list", length(x))
for (i in seq_along(x)) {
out[[i]] <- f(x[[i]], y[[i]], ...)
}
out
}
```
You could also imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly. Instead, purrr provides `pmap()` which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:
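A sketch of that idea follows, with `params` as an illustrative name for the list of arguments. Naming the elements of the list matches them to the arguments of `rnorm()` by name rather than by position.

```{r}
library(purrr)

# One element per rnorm() argument, each holding the values to iterate over
params <- list(
  n    = list(1, 3, 5),
  mean = list(5, 10, -3),
  sd   = list(1, 5, 10)
)

# Calls rnorm(n = 1, mean = 5, sd = 1), rnorm(n = 3, mean = 10, sd = 5), ...
samples <- pmap(params, rnorm)
str(samples)
```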
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea when we explore the intersection of dplyr, purrr, and model fitting.
### Invoking different functions
There's one more step up in complexity - as well as varying the arguments to the function you might also vary the function itself:
The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
You can use `dplyr::frame_data()` to make creating these matching pairs a little easier:
```{r, eval = FALSE}
# Needs dev version of dplyr
sim <- dplyr::frame_data(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>% dplyr::mutate(
samples = invoke_map(f, params, n = 10)
)
```
## Walk {#walk}
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here's a very simple example:
```{r}
x <- list(1, "a", 3)
x %>%
walk(print)
```
`walk()` is generally not that useful compared to `walk2()` or `pwalk()`. For example, if you had a list of plots and a vector of file names, you could use `pwalk()` to save each file to the corresponding location on disk:
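The sketch below illustrates the idea with base graphics only. Base graphics has no plot objects, so it substitutes a list of data frames to be plotted for the list of plots; `save_scatter()` is a hypothetical helper written for this example, and the files go to a temporary directory.

```{r}
library(purrr)

# One data frame per cylinder count, and one output path per data frame
pieces <- split(mtcars, mtcars$cyl)
paths <- file.path(tempdir(), paste0("cyl-", names(pieces), ".pdf"))

# Hypothetical helper: draw a scatterplot of `df` into the PDF at `path`
save_scatter <- function(path, df) {
  pdf(path)
  on.exit(dev.off())
  plot(df$wt, df$mpg, xlab = "wt", ylab = "mpg")
}

# pwalk() calls save_scatter(paths[[i]], pieces[[i]]) for each i
pwalk(list(paths, pieces), save_scatter)
file.exists(paths)
```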
Imagine we want to summarise each numeric column of a data frame. We could do it in two steps:
1. Find all numeric columns.
1. Summarise each column.
In code, that would look like:
```{r}
col_sum <- function(df, f) {
  is_num <- df %>% map_lgl(is.numeric)
  df[is_num] %>% map_dbl(f)
}
```
`is.numeric()` is a __predicate__: a function that returns either `TRUE` or `FALSE`. There are a number of purrr functions designed to work specifically with predicates:
* `keep()` and `discard()` keep/discard list elements where the predicate is
true.
* `head_while()` and `tail_while()` keep the first/last elements of a list until
you get the first element where the predicate is false.
* `some()` and `every()` determine if the predicate is true for any or all of
the elements.
* `detect()` and `detect_index()` find the first element where the predicate
is true, returning the element itself or its position respectively.
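A few quick illustrations of these helpers, using a made-up list `x`, a made-up vector `y`, and an illustrative predicate `is_odd()`:

```{r}
library(purrr)

x <- list(1, "a", 3, "b", 5)

x %>% keep(is.numeric) %>% str()      # the three numbers
x %>% discard(is.numeric) %>% str()   # the two strings

x %>% some(is.character)     # TRUE: at least one string
x %>% every(is.character)    # FALSE: not every element is a string

y <- c(1, 3, 4, 7, 9)
is_odd <- function(v) v %% 2 == 1     # illustrative predicate

y %>% head_while(is_odd)         # 1 3
y %>% detect(~ .x > 3)           # 4
y %>% detect_index(~ .x > 3)     # 3
```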
We could use `keep()` to simplify the summary function to:
```{r}
col_sum <- function(df, f) {
df %>%
keep(is.numeric) %>%
map_dbl(f)
}
```
I like this formulation because you can easily read the sequence of steps.
### Exercises
1. A possible base R equivalent of `col_sum()` is:
```{r}
col_sum3 <- function(df, f) {
is_num <- sapply(df, is.numeric)
df_num <- df[, is_num]
sapply(df_num, f)
}
```
But it has a number of bugs as illustrated with the following inputs:
```{r, eval = FALSE}
df <- data.frame(z = c("a", "b", "c"), x = 1:3, y = 3:1)
# OK
col_sum3(df, mean)
# Has problems: don't always return numeric vector