Iteration proofing

hadley 2016-08-19 09:15:08 -05:00
parent d00b2f74bd
commit ecc6dc7909
3 changed files with 94 additions and 65 deletions

Binary file not shown.

@ -2,10 +2,10 @@
## Introduction
In [functions], we talked about how important it is to reduce duplication in your code. Reducing code duplication has three main benefits:
In [functions], we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting. Reducing code duplication has three main benefits:
1. It's easier to see the intent of your code, because your eyes are
drawn to what's changing, not what's staying the same.
drawn to what's different, not what stays the same.
1. It's easier to respond to changes in requirements. As your needs
change, you only need to make changes in one place, rather than
@ -15,8 +15,7 @@ In [functions], we talked about how important it is to reduce duplication in you
1. You're likely to have fewer bugs because each line of code is
used in more places.
One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into independent pieces that you can reuse and easily update as code changes. __Iteration__ helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.)
One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extracting them out into independent pieces that can be easily reused and updated. Another tool for reducing duplication is __iteration__, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming. On the imperative side you have tools like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and require quite a bit of bookkeeping code that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
### Prerequisites
@ -29,7 +28,7 @@ library(purrr)
## For loops
Imagine we have this simple data frame:
Imagine we have this simple tibble:
```{r}
df <- tibble::tibble(
@ -61,7 +60,7 @@ output
Every for loop has three components:
1. The __output__: `output <- vector("integer", length(x))`.
1. The __output__: `output <- vector("double", length(x))`.
Before you start the loop, you must always allocate sufficient space
for the output. This is very important for efficiency: if you grow
the for loop at each iteration using `c()` (for example), your for loop
@ -103,13 +102,13 @@ That's all there is to the for loop! Now is a good time to practice creating som
1. Compute the mean of every column in `mtcars`.
1. Determine the type of each column in `nycflights13::flights`.
1. Compute the number of unique values in each column of `iris`.
1. Generate 10 random normals for each of $mu = -10$, $0$, $10$, and $100$.
1. Generate 10 random normals for each of $\mu = -10$, $0$, $10$, and $100$.
Think about the output, sequence, and body __before__ you start writing
the loop.
1. Eliminate the for loop in each of the following examples by taking
advantage of a built-in function that works with vectors:
advantage of an existing function that works with vectors:
```{r, eval = FALSE}
out <- ""
@ -134,13 +133,16 @@ That's all there is to the for loop! Now is a good time to practice creating som
1. Combine your function writing and for loop skills:
1. Convert the song "99 bottles of beer on the wall" to a function.
Generalise to any number of any vessel containing any liquid on
any surface.
1. Write a for loop that `prints()` the lyrics to the children's song
"Alice the camel".
1. Convert the nursery rhyme "ten in the bed" to a function. Generalise
it to any number of people in any sleeping structure.
1. Convert the song "99 bottles of beer on the wall" to a function.
Generalise to any number of any vessel containing any liquid on
any surface.
1. It's common to see for loops that don't preallocate the output and instead
increase the length of a vector at each step:
@ -152,7 +154,7 @@ That's all there is to the for loop! Now is a good time to practice creating som
output
```
How does this affect performance?
How does this affect performance? Design and execute an experiment.
## For loop variations
@ -187,14 +189,14 @@ df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
To solve this with a for loop we use the same three components:
To solve this with a for loop we again think about the three components:
1. Output: we already have the output --- it's the same as the input!
1. __Output__: we already have the output --- it's the same as the input!
1. Sequence: we can think about a data frame as a list of columns, so
1. __Sequence__: we can think about a data frame as a list of columns, so
we can iterate over each column with `seq_along(df)`.
1. Body: apply `rescale01()`.
1. __Body__: apply `rescale01()`.
This gives us:
@ -204,7 +206,7 @@ for (i in seq_along(df)) {
}
```
Typically you'll be modifying a list or data frame with this sort of loop, so remember to use `[[`, not `[`. You might have spotted that I used `[[` in all my for loops: I think it's safer to use the subsetting operator that will work in all circumstances and it makes it clear than I'm working with a single value each time.
Typically you'll be modifying a list or data frame with this sort of loop, so remember to use `[[`, not `[`. You might have spotted that I used `[[` in all my for loops: I think it's better to use `[[` even for atomic vectors because it makes it clear that I want to work with a single element.
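To make the difference concrete, here is a minimal sketch. The data frame `df_demo` is made up purely for illustration; it is not part of the chapter's running example.

```{r}
df_demo <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# `[` keeps the container: the result is still a data frame with one column
class(df_demo[1])

# `[[` pulls out the single element: here, a double vector
class(df_demo[[1]])

# Assigning with `[[` inside a loop replaces each column in turn
for (i in seq_along(df_demo)) {
  df_demo[[i]] <- df_demo[[i]] * 10
}
df_demo
```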
### Looping patterns
@ -248,7 +250,7 @@ for (i in seq_along(means)) {
str(output)
```
But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. In technical terms you get "quadratic" ($O(n^2)$) behaviour which means that a loop with three times as many elements would take nine times ($3^2$) as long to run.
But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. In technical terms you get "quadratic" ($O(n^2)$) behaviour which means that a loop with three times as many elements would take nine ($3^2$) times as long to run.
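If you want to see the effect for yourself, a rough experiment might look like the sketch below. The helper names `grow()` and `prealloc()` and the choice of `n` are my own assumptions for illustration, and the exact timings will depend on your machine and R version.

```{r}
n <- 1e4

# Version 1: grow the vector with c() at every iteration
grow <- function(n) {
  output <- double()
  for (i in seq_len(n)) {
    output <- c(output, i^2)
  }
  output
}

# Version 2: allocate the full vector up front, then fill it in
prealloc <- function(n) {
  output <- vector("double", n)
  for (i in seq_len(n)) {
    output[[i]] <- i^2
  }
  output
}

# Both approaches give the same answer ...
identical(grow(n), prealloc(n))

# ... but the timings typically differ; try larger values of n to compare
system.time(grow(n))
system.time(prealloc(n))
```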
A better solution is to save the results in a list, and then combine into a single vector after the loop is done:
@ -262,7 +264,7 @@ str(out)
str(unlist(out))
```
Here I've used `unlist()` to flatten a list of vectors into a single vector. A stricter option is to use `purrr::flatten_dbl()` - it will throw an error if the input isn't a list of doubles.
Here I've used `unlist()` to flatten a list of vectors into a single vector. A stricter option is to use `purrr::flatten_dbl()` --- it will throw an error if the input isn't a list of doubles.
This pattern occurs in other places too:
@ -280,9 +282,7 @@ Watch out for this pattern. Whenever you see it, switch to a more complex result
### Unknown sequence length
Sometimes you don't even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can't do that sort of iteration with the for loop. Instead, you can use a while loop.
A while loop is simpler than for loop because it only has two components, a condition and a body:
Sometimes you don't even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can't do that sort of iteration with the for loop. Instead, you can use a while loop. A while loop is simpler than a for loop because it only has two components, a condition and a body:
```{r, eval = FALSE}
while (condition) {
@ -290,7 +290,7 @@ while (condition) {
}
```
A while loop is more general than a for loop, because you can rewrite any for loop as a while loop, but you can't rewrite every while loop as a for loop:
A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can't rewrite every while loop as a for loop:
```{r, eval = FALSE}
for (i in seq_along(x)) {
@ -299,7 +299,7 @@ for (i in seq_along(x)) {
# Equivalent to
i <- 1
while (i < length(x)) {
while (i <= length(x)) {
# body
i <- i + 1
}
@ -324,7 +324,7 @@ while (nheads < 3) {
flips
```
I mention while loops only briefly, because I hardly ever use them. They're most often used for simulation, which is outside the scope of this book. However, it is good to know they exist, if you encounter a problem where the number of iterations is not known in advance.
I mention while loops only briefly, because I hardly ever use them. They're most often used for simulation, which is outside the scope of this book. However, it is good to know they exist so that you're prepared for problems where the number of iterations is not known in advance.
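When translating a for loop into a while loop, the condition is the easiest part to get wrong. The sketch below uses a made-up vector `x` and illustrative output names (`for_seen`, `while_seen`) to check that the two forms visit the same elements.

```{r}
x <- c("a", "b", "c")

# for loop: visits every index of x
for_seen <- vector("character", length(x))
for (i in seq_along(x)) {
  for_seen[[i]] <- x[[i]]
}

# Equivalent while loop: the condition needs `<=`;
# with `<` the last element of x would never be visited
while_seen <- vector("character", length(x))
i <- 1
while (i <= length(x)) {
  while_seen[[i]] <- x[[i]]
  i <- i + 1
}

identical(for_seen, while_seen)
```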
### Exercises
@ -334,6 +334,10 @@ I mention while loops only briefly, because I hardly ever use them. They're most
want to read each one with `read_csv()`. Write the for loop that will
load them into a single data frame.
1. What happens if you use `for (nm in names(x))` and `x` has no names?
What if only some of the elements are named? What if the names are
not unique?
1. Write a function that prints the mean of each numeric column in a data
frame, along with its name. For example, `show_mean(iris)` would print:
@ -380,7 +384,7 @@ df <- tibble::tibble(
Imagine you want to compute the mean of every column. You could do that with a for loop:
```{r}
output <- numeric(length(df))
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[[i]] <- mean(df[[i]])
}
@ -391,7 +395,7 @@ You realise that you're going to want to compute the means of every column prett
```{r}
col_mean <- function(df) {
output <- numeric(length(df))
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[i] <- mean(df[[i]])
}
@ -403,14 +407,14 @@ But then you think it'd also be helpful to be able to compute the median, and th
```{r}
col_median <- function(df) {
output <- numeric(length(df))
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[i] <- median(df[[i]])
}
output
}
col_sd <- function(df) {
output <- numeric(length(df))
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[i] <- sd(df[[i]])
}
@ -436,11 +440,11 @@ f <- function(x, i) abs(x - mean(x)) ^ i
You've reduced the chance of bugs (because you now have 1/3 of the original code), and made it easy to generalise to new situations.
We can do exactly the same thing with `col_mean()`, `col_median()` and `col_sd()`. We can add an argument that supplies the function to apply to each column:
We can do exactly the same thing with `col_mean()`, `col_median()` and `col_sd()` by adding an argument that supplies the function to apply to each column:
```{r}
col_summary <- function(df, fun) {
out <- vector("numeric", length(df))
out <- vector("double", length(df))
for (i in seq_along(df)) {
out[i] <- fun(df[[i]])
}
@ -450,7 +454,7 @@ col_summary(df, median)
col_summary(df, mean)
```
The idea of passing a function to another function is extremely powerful idea, and it's one of the reasons that R is called a functional programming language. It might take you a while to wrap your head around the idea, but it's worth the investment. In the rest of the chapter, you'll learn about and use the __purrr__ package, which provides functions that eliminate the need for many common for loops. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc) solve a similar problem, but purrr is more consistent and thus is easier to learn.
The idea of passing a function to another function is an extremely powerful idea, and it's one of the behaviours that makes R a functional programming language. It might take you a while to wrap your head around the idea, but it's worth the investment. In the rest of the chapter, you'll learn about and use the __purrr__ package, which provides functions that eliminate the need for many common for loops. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc.) solve a similar problem, but purrr is more consistent and thus is easier to learn.
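As a small preview, the sketch below compares the hand-written `col_summary()` with purrr's `map_dbl()`. The data frame `df_demo` is made up for illustration, and the snippet assumes purrr is installed.

```{r}
library(purrr)

# A small made-up data frame for illustration
df_demo <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# The hand-written version: apply `fun` to each column
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

col_summary(df_demo, mean)

# map_dbl() follows the same pattern, and also keeps the column names
map_dbl(df_demo, mean)
```

The loop bookkeeping (allocating the output, iterating over the columns) lives inside `map_dbl()`, so the call only has to say what varies: the input and the function.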
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
@ -565,7 +569,7 @@ models %>%
map_dbl("r.squared")
```
You can also use a numeric vector to select elements by position:
You can also use an integer to select elements by position:
```{r}
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
@ -576,9 +580,9 @@ x %>% map_dbl(2)
If you're familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:
* `lapply()` is basically identical to `map()`. There's no advantage to using
`map()` over `lapply()` except that it's consistent with all the other
functions in purrr, and you can use the shortcuts for `.f`.
* `lapply()` is basically identical to `map()`, except that `map()` is
consistent with all the other functions in purrr, and you can use the
shortcuts for `.f`.
* Base `sapply()` is a wrapper around `lapply()` that automatically
simplifies the output. This is useful for interactive work but is
@ -610,10 +614,17 @@ If you're familiar with the apply family of functions in base R, you might have
functions is that it can also produce matrices --- the map functions only
ever produce vectors.
I focus on purrr functions here because they have more consistent names and arguments, helpful shortcuts, and in a future release will provide easy parallelism and progress bars.
I focus on purrr functions here because they have more consistent names and arguments, helpful shortcuts, and in the future will provide easy parallelism and progress bars.
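To illustrate the point about shortcuts, here is a minimal sketch comparing `lapply()` with `map()`. The nested list `x` is made up for illustration, and the snippet assumes purrr is installed.

```{r}
library(purrr)

x <- list(list(a = 1, b = "x"), list(a = 2, b = "y"))

# Base R: an anonymous function is needed to pull out element "a"
base_out <- lapply(x, function(el) el[["a"]])

# purrr: the string shortcut for `.f` does the same extraction
purrr_out <- map(x, "a")

identical(base_out, purrr_out)

# The typed variant returns a double vector instead of a list
map_dbl(x, "a")
```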
### Exercises
1. Write code that uses one of the map functions to:
1. Compute the mean of every column in `mtcars`.
1. Determine the type of each column in `nycflights13::flights`.
1. Compute the number of unique values in each column of `iris`.
1. Generate 10 random normals for each of $\mu = -10$, $0$, $10$, and $100$.
1. How can you create a single vector that for each column in a data frame
indicates whether or not it's a factor?
@ -690,16 +701,6 @@ Purrr provides two other useful adverbs:
x %>% map(quietly(log)) %>% str()
```
### Exercises
1. Given the following list, extract all the error messages with the smallest
amount of code possible:
```{r}
x <- list(1, 10, "a")
y <- x %>% map(safely(log))
```
## Mapping over multiple arguments
So far we've mapped along a single input. But often you have multiple related inputs that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
@ -732,7 +733,7 @@ map2(mu, sigma, rnorm, n = 5) %>% str()
knitr::include_graphics("diagrams/lists-map2.png")
```
Note that the arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.
Note that the arguments that vary for each call come _before_ the function; arguments that are the same for every call come _after_.
Like `map()`, `map2()` is just a wrapper around a for loop:
@ -780,10 +781,11 @@ knitr::include_graphics("diagrams/lists-pmap-named.png")
Since the arguments are all the same length, it makes sense to store them in a data frame:
```{r}
params <- tibble::tibble(
mean = mu,
sd = sigma,
n = n
params <- tibble::tribble(
~mean, ~sd, ~n,
5, 1, 1,
10, 5, 3,
-3, 10, 5
)
params %>%
pmap(rnorm)
@ -810,13 +812,13 @@ To handle this case, you can use `invoke_map()`:
invoke_map(f, param, n = 5) %>% str()
```
```{r, echo = FALSE}
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/lists-invoke.png")
```
The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
You can use `tibble::tribble()` to make creating these matching pairs a little easier:
And again, you can use `tibble::tribble()` to make creating these matching pairs a little easier:
```{r, eval = FALSE}
sim <- tribble(
@ -856,17 +858,22 @@ pwalk(list(paths, plots), ggsave, path = tempdir())
## Other patterns of for loops
Purrr provides a number of other functions that abstract over other types of for loops. You'll use them less frequently than the map functions, but they're useful to have in your back pocket. The goal here is to briefly illustrate each function, so hopefully it will come to mind if you see a similar problem in the future. Then you can go look up the documentation for more details.
Purrr provides a number of other functions that abstract over other types of for loops. You'll use them less frequently than the map functions, but they're useful to know about. The goal here is to briefly illustrate each function, so hopefully it will come to mind if you see a similar problem in the future. Then you can go look up the documentation for more details.
### Predicate functions
A number of functions work with __predicates__ functions that return either a single `TRUE` or `FALSE`.
A number of functions work with __predicate__ functions that return either a single `TRUE` or `FALSE`.
`keep()` and `discard()` keep elements of the input where the predicate is `TRUE` or `FALSE` respectively:
```{r}
iris %>% keep(is.factor) %>% str()
iris %>% discard(is.factor) %>% str()
iris %>%
keep(is.factor) %>%
str()
iris %>%
discard(is.factor) %>%
str()
```
`some()` and `every()` determine if the predicate is true for any or for all of
@ -874,8 +881,12 @@ the elements.
```{r}
x <- list(1:5, letters, list(10))
x %>% some(is_character)
x %>% every(is_vector)
x %>%
some(is_character)
x %>%
every(is_vector)
```
`detect()` finds the first element where the predicate is true; `detect_index()` returns its position.
@ -884,20 +895,26 @@ x %>% every(is_vector)
x <- sample(10)
x
x %>% detect(~ . > 5)
x %>% detect_index(~ . > 5)
x %>%
detect(~ . > 5)
x %>%
detect_index(~ . > 5)
```
`head_while()` and `tail_while()` take elements from the start or end of a vector while a predicate is true:
```{r}
head_while(x, ~ . > 5)
tail_while(x, ~ . > 5)
x %>%
head_while(~ . > 5)
x %>%
tail_while(~ . > 5)
```
### Reduce and accumulate
Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces two inputs to a single input. This useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together
Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This is useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:
```{r}
dfs <- list(
@ -909,6 +926,18 @@ dfs <- list(
dfs %>% reduce(dplyr::full_join)
```
Or maybe you have a list of vectors, and want to find the intersection:
```{r}
vs <- list(
c(1, 3, 5, 6, 10),
c(1, 2, 3, 7, 8, 10),
c(1, 2, 3, 4, 8, 9, 10)
)
vs %>% reduce(intersect)
```
The reduce function takes a "binary" function (i.e. a function with two primary inputs), and applies it repeatedly to a list until there is only a single element left.
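One way to picture this is to write out the nested calls by hand. The sketch below assumes purrr is installed; the name `by_hand` is my own, for illustration only.

```{r}
library(purrr)

vs <- list(
  c(1, 3, 5, 6, 10),
  c(1, 2, 3, 7, 8, 10),
  c(1, 2, 3, 4, 8, 9, 10)
)

# Combine the first two elements, then combine that result with the third
by_hand <- intersect(intersect(vs[[1]], vs[[2]]), vs[[3]])

# reduce() performs the same sequence of pairwise calls for a list of any length
identical(reduce(vs, intersect), by_hand)
```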
Accumulate is similar but it keeps all the interim results. You could use it to implement a cumulative sum: