Tweaks to iteration

hadley 2016-08-15 08:18:51 -05:00
parent d82f0fd314
commit 9d7851318d
1 changed files with 66 additions and 63 deletions


@ -5,7 +5,7 @@
In [functions], we talked about how important it is to reduce duplication in your code. Reducing code duplication has three main benefits:
1. It's easier to see the intent of your code, because your eyes are
drawn to what is different, not what is the same.
drawn to what's changing, not what's staying the same.
1. It's easier to respond to changes in requirements. As your needs
change, you only need to make changes in one place, rather than
@ -15,26 +15,9 @@ In [functions], we talked about how important it is to reduce duplication in you
1. You're likely to have fewer bugs because each line of code is
used in more places.
One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into independent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.)
One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into independent pieces that you can reuse and easily update as code changes. __Iteration__ helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.)
In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming, and the machinery each provides. On the imperative side you have things like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and include quite a bit of bookkeeping code, that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
Some people will tell you to avoid for loops because they are slow. They're wrong! (Well, at least they're rather out of date, as for loops haven't been slow for many years.) The chief benefit of using FP functions like `lapply()` or `purrr::map()` is that they are more expressive and make code both easier to write and easier to read.
In later chapters you'll learn how to apply these iterating ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you'll learn in this chapter will be invaluable.
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
1. How can you solve the problem for a single element of the list? Once
you've solved that problem, purrr takes care of generalising your
solution to every element in the list.
1. If you're solving a complex problem, how can you break it down into
bite-sized pieces that allow you to advance one small step towards a
solution? With purrr, you get lots of small pieces that you can
compose together with the pipe.
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming. On the imperative side you have tools like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and require quite a bit of bookkeeping code that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
### Prerequisites
@ -117,12 +100,12 @@ That's all there is to the for loop! Now is a good time to practice creating som
1. Write for loops to:
1. Compute the mean of every column in the `mtcars`.
1. Compute the mean of every column in `mtcars`.
1. Determine the type of each column in `nycflights13::flights`.
1. Compute the number of unique values in each column of `iris`.
1. Generate 10 random normals for each of $\mu = -10$, $0$, $10$, and $100$.
Think about output, sequence, and body, __before__ you start writing
Think about the output, sequence, and body __before__ you start writing
the loop.
1. Eliminate the for loop in each of the following examples by taking
@ -149,7 +132,7 @@ That's all there is to the for loop! Now is a good time to practice creating som
}
```
1. Combine your function writing and for loop skills.
1. Combine your function writing and for loop skills:
1. Convert the song "99 bottles of beer on the wall" to a function.
Generalise to any number of any vessel containing any liquid on
@ -173,7 +156,7 @@ That's all there is to the for loop! Now is a good time to practice creating som
## For loop variations
Once you have the basic for loop under your belt, there are some variations on a theme that you should be aware of. These variations are important regardless of how you do iteration, so don't forget about them once you've mastered the FP techniques you'll learn about in the next section.
Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don't forget about them once you've mastered the FP techniques you'll learn about in the next section.
There are four variations on the basic theme of the for loop:
@ -206,7 +189,7 @@ df$d <- rescale01(df$d)
To solve this with a for loop we use the same three components:
1. Output: we already have the output - it's the same as the input!
1. Output: we already have the output --- it's the same as the input!
1. Sequence: we can think about a data frame as a list of columns, so
we can iterate over each column with `seq_along(df)`.
@ -221,7 +204,7 @@ for (i in seq_along(df)) {
}
```
Typically you'll be modifying a list or data frame with this sort of loop, so remember to use `[[`, not `[`. You might have spotted that I used `[[` in all my for loops: I think it's safer to use the subsetting operator that will work in all circumstances (and it makes it clear that I'm working with a single value each time).
Typically you'll be modifying a list or data frame with this sort of loop, so remember to use `[[`, not `[`. You might have spotted that I used `[[` in all my for loops: I think it's safer to use the subsetting operator that will work in all circumstances and it makes it clear that I'm working with a single value each time.
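As a minimal sketch of the difference between the two operators (the data frame `df` below is a hypothetical stand-in, not the one defined earlier in the chapter), `[[` extracts the column itself, while `[` returns a one-column data frame:

```r
# Hypothetical example data; any data frame behaves the same way.
df <- data.frame(a = c(1, 5, 10), b = c(2, 4, 6))

# `[[` extracts a single column as a plain vector...
single <- df[[1]]
# ...while `[` keeps the data frame wrapper around it.
wrapped <- df[1]

# Modify each column in place, reading and assigning with `[[`.
for (i in seq_along(df)) {
  df[[i]] <- df[[i]] / max(df[[i]])
}
```

After the loop, every column of `df` has been divided by its own maximum, and `df` is still a data frame with the same names.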
### Looping patterns
@ -293,7 +276,7 @@ This pattern occurs in other places too:
`dplyr::bind_rows(output)` to combine the output into a single
data frame.
Watch out for this pattern. Whenever you see it, switch to a more complex results object, and then combine in one step at the end.
Watch out for this pattern. Whenever you see it, switch to a more complex result object, and then combine in one step at the end.
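A minimal sketch of that advice, using only base R (the vector `means` and the random lengths are illustrative assumptions): each iteration writes into its own list slot, and the pieces are combined once the loop has finished.

```r
means <- c(0, 1, 2)

# Store each iteration's result in a list slot rather than growing a vector.
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}

# Combine everything in a single step at the end.
combined <- unlist(out)
```

The same shape applies to strings (collect in a character vector, then `paste()` with `collapse`) and to data frames (collect in a list, then bind the rows once).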
### Unknown sequence length
@ -341,7 +324,7 @@ while (nheads < 3) {
flips
```
I mention while loops briefly, because I hardly ever use them. They're most often used for simulation, which is outside the scope of this book. However, it is good to know they exist, if you encounter a problem where the number of iterations is not known in advance.
I mention while loops only briefly, because I hardly ever use them. They're most often used for simulation, which is outside the scope of this book. However, it is good to know they exist, if you encounter a problem where the number of iterations is not known in advance.
### Exercises
@ -363,7 +346,7 @@ I mention while loops briefly, because I hardly ever use them. They're most ofte
```
(Extra challenge: what function did I use to make sure that the numbers
lined up nicely, even though the variables had different names?)
lined up nicely, even though the variable names had different lengths?)
1. What does this code do? How does it work?
@ -469,29 +452,44 @@ col_summary(df, mean)
The idea of passing a function to another function is an extremely powerful idea, and it's one of the reasons that R is called a functional programming language. It might take you a while to wrap your head around the idea, but it's worth the investment. In the rest of the chapter, you'll learn about and use the __purrr__ package, which provides functions that eliminate the need for many common for loops. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc) solve a similar problem, but purrr is more consistent and thus is easier to learn.
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
1. How can you solve the problem for a single element of the list? Once
you've solved that problem, purrr takes care of generalising your
solution to every element in the list.
1. If you're solving a complex problem, how can you break it down into
bite-sized pieces that allow you to advance one small step towards a
solution? With purrr, you get lots of small pieces that you can
compose together with the pipe.
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
### Exercises
1. Read the documentation for `apply()`. In the 2d case, what two for loops
does it generalise?
1. Adapt `col_summary()` so that it only applies to numeric columns.
You might want to start with an `is_numeric()` function that returns
a logical vector that has a TRUE corresponding to each numeric column.
1. Adapt `col_summary()` so that it only applies to numeric columns.
You might want to start with an `is_numeric()` function that returns
a logical vector that has a TRUE corresponding to each numeric column.
## The map functions
The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. There is one function for each type of vector:
The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. There is one function for each type of output:
* `map()` returns a list.
* `map_lgl()` returns a logical vector.
* `map_int()` returns an integer vector.
* `map_dbl()` returns a double vector.
* `map_chr()` returns a character vector.
* `map()` makes a list.
* `map_lgl()` makes a logical vector.
* `map_int()` makes an integer vector.
* `map_dbl()` makes a double vector.
* `map_chr()` makes a character vector.
Each function takes a vector as input, applies a function to each piece, and then returns a new vector that's the same length (and has the same names) as the input. The type of the vector is determined by the suffix to the map function.
Once you master these functions, you'll find it takes much less time to solve iteration problems. But you should never feel bad about using a for loop instead of a map function. The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work. The important thing is that you solve the problem that you're working on, not write the most concise and elegant code (although that's definitely something you want to strive towards!).
Some people will tell you to avoid for loops because they are slow. They're wrong! (Well, at least they're rather out of date, as for loops haven't been slow for many years.) The chief benefit of using functions like `map()` is not speed, but clarity: they make your code easier to write and to read.
We can use these functions to perform the same computations as the last for loop. Those summary functions returned doubles, so we need to use `map_dbl()`:
```{r}
@ -609,7 +607,7 @@ If you're familiar with the apply family of functions in base R, you might have
`vapply()` is that it's a lot of typing:
`vapply(df, is.numeric, logical(1))` is equivalent to
`map_lgl(df, is.numeric)`. One advantage of `vapply()` over purrr's map
functions is that it can also produce matrices - the map functions only
functions is that it can also produce matrices --- the map functions only
ever produce vectors.
I focus on purrr functions here because they have more consistent names and arguments, helpful shortcuts, and in a future release will provide easy parallelism and progress bars.
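The point about `vapply()` and matrices can be sketched in base R alone. The data frame below is a hypothetical example, not one from the chapter: when the template passed as the third argument has length greater than one, `vapply()` assembles the per-element results into a matrix.

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c("x", "y", "z"))

# One logical per column: the result is a named logical vector.
is_num <- vapply(df, is.numeric, logical(1))

# Two values per column: vapply() assembles them into a matrix,
# with one column per input element.
ranges <- vapply(df[is_num], range, numeric(2))
```

Here `ranges` has two rows (minimum and maximum) and one column for each numeric column of `df`.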
@ -702,28 +700,27 @@ Purrr provides two other useful adverbs:
y <- x %>% map(safely(log))
```
1.
## Mapping over multiple arguments
So far we've mapped along a single list. But often you have multiple related lists that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions.
For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
So far we've mapped along a single input. But often you have multiple related inputs that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
```{r}
mu <- list(5, 10, -3)
mu %>% map(rnorm, n = 5) %>% str()
mu %>%
map(rnorm, n = 5) %>%
str()
```
What if you also want to vary the standard deviation? One way to do that would be to iterate over the indices and index into vectors of means and sds:
```{r}
sigma <- list(1, 5, 10)
seq_along(mu) %>% map(~rnorm(5, mu[[.]], sigma[[.]])) %>% str()
seq_along(mu) %>%
map(~rnorm(5, mu[[.]], sigma[[.]])) %>%
str()
```
However, that somewhat obfuscates the intent of the code. Instead we could use `map2()` which works with iterates over two vectors in parallel:
But that obfuscates the intent of the code. Instead we could use `map2()` which iterates over two vectors in parallel:
```{r}
map2(mu, sigma, rnorm, n = 5) %>% str()
@ -754,7 +751,9 @@ You could also imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that woul
```{r}
n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>% pmap(rnorm) %>% str()
args1 %>%
pmap(rnorm) %>%
str()
```
That looks like:
@ -767,7 +766,9 @@ If you don't name the elements of the list, `pmap()` will use positional matching wh
```{r, eval = FALSE}
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>% pmap(rnorm) %>% str()
args2 %>%
pmap(rnorm) %>%
str()
```
That generates longer, but safer, calls:
@ -779,13 +780,16 @@ knitr::include_graphics("diagrams/lists-pmap-named.png")
Since the arguments are all the same length, it makes sense to store them in a data frame:
```{r}
params <- tibble::tibble(mean = mu, sd = sigma, n = n)
params$result <- params %>% pmap(rnorm)
params
params <- tibble::tibble(
mean = mu,
sd = sigma,
n = n
)
params %>%
pmap(rnorm)
```
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns.
We'll come back to this idea in [Handling hierarchy], and again when we explore the intersection of dplyr, purrr, and model fitting.
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea in [Handling hierarchy], and again when we explore the intersection of dplyr, purrr, and model fitting.
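To illustrate why a data frame is a safe container for parallel arguments, here is a rough base R sketch. It swaps in `Map()` with `do.call()` as a stand-in for `pmap()`, and uses plain numeric columns rather than the list columns above; both are assumptions made to keep the example free of package dependencies.

```r
params <- data.frame(mean = c(5, 10, -3), sd = c(1, 5, 10), n = c(1, 3, 5))

# A data frame is a list of equal-length, named columns, so it can be passed
# to Map() as a set of named arguments: each row becomes one call to rnorm().
draws <- do.call(Map, c(f = rnorm, params))

# Columns of mismatched length are rejected when the data frame is built,
# instead of being silently recycled later on.
bad <- tryCatch(
  data.frame(mean = c(5, 10, -3), sd = c(1, 5)),
  error = function(e) "error"
)
```

Each element of `draws` has the length given by the matching `n`, and `bad` records that the mismatched columns raised an error.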
### Invoking different functions
@ -812,23 +816,22 @@ knitr::include_graphics("diagrams/lists-invoke.png")
The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
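The way those three kinds of argument fit together can be sketched in base R. This is only an approximation of the behaviour described, not purrr's implementation, and the `max = 1` bound for `runif` is an assumed value chosen for illustration:

```r
f <- c("runif", "rnorm", "rpois")
params <- list(
  list(min = -1, max = 1),
  list(sd = 5),
  list(lambda = 10)
)

# For each function name, combine its own arguments with the shared
# argument (n = 10) and call the function via do.call().
sim <- Map(function(fn, args) do.call(fn, c(args, list(n = 10))), f, params)
```

Each function receives its own entry from `params` plus the shared `n = 10`, so every element of `sim` holds ten draws.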
You can use `tibble::frame_data()` to make creating these matching pairs a little easier:
You can use `tibble::tribble()` to make creating these matching pairs a little easier:
```{r, include = FALSE}
tribble <- tibble::frame_data
```
```{r, eval = FALSE}
# Needs dev version of dplyr
sim <- tibble::frame_data(
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim$f %>% invoke_map(sim$params, n = 10) %>% str()
sim %>%
mutate(sim = invoke_map(f, params, n = 10))
```
### Exercises
1.
## Walk {#walk}
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here's a very simple example: