Complete draft of lists chapter

This commit is contained in:
hadley 2015-12-08 10:11:53 -06:00
parent 27b148d9ea
commit afeae8396b
7 changed files with 119 additions and 92 deletions

BIN  diagrams/lists-invoke.png (new binary file, 50 KiB, not shown)
BIN  diagrams/lists-map2.png (new binary file, 30 KiB, not shown)
BIN  (new binary file, 44 KiB, not shown)
BIN  (new binary file, 31 KiB, not shown)
BIN  (new binary file, not shown)

lists.Rmd (201 changed lines)

@ -12,9 +12,9 @@ source("images/embed_jpg.R")
# Lists
In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. Lists allow you to do this because unlike vectors, a list can contain other lists.
In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specific details. This is the same idea as the apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc), but purrr is more consistent and easier to learn.
If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specifics. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc.) solves a similar problem, but purrr is more consistent and easier to learn.
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
@ -44,7 +44,7 @@ In later chapters you'll learn how to apply these ideas when modelling. You can
## List basics
To create a list, you use the `list()` function:
You create a list with `list()`:
```{r}
x <- list(1, 2, 3)
@ -70,9 +70,9 @@ str(z)
`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.
## Visualising lists
### Visualising lists
It's helpful to have a visual representation of lists, so I'll use a nested set representation where each level of the hierarchy is nested in the previous. I'll always use rounded rectangles to represent lists, and regular rectangles to represent vectors. Note that single numbers (e.g. 1, 2), also called scalars, are not top-level objects in R and must always live inside a vector.
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists. For example, take these three lists:
```{r}
x1 <- list(c(1, 2), c(3, 4))
@ -80,15 +80,22 @@ x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```
To make it easier to see the levels in the list, I colour each level a little darker than the previous. The orientation of the elements (i.e. rows or columns) isn't important to the structure of the list (just the order of the elements), so I pick a row or column orientation to either save space or illustrate an important property of the operation.
I draw them as follows:
`r bookdown::embed_png("diagrams/lists-structure.png", dpi = 220)`
(Unfortunately there's no way to draw these diagrams automatically - I did them by hand, carefully picking the arrangement that I think best illustrates the point I'm trying to make.)
* Lists are rounded rectangles that contain their children.
* I draw each child a little darker than its parent to make it easier to see
the hierarchy.
* The orientation of the children (i.e. rows or columns) isn't important,
so I pick a row or column orientation to either save space or illustrate
an important property in the example.
### Subsetting
There are three ways to subset a list, which I'll illustrate with this list:
There are three ways to subset a list, which I'll illustrate with `a`:
```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
@ -101,6 +108,9 @@ a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a[4])
```
Like subsetting vectors, you can use an integer vector to select by
position, or a character vector to select by name.
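For example, using `a` from above, you can pull out several components at once, by position or by name:

```{r}
str(a[c(1, 4)])
str(a[c("a", "b")])
```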
* `[[` extracts a single component from a list. It removes a level of
hierarchy from the list.
@ -110,7 +120,7 @@ a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```
* `$` is a shorthand for extracting named elements of a list. It works
very similarly to `[[` except that you don't need to use quotes.
similarly to `[[` except that you don't need to use quotes.
```{r}
a$a
@ -123,7 +133,7 @@ Or visually:
### Lists of condiments
It's easy to get confused between `[` and `[[`, but understanding the difference is critical when working with lists. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:
It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:
```{r, echo = FALSE}
embed_jpg("images/pepper.jpg", 300)
@ -158,9 +168,9 @@ embed_jpg("images/pepper-3.jpg", 300)
1. What happens if you subset a data frame as if you're subsetting a list?
What are the key differences between a list and a data frame?
## A common pattern of for loops
## For loops
Let's start by creating a stereotypical list: an eight-element list where each element contains a random vector of random length. (You'll learn about `rerun()` later.)
To illustrate for loops, we'll start by creating a stereotypical list: an eight-element list where each element contains a random vector of random length. (You'll learn about `rerun()` later.)
```{r}
x <- rerun(8, runif(sample(5, 1)))
@ -215,9 +225,9 @@ compute_length <- function(x) {
compute_length(x)
```
(And in fact base R has this already: it's called `lengths()`.)
(In fact base R has this function already: it's called `lengths()`.)
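For example, applied to the list `x` created above, it computes the same lengths:

```{r}
lengths(x)
```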
Now imagine we want to compute the `mean()` of each element. How would our function change? What if we wanted to compute the `median()`? You could create variations of `compute_length()` like this:
Now imagine we want to compute the `mean()` of each element. How would our function change? What if we wanted to compute the `median()`? You could create variations of `compute_length()` that look like this:
```{r}
compute_mean <- function(x) {
@ -239,7 +249,7 @@ compute_median <- function(x) {
compute_median(x)
```
But this is only two functions we might want to apply to every element of a list, and there's already a lot of duplication. Most of the code is for-loop boilerplate and it's hard to see the one function (`length()`, `mean()`, or `median()`) that's actually important.
But this is only two of the many functions we might want to apply to every element of a list, and there's already a lot of duplication. Most of the code is for-loop boilerplate and it's hard to see the one function (`length()`, `mean()`, or `median()`) that's actually important.
What would you do if you saw a set of functions like this:
@ -249,7 +259,7 @@ f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
```
You'd notice that there's a lot of duplication, and extract it in to an additional argument:
Hopefully, you'd notice that there's a lot of duplication, and extract it out into an additional argument:
```{r}
f <- function(x, i) abs(x - mean(x)) ^ i
@ -268,7 +278,7 @@ compute_summary <- function(x, f) {
compute_summary(x, mean)
```
Instead of hardcoding the summary function, we allow it to vary, by adding an addition argument that is a function. It can take a while to wrap your head around this, but it's very powerful technique. This is one of the reasons that R is known as a "functional" programming language.
Instead of hardcoding the summary function, we allow it to vary, by adding an additional argument that is a function. It can take a while to wrap your head around this, but it's a very powerful technique. This is one of the reasons that R is known as a "functional" programming language.
### Exercises
@ -286,7 +296,7 @@ Instead of hardcoding the summary function, we allow it to vary, by adding an ad
results
```
How does this impact performance?
How does this affect performance?
## The map functions
@ -300,11 +310,11 @@ This pattern of looping over a list and doing something to each element is so co
* `map_df()` returns a data frame.
* `walk()` returns nothing. Walk is a little different to the others because
it's called exclusively for its side effects, so it's described in more detail
later, [walk](#walk).
later in [walk](#walk).
If none of the specialised versions return exactly what you want, you can always use a `map()` because a list can contain any other object.
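For example (a small illustration using `x` from above), each call to `summary()` returns an object that doesn't fit in any atomic vector, so `map()` is the right tool:

```{r}
x %>% map(summary) %>% str()
```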
Each function takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The type of the vector is determined by the specific map function. Usually you want to use the most specific function available, using `map()` only as a fallback when there is no specialised equivalent.
Each of these functions takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The following code uses purrr to do the same computations as the previous for loops:
We can use these functions to perform the same computations as the previous for loops:
```{r}
map_int(x, length)
@ -338,9 +348,7 @@ There are a few differences between `map_*()` and `compute_summary()`:
### Shortcuts
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each individual in a dataset. The following toy example splits up the `mtcars` dataset into three pieces and fits the same linear model to each piece:
<!-- Haven't covered modelling yet so might need a different motivating example -->
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each individual in a dataset. The following toy example splits up the `mtcars` dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:
```{r}
models <- mtcars %>%
@ -348,8 +356,6 @@ models <- mtcars %>%
map(function(df) lm(mpg ~ wt, data = df))
```
(Fitting many models is a powerful technique which we'll come back to in the case study at the end of the chapter.)
The syntax for creating an anonymous function in R is quite verbose so purrr provides a convenient shortcut: a one-sided formula.
```{r}
@ -360,7 +366,7 @@ models <- mtcars %>%
Here I've used `.` as a pronoun: it refers to the "current" list element (in the same way that `i` referred to the number in the for loop). You can also use `.x` and `.y` to refer to up to two arguments. If you want to create a function with more than two arguments, do it the regular way!
When you're looking at many models, you might want to extract a summary static like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous funtions:
When you're looking at many models, you might want to extract a summary statistic like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous functions:
```{r}
models %>%
@ -368,7 +374,7 @@ models %>%
map_dbl(~.$r.squared)
```
But extracting named components is a really common operation, so purrr provides an even shorter shortcut: you can use a string.
But extracting named components is a common operation, so purrr provides an even shorter shortcut: you can use a string.
```{r}
models %>%
@ -383,10 +389,6 @@ x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)
```
### Map applications
???
### Base R
If you're familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:
@ -422,8 +424,8 @@ If you're familiar with the apply family of functions in base R, you might have
it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to
`map_lgl(df, is.numeric)`.
One of advantage `vapply()` over the map functions is that it can also
produce matrices - the map functions always produce vectors.
One advantage of `vapply()` over the map functions is that it can also
produce matrices (sketched below) - the map functions only ever produce
vectors.
* `map_df(x, f)` is effectively the same as `do.call("rbind", lapply(x, f))`
but under the hood is much more efficient.
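For example, a small sketch of the `vapply()` matrix behaviour mentioned above:

```{r}
# each call to range() returns two numbers, so vapply() builds a 2 x 3 matrix
vapply(mtcars[1:3], range, numeric(2))
```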
@ -443,18 +445,25 @@ If you're familiar with the apply family of functions in base R, you might have
## Handling hierarchy {#hierarchy}
As you start to use these functions more frequently, you'll find that you start to create quite complex trees. The techniques in this section will help you work with those structures.
The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:
### Deep nesting
* You can extract deeply nested elements in a single call by supplying
a character vector to the map functions.
Sometimes you get data structures that are very deeply nested. A common source of hierarchical data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little. Here I'm going to show you how to do it by hand, so I set `simplifyVector = FALSE`:
* You can remove a level of the hierarchy with the flatten functions.
* You can flip levels of the hierarchy with the transpose function.
### Extracting deeply nested elements
Sometimes you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:
```{r}
# From https://api.github.com/repos/hadley/r4ds/issues
issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE)
```
There are eight issues, and each issue has a nested structure.
There are eight issues, and each issue is a nested list:
```{r}
length(issues)
@ -477,24 +486,24 @@ users %>% map_chr("login")
users %>% map_int("id")
```
Or by using a character vector, you can do it in one:
But by supplying a character _vector_ to `map_*`, you can do it in one:
```{r}
issues %>% map_chr(c("user", "login"))
issues %>% map_int(c("user", "id"))
```
This is particularly useful when you want to pull one element out of a deeply nested data structure.
### Removing a level of hierarchy
As well as indexing deeply into the hierarchy, it's sometimes useful to flatten it. That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`. In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.
```{r}
x <- list(list(a = 1, b = 2), list(c = 3, d = 4))
x %>% str()
x %>% flatten() %>% str()
x %>% flatten() %>% flatten_dbl()
str(x)
y <- flatten(x)
str(y)
flatten_dbl(y)
```
Graphically, that sequence of operations looks like:
@ -503,7 +512,7 @@ Graphically, that sequence of operations looks like:
Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.
Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds. Even if you data structure accidentally changes, `unlist()` will continue to work silently giving the wrong answer.
Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds. Even if your data structure accidentally changes, `unlist()` will continue to work, silently giving you the wrong type of output. This tends to create problems that are frustrating to debug.
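For example, here's a minimal sketch of the difference, using a throwaway list `mixed`:

```{r}
mixed <- list(1, 2, "a")
unlist(mixed)          # silently coerces everything to character
# flatten_dbl(mixed)   # errors instead of silently returning the wrong type
```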
### Switching levels in the hierarchy
@ -539,12 +548,12 @@ df %>% transpose() %>% str()
When you do many operations on a list, sometimes one will fail. When this happens, you'll get an error message, and no output. This is annoying: why does one failure prevent you from accessing all the other successes? How do you ensure that one bad apple doesn't ruin the whole barrel?
In this section you'll learn how to deal this situation with a new function: `safely()`. `safely()` is an adverb: it takes a function modifies it. In this case, the modified function never throws an error and always succeeds. Instead, it returns a list with two elements:
In this section you'll learn how to deal this situation with a new function: `safely()`. `safely()` is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:
1. `result`: the original result. If there was an error, this will be `NULL`.
1. `result` is the original result. If there was an error, this will be `NULL`.
1. `error`: the text of the error if it occured. If the operation was
successful this will be `NULL`.
1. `error` is an error object. If the operation was successful this will be
`NULL`.
(You might be familiar with the `try()` function in base R. It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)
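For example, the type of thing `try()` returns depends on whether the call succeeded, so downstream code has to handle both cases:

```{r}
str(try(log(10)))
str(try(log("a"), silent = TRUE))
```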
@ -556,13 +565,13 @@ str(safe_log(10))
str(safe_log("a"))
```
When the function succeeds the `result` element contains the result and the error element is empty. When the function fails, the result element is empty and the error element contains the error.
When the function succeeds the `result` element contains the result and the error element is `NULL`. When the function fails, the result element is `NULL` and the error element contains an error object.
This makes it natural to work with map:
`safely()` is designed to work with map:
```{r}
x <- list(1, 10, "a")
y <- x %>% map(safe_log)
y <- x %>% map(safely(log))
str(y)
```
@ -581,20 +590,18 @@ x[!is_ok]
y$result[is_ok] %>% flatten_dbl()
```
(Note that you should always check that the error is null, not that the result is not-null. Sometimes the correct response is `NULL`.)
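For example, with `y` from the chunk above, that check looks something like this (testing `error`, not `result`):

```{r}
y %>% map_lgl(~ is.null(.$error))
```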
Purrr provides two other useful adverbs:
Other related functions:
* `possibly()`: if you don't care about the error message, and instead
just want a default value on failure.
* Like `safely()`, `possibly()` always succeeds. It's simpler than `safely()`,
because you give it a default value to return when there is an error.
```{r}
x <- list(1, 10, "a")
x %>% map_dbl(possibly(log, NA_real_))
```
* `quietly()`: does a similar job but for other outputs like printed
ouput, messages, and warnings.
* `quietly()` performs a similar role to `safely()`, but instead of capturing
errors, it captures printed output, messages, and warnings:
```{r}
x <- list(1, -1)
@ -610,7 +617,7 @@ Other related functions:
files <- dir("data", pattern = "\\.csv$")
files %>%
set_names(., basename(.)) %>%
map_df(readr::read_csv, .id = "filename") %>%
map_df(safely(readr::read_csv), .id = "filename") %>%
```
## Parallel maps
@ -618,18 +625,22 @@ Other related functions:
So far we've mapped along a single list. But often you have multiple related lists that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
```{r}
mu <- c(5, 10, -3)
mu <- list(5, 10, -3)
mu %>% map(rnorm, n = 10)
```
What if you also want to vary the standard deviation? You need to iterate along a vector of means and a vector of standard deviations in parallel. That's a job for `map2()` which works with two parallel sets of inputs:
```{r}
sd <- c(1, 5, 10)
map2(mu, sd, rnorm, n = 10)
sigma <- list(1, 5, 10)
map2(mu, sigma, rnorm, n = 10)
```
Note that arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.
`map2()` generates this series of function calls:
`r bookdown::embed_png("diagrams/lists-map2.png", dpi = 220)`
The arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.
Like `map()`, `map2()` is just a wrapper around a for loop:
@ -643,28 +654,38 @@ map2 <- function(x, y, f, ...) {
}
```
You could imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly. Instead, purrr provides `pmap()` which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:
You could also imagine `map3()`, `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly. Instead, purrr provides `pmap()` which takes a list of arguments. You might use that if you wanted to vary the mean, standard deviation, and number of samples:
```{r}
n <- c(1, 3, 5)
pmap(list(n, mu, sd), rnorm)
n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>% pmap(rnorm) %>% str()
```
That looks like:
`r bookdown::embed_png("diagrams/lists-pmap-unnamed.png", dpi = 220)`
However, instead of relying on position matching, it's better to name the arguments. This is more verbose, but it makes the code clearer.
```{r}
pmap(list(mean = mu, sd = sd, n = n), rnorm)
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>% pmap(rnorm) %>% str()
```
That generates longer, but safer, calls:
`r bookdown::embed_png("diagrams/lists-pmap-named.png", dpi = 220)`
Since the arguments are all the same length, it makes sense to store them in a data frame:
```{r}
params <- dplyr::data_frame(mean = mu, sd = sd, n = n)
params <- dplyr::data_frame(mean = mu, sd = sigma, n = n)
params$result <- params %>% pmap(rnorm)
params
```
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns.
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea when we explore the intersection of dplyr, purrr, and model fitting.
### Invoking different functions
@ -682,12 +703,14 @@ param <- list(
To handle this case, you can use `invoke_map()`:
```{r}
invoke_map(f, param, n = 5)
invoke_map(f, param, n = 5) %>% str()
```
The first argument is a list of functions or character vector of function names, the second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
`r bookdown::embed_png("diagrams/lists-invoke.png")`
You can use `dplyr::frame_data()` to create these matching pairs a little easier:
The first argument is a list of functions or character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
You can use `dplyr::frame_data()` to make creating these matching pairs a little easier:
```{r, eval = FALSE}
# Needs dev version of dplyr
@ -702,9 +725,18 @@ sim %>% dplyr::mutate(
)
```
### Walk {#walk}
## Walk {#walk}
Walk is useful when you want to call a function for its side effects. It returns its input, so you can easily use it in a pipe. Here's an example:
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here's a very simple example:
```{r}
x <- list(1, "a", 3)
x %>%
walk(print)
```
`walk()` is generally not that useful compared to `walk2()` or `pwalk()`. For example, if you had a list of plots and a vector of file names, you could use `pwalk()` to save each file to the corresponding location on disk:
```{r}
library(ggplot2)
@ -716,13 +748,7 @@ paths <- paste0(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
```
`walk()`, `walk2()` and `pwalk()` all invisibly return the first argument. This makes it easier to use them in chains. The following example prints
```{r, eval = FALSE}
plots %>%
walk(print) %>%
walk2(paths, ~ggsave(.y, .x, path = tempdir()))
```
`walk()`, `walk2()` and `pwalk()` all invisibly return `.x`, the first argument. This makes them suitable for use in the middle of pipelines.
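For example, this sketch (reusing `plots` and `paths` from the previous chunk) prints each plot and then saves it, in a single pipeline:

```{r, eval = FALSE}
plots %>%
  walk(print) %>%
  walk2(paths, ~ ggsave(.y, .x, path = tempdir()))
```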
## Predicates
@ -740,7 +766,7 @@ col_sum <- function(df, f) {
}
```
`is_numeric()` is a __predicate__: a function that returns `TRUE` or `FALSE`. There are a number of purrr functions designed to work specifically with predicates:
`is_numeric()` is a __predicate__: a function that returns either `TRUE` or `FALSE`. There are a number of purrr functions designed to work specifically with predicates:
* `keep()` and `discard()` keeps/discards list elements where the predicate is
true.
@ -829,12 +855,3 @@ is_bare_integer(y)
1. Carefully read the documentation of `is.vector()`. What does it actually
test for?
## Data frames
i.e. how do dplyr and purrr intersect.
* Why use a data frame?
* List columns in a data frame
* Mutate & filter.
* Creating list columns with `group_by()` and `do()`.


@ -14,6 +14,16 @@ options(digits = 3)
* Bootstrapping to understand uncertainty in parameters.
* Cross-validation to understand predictive quality.
## Purrr + dplyr
i.e. how do dplyr and purrr intersect.
* Why use a data frame?
* List columns in a data frame
* Mutate & filter.
* Creating list columns with `group_by()` and `do()`.
## Multiple models
A natural application of `map2()` is handling test-training pairs when doing model evaluation. This is an important modelling technique: you should never evaluate a model on the same data it was fit to because it's going to make you overconfident. Instead, it's better to divide the data up and use one piece to fit the model and the other piece to evaluate it. A popular technique for this is called k-fold cross validation. You randomly hold out x% of the data and fit the model to the rest. You need to repeat this a few times because of random variation.
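For example, here's a rough sketch of this idea with purrr (random 80/20 splits of `mtcars` and a single-predictor model; the details are illustrative, not the approach developed below):

```{r}
library(purrr)

n <- nrow(mtcars)
# five random 80/20 training/test splits (row indices into mtcars)
idx <- rerun(5, sample(n, round(0.8 * n)))
train <- map(idx, ~ mtcars[.x, ])
test <- map(idx, ~ mtcars[-.x, ])

# fit a model to each training set, then compute test RMSE in parallel with map2()
models <- map(train, ~ lm(mpg ~ wt, data = .x))
map2_dbl(models, test, ~ sqrt(mean((predict(.x, .y) - .y$mpg) ^ 2)))
```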