---
layout: default
title: List manipulation
output: bookdown::html_chapter
---
```{r setup, include=FALSE}
library(purrr)
set.seed(1014)
options(digits = 3)
source("images/embed_jpg.R")
```
# Lists
In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. Lists allow you to do this because, unlike atomic vectors, a list can contain other lists.
If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specific details. This is the same idea as the apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc), but purrr is more consistent and easier to learn.
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
1. How can you solve the problem for a single element of the list? Once
you've solved that problem, purrr takes care of generalising your
solution to every element in the list.
1. If you're solving a complex problem, how can you break it down into
bite sized pieces that allow you to advance one small step towards a
solution? With purrr, you get lots of small pieces that you can
compose together with the pipe.
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
<!--
## Warm ups
* What does this for loop do?
* How is a data frame like a list?
* What does `mean()` mean? What does `mean` mean?
* How do you get help about the $ function? How do you normally write
`[[`(mtcars, 1) ?
* Argument order
-->
## List basics
To create a list, you use the `list()` function:
```{r}
x <- list(1, 2, 3)
str(x)
```
Unlike atomic vectors, lists can contain a mix of objects:
```{r}
y <- list("a", 1L, 1.5, TRUE)
str(y)
```
`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.
Lists can even contain other lists!
```{r}
z <- list(list(1, 2), list(3, 4))
str(z)
```
There are three ways to subset a list:
* `[` extracts a sub-list. The result will always be a list.
```{r}
str(y[1:3])
str(y[1])
```
* `[[` extracts a single component from a list.
```{r}
str(y[[1]])
str(y[[3]])
```
* `$` is a shorthand for extracting named elements of a list. It works
very similarly to `[[` except that you don't need to use quotes.
```{r}
a <- list(x = 1:2, y = 3:4)
a$x
a[["y"]]
```
It's easy to get confused between `[` and `[[`, but understanding the difference is critical when working with lists. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember:
```{r, echo = FALSE}
embed_jpg("images/pepper.jpg", 300)
```
If this pepper shaker is your list `x`, then, `x[1]` is a pepper shaker containing a single pepper packet:
```{r, echo = FALSE}
embed_jpg("images/pepper-1.jpg", 300)
```
`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.
`x[[1]]` is:
```{r, echo = FALSE}
embed_jpg("images/pepper-2.jpg", 300)
```
If you wanted to get the content of the pepper packet, you'd need `x[[1]][[1]]`:
```{r, echo = FALSE}
embed_jpg("images/pepper-3.jpg", 300)
```
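The analogy can also be checked directly in R. This is a small sketch using a made-up nested list called `shaker` (not an object defined elsewhere in this chapter), where each packet is itself a list holding its contents:

```{r}
# Hypothetical list: a shaker holding two packets, each packet a list
shaker <- list(list("pepper"), list("more pepper"))

str(shaker[1])         # still a list (a shaker) holding one packet
str(shaker[[1]])       # the packet itself: a list of length 1
str(shaker[[1]][[1]])  # the contents of the packet: a character vector
```

Each extra `[[` removes exactly one level of the hierarchy, whereas `[` always keeps the outer list.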
## A common pattern of for loops
Let's start by creating a stereotypical list: an eight element list where each element contains a random vector of random length. (You'll learn `rerun()` later.)
```{r}
x <- rerun(8, runif(sample(5, 1)))
str(x)
```
Imagine we want to compute the length of each element in this list. One way to do that is with a for loop:
```{r}
results <- vector("integer", length(x))
for (i in seq_along(x)) {
  results[i] <- length(x[[i]])
}
results
```
There are three parts to a for loop:
1. The __results__: `results <- vector("integer", length(x))`.
This creates an integer vector the same length as the input. It's important
to allocate enough space for all the results up front, otherwise you have to grow the
results vector at each iteration, which is very slow for large loops.
1. The __sequence__: `i in seq_along(x)`. This determines what to loop over:
each run of the for loop will assign `i` to a different value from
`seq_along(x)`, shorthand for `1:length(x)`.
1. The __body__: `results[i] <- length(x[[i]])`. This code is run repeatedly,
each time with a different value in `i`. The first iteration will run
`results[1] <- length(x[[1]])`, the second `results[2] <- length(x[[2]])`,
and so on.
This loop used a function you might not be familiar with: `seq_along()`. This is a safe version of the more familiar `1:length(x)`. There's one important difference in behaviour. If you have a zero-length vector, `seq_along()` does the right thing:
```{r}
y <- numeric(0)
seq_along(y)
1:length(y)
```
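To see why this matters inside a loop, here is a small sketch that counts how many times the loop body runs for an empty input. The names `empty`, `n_safe`, and `n_unsafe` are made up for this illustration:

```{r}
empty <- numeric(0)

# seq_along() gives a zero-length sequence, so the body never runs
n_safe <- 0
for (i in seq_along(empty)) n_safe <- n_safe + 1

# 1:length(empty) is 1:0, i.e. c(1, 0), so the body runs twice
n_unsafe <- 0
for (i in 1:length(empty)) n_unsafe <- n_unsafe + 1

c(seq_along = n_safe, colon = n_unsafe)
```

With `1:length()`, a loop over an empty input would try to process elements 1 and 0, neither of which exists.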
Figuring out the length of the elements of a list is a common operation, so it makes sense to turn it into a function so we can reuse it again and again:
```{r}
compute_length <- function(x) {
  # length() returns an integer, so allocate an integer vector
  results <- vector("integer", length(x))
  for (i in seq_along(x)) {
    results[i] <- length(x[[i]])
  }
  results
}
compute_length(x)
```
(And in fact base R has this already: it's called `lengths()`.)
Now imagine we want to compute the `mean()` of each element. How would our function change? What if we wanted to compute the `median()`? You could create variations of `compute_length()` like this:
```{r}
compute_mean <- function(x) {
  results <- vector("numeric", length(x))
  for (i in seq_along(x)) {
    results[i] <- mean(x[[i]])
  }
  results
}
compute_mean(x)
compute_median <- function(x) {
  results <- vector("numeric", length(x))
  for (i in seq_along(x)) {
    results[i] <- median(x[[i]])
  }
  results
}
compute_median(x)
```
But these are only two of the functions we might want to apply to every element of a list, and there's already a lot of duplication. Most of the code is for-loop boilerplate and it's hard to see the one function (`length()`, `mean()`, or `median()`) that's actually important.
What would you do if you saw a set of functions like this:
```{r}
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3
```
You'd notice that there's a lot of duplication, and extract it into an additional argument:
```{r}
f <- function(x, i) abs(x - mean(x)) ^ i
```
You've reduced the chance of bugs (because you now have 1/3 of the code), and made it easy to generalise to new situations. We can do exactly the same thing with `compute_length()`, `compute_median()` and `compute_mean()`:
```{r}
compute_summary <- function(x, f) {
  results <- vector("numeric", length(x))
  for (i in seq_along(x)) {
    results[i] <- f(x[[i]])
  }
  results
}
compute_summary(x, mean)
```
Instead of hardcoding the summary function, we allow it to vary by adding an additional argument that is a function. It can take a while to wrap your head around this, but it's a very powerful technique. This is one of the reasons that R is known as a "functional" programming language.
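Because the second argument is just a function, you are not limited to functions that already have a name. The sketch below repeats `compute_summary()` so that it stands alone, and applies it to a small made-up list `vals` with both a named and an anonymous function:

```{r}
# compute_summary() repeated from above so this chunk stands alone
compute_summary <- function(x, f) {
  results <- vector("numeric", length(x))
  for (i in seq_along(x)) {
    results[i] <- f(x[[i]])
  }
  results
}

vals <- list(c(1, 2, 3), c(10, 20), 5)  # made-up input
compute_summary(vals, median)
# An anonymous function computing the spread of each element
compute_summary(vals, function(v) max(v) - min(v))
```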
## The map functions
This pattern of looping over a list and doing something to each element is so common that the purrr package provides a family of functions to do it for you. Each function always returns the same type of output, so there are seven variations based on what sort of result you want:
* `map()`: a list.
* `map_lgl()`: a logical vector.
* `map_int()`: an integer vector.
* `map_dbl()`: a double vector.
* `map_chr()`: a character vector.
* `map_df()`: a data frame.
* `walk()`: nothing (called exclusively for side effects).
If none of the specialised versions return exactly what you want, you can always use a `map()` because a list can contain any other object.
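As a sketch of when the plain `map()` is the right choice, consider a function that returns more than one value per element. The list `pieces` is made up for this example:

```{r}
library(purrr)

pieces <- list(c(3, 1, 2), c(10, 20), 5)  # made-up input

# range() returns two numbers per element, so no atomic variant fits;
# map() keeps each result as its own list element
ranges <- map(pieces, range)
str(ranges)
```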
Each of these functions takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The following code uses purrr to do the same computations we did above:
```{r}
map_int(x, length)
map_dbl(x, mean)
map_dbl(x, median)
```
There are a few differences between `map_*()` and `compute_summary()`:
* They are implemented in C code. This means you can't easily understand their
implementation, but it reduces a little overhead so they run even faster
than for loops.
* The second argument, `.f`, the function to apply to each element, can be
a formula, a character vector, or an integer vector. You'll learn about
those handy shortcuts in the next section.
* You can pass on additional arguments to `.f`:
```{r}
map_dbl(x, mean, trim = 0.5)
map_dbl(x, function(x) mean(x, trim = 0.5))
```
* They preserve names:
```{r}
z <- list(x = 1:3, y = 4:5)
map_int(z, length)
```
If you're familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:
* `lapply()` is basically identical to `map()`. There's no advantage to using
`map()` over `lapply()` except that it's consistent with all the other
functions in purrr.
* The base `sapply()` is a wrapper around `lapply()` that automatically tries
to simplify the results. This is useful for interactive work but is
problematic in a function because you never know what sort of output
you'll get:
```{r}
df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)
str(sapply(df[1:4], class))
str(sapply(df[1:2], class))
str(sapply(df[3:4], class))
```
* `vapply()` is a safe alternative to `sapply()` because you supply an additional
argument that defines the type. The only problem with `vapply()` is that
it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to
`map_lgl(df, is.numeric)`. One advantage to `vapply()` over the map
functions is that it can also produce matrices.
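The matrix case can be sketched with base R alone. When the template passed as the third argument has length greater than one, `vapply()` binds the per-element results as columns. The list `samples` is made up for this example:

```{r}
samples <- list(c(3, 1, 2), c(10, 20), 5)  # made-up input

# Each call to range() returns two numbers, declared via numeric(2);
# the results become the columns of a 2 x 3 matrix
rng <- vapply(samples, range, numeric(2))
rng
dim(rng)
```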
## Pipelines
`map()` is particularly useful when constructing more complex transformations because it both inputs and outputs a list. That makes it well suited for solving a problem a piece at a time. For example, imagine you want to fit a linear model to each individual in a dataset.
Let's start by working through the whole process on the complete dataset. It's always a good idea to start simple (with a single object), and figure out the basic workflow. Then you can generalise up to the harder problem of applying the same steps to multiple models.
TODO: find interesting dataset
You could start by creating a list where each element is a data frame for a different group (here, `mtcars` split by number of cylinders), and then fit a model to each piece:
```{r}
models <- mtcars %>%
  split(.$cyl) %>%
  map(function(df) lm(mpg ~ wt, data = df))
```
The syntax for creating a function in R is quite long so purrr provides a convenient shortcut. You can use a formula:
```{r}
models <- mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ wt, data = .))
```
Here I've used the pronoun `.`. You can also use `.x`, `.y`, and `.z` to refer to up to three arguments. If you want to create a function with more than three arguments, do it the regular way!
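As a small sketch of the pronouns on made-up numeric input, the first call below uses `.x` in a one-argument formula, and the second uses `.x` and `.y` with `map2_dbl()`, a two-input variant that is covered later in this chapter:

```{r}
library(purrr)

# One-argument formula: .x (or .) is the current element
map_dbl(c(1, 4, 9), ~ sqrt(.x))

# Two-argument formula: .x and .y are the paired elements
map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x * .y)
```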
A common application of these functions is extracting an element so purrr provides a shortcut. For example, to extract the R squared of a model, we need to first run `summary()` and then extract the component called "r.squared":
```{r}
models %>%
  map(summary) %>%
  map_dbl(~.$r.squared)
```
We can simplify this still further by using a character vector:
```{r}
models %>%
  map(summary) %>%
  map_dbl("r.squared")
```
Similarly, you can use an integer vector to extract the element in a given position.
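For instance, with unnamed records you can extract by position. The list `people` below is made up for this sketch:

```{r}
library(purrr)

# Hypothetical unnamed records: position 1 is a name, position 2 an age
people <- list(list("Ann", 31), list("Bob", 45))

map_chr(people, 1)  # first element of each record
map_dbl(people, 2)  # second element of each record
```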
### Navigating hierarchy
These techniques are useful in general when working with complex nested objects. One way to get such an object is to create many models or other complex things in R. Other times you get a complex object because you're reading in hierarchical data from another source.
A common source of hierarchical data is JSON from a web API.
```{r}
issues <- jsonlite::fromJSON("https://api.github.com/repos/hadley/r4ds/issues", simplifyVector = FALSE)
length(issues)
str(issues[[1]])
```
Note that you can use a character vector in any of the map functions. This will subset recursively, which is particularly useful when you want to dive deep into a nested data structure.
```{r}
issues %>% map_chr(c("user", "login"))
issues %>% map_int(c("user", "id"))
```
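The chunk above needs a network connection, so here is an offline sketch of the same recursive extraction. `toy_issues` is a hand-built stand-in with made-up values; it only mimics the nesting of a parsed JSON response, not the real GitHub payload:

```{r}
library(purrr)

# Hand-built stand-in for parsed JSON; all values are invented
toy_issues <- list(
  list(number = 1L, user = list(login = "alice", id = 101L)),
  list(number = 2L, user = list(login = "bob", id = 202L))
)

toy_issues %>% map_chr(c("user", "login"))
toy_issues %>% map_int(c("user", "id"))

# The long form of the same lookup for a single element
toy_issues[[1]][["user"]][["login"]]
```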
### Predicates
Imagine we want to summarise each numeric column of a data frame. We could write this:
```{r}
col_sum <- function(df, f) {
  is_num <- df %>% map_lgl(is_numeric)
  df[is_num] %>% map_dbl(f)
}
```
`is_numeric()` is a __predicate__: a function that returns a logical output. There are a couple of purrr functions designed to work specifically with predicate functions:
* `keep()` keeps all elements of a list where the predicate is true.
* `discard()` throws away elements of the list where the predicate is
true.
That allows us to simplify the summary function to:
```{r}
col_sum <- function(df, f) {
  df %>%
    keep(is.numeric) %>%
    map_dbl(f)
}
```
[Sidebar: list of predicate functions. Better to use purrr's underscore variants because they tend to do what you expect, and are implemented in R so if you're unsure you can read the source]
This is a nice example of the benefits of piping - we can more easily see the sequence of transformations done to the list. First we throw away non-numeric columns and then we apply the function `f` to each one.
Other predicate functionals: `head_while()`, `tail_while()`, `some()`, `every()`.
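Here is a brief sketch of four of these functions on a made-up list called `mixed`, using base R's `is.numeric()` and `is.character()` as the predicates:

```{r}
library(purrr)

mixed <- list(a = 1:3, b = "x", c = 2.5)  # made-up input

names(keep(mixed, is.numeric))     # elements where the predicate is TRUE
names(discard(mixed, is.numeric))  # elements where the predicate is FALSE
every(mixed, is.numeric)           # are all elements numeric?
some(mixed, is.character)          # is at least one element character?
```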
### Exercises
## Dealing with failure
When you start doing many operations with purrr, you'll soon discover that not everything always succeeds. For example, you might be fitting a bunch of more complicated models, and not every model will converge. How do you ensure that one bad apple doesn't ruin the whole barrel?
Dealing with errors is fundamentally painful because errors are sort of a side-channel to the way that functions usually return values. The best way to handle them is to turn them into a regular output with the `safe()` function. This function is similar to the `try()` function in base R, but instead of sometimes returning the original output and sometimes returning an error, `safe()` always returns the same type of object: a list with elements `result` and `error`. For any given run, one will always be `NULL`, but because the structure is always the same it's easier to deal with.
Let's illustrate this with a simple example: `log()`:
```{r}
safe_log <- safe(log)
str(safe_log(10))
str(safe_log("a"))
```
You can see when the function succeeds the result element contains the result and the error element is empty. When the function fails, the result element is empty and the error element contains the error.
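To make the idea concrete, here is a minimal base R sketch of how such a wrapper could be written with `tryCatch()`. The names `safe_sketch()` and `safe_log_sketch()` are hypothetical, and this is an illustration of the pattern rather than purrr's actual implementation:

```{r}
# Hypothetical helper: wrap f so that it always returns
# list(result, error) instead of throwing
safe_sketch <- function(f) {
  function(...) {
    tryCatch(
      list(result = f(...), error = NULL),
      error = function(e) list(result = NULL, error = e)
    )
  }
}

safe_log_sketch <- safe_sketch(log)
str(safe_log_sketch(10))      # result filled in, error is NULL
safe_log_sketch("a")$error    # the captured error condition
```

Because both branches return a list with the same two names, downstream code never has to check which kind of object it received.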
This makes it natural to work with map:
```{r}
x <- list(1, 10, "a")
y <- x %>% map(safe_log)
str(y)
```
This output would be easier to work with if we had two lists: one of all the errors and one of all the results:
```{r}
result <- y %>% map("result")
error <- y %>% map("error")
```
(Later on, you'll see another way to attack this problem with `transpose()`)
It's up to you how to deal with these errors, but typically you'd start by looking at the values of `x` where `y` is an error or working with the values of y that are ok:
```{r}
is_ok <- error %>% map_lgl(is.null)
x[!is_ok]
result[is_ok] %>% map_dbl(identity)
```
When we have related vectors, it's useful to store them in a data frame:
```{r}
all <- dplyr::data_frame(
  x = list(1, 10, "a"),
  y = x %>% map(safe_log),
  result = y %>% map("result"),
  error = y %>% map("error"),
  is_ok = error %>% map_lgl(is.null)
)
dplyr::filter(all, is_ok)
```
Other related functions:
* `maybe()`: if you don't care about the error message, and instead
just want a default value on failure.
* `outputs()`: does a similar job but for other outputs like printed
output, messages, and warnings.
Challenge: read all the csv files in this directory. Which ones failed
and why?
```{r, eval = FALSE}
# full.names = TRUE so that read_csv() receives a usable path
files <- dir("data", pattern = "\\.csv$", full.names = TRUE)
files %>%
  set_names(., basename(.)) %>%
  map_df(readr::read_csv, .id = "filename")
```
## Multiple inputs
So far we've focussed on variants that differ primarily in their output. There is a family of useful variants that vary primarily in their input: `map2()`, `map3()` and `map_n()`.
Imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
```{r}
mu <- c(5, 10, -3)
mu %>% map(rnorm, n = 10)
```
What if you also want to vary the standard deviation? That's a job for `map2()` which works with two parallel sets of inputs:
```{r}
sd <- c(1, 5, 10)
map2(mu, sd, rnorm, n = 10)
```
Note that arguments that vary for each call come before the function name, and arguments that are the same for every function call come afterwards.
Like `map()`, conceptually `map2()` is a simple wrapper around a for loop:
```{r}
map2 <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)
  }
  out
}
```
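To try the wrapper without masking purrr's own `map2()`, the sketch below repeats the same loop under the hypothetical name `map2_sketch()` and compares it with base R's `Map()`, which pairs up its inputs in the same way. The helper `scaled_sum()` is also made up for this example:

```{r}
# Same loop as above, under a hypothetical name
map2_sketch <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)
  }
  out
}

str(map2_sketch(1:3, 4:6, `+`))
str(Map(`+`, 1:3, 4:6))  # base R gives the same pairing

# Arguments after the function are passed unchanged to every call
scaled_sum <- function(a, b, scale) (a + b) * scale
str(map2_sketch(c(1, 2), c(3, 4), scaled_sum, scale = 10))
```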
There's also `map3()` which allows you to vary three arguments at a time:
```{r}
n <- c(1, 5, 10)
map3(n, mu, sd, rnorm)
```
(Note that it's not that natural to use `map2()` and `map3()` in a pipeline because they have multiple primary inputs.)
You could imagine `map4()`, `map5()`, `map6()` etc, but that would get tedious quickly. Instead, purrr provides `map_n()` which takes a list of arguments. Here's the `map_n()` call that's equivalent to the previous `map3()` call:
```{r}
map_n(list(n, mu, sd), rnorm)
```
Another advantage of `map_n()` is that you can use named arguments instead of relying on positional matching:
```{r}
map_n(list(mean = mu, sd = sd, n = n), rnorm)
```
Since the arguments are all the same length, it makes sense to store them in a data frame:
```{r}
params <- dplyr::data_frame(mean = mu, sd = sd, n = n)
params %>% map_n(rnorm)
```
As soon as you get beyond simple examples, I think using data frames + `map_n()` is the way to go because the data frame ensures that each column has a name, and is the same length as all the other columns. This makes your code easier to understand (once you've grasped this powerful pattern).
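The underlying idea, one function call per row with arguments matched by column name, can be sketched in base R with `Map()` and `do.call()`. This is an illustration of the pattern, not how purrr implements it, and `sim_params` and `draws` are made-up names:

```{r}
# Hypothetical parameter table: one row per call to rnorm()
sim_params <- data.frame(
  mean = c(5, 10, -3),
  sd = c(1, 5, 10),
  n = c(1, 5, 10)
)

# Map() calls rnorm(mean = ..., sd = ..., n = ...) once per row,
# matching arguments by column name rather than by position
draws <- do.call(Map, c(list(f = rnorm), as.list(sim_params)))
lengths(draws)
```

Because the matching is by name, reordering the columns of `sim_params` would not change which value is used for which argument.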
There's one more step up in complexity - as well as varying the arguments to the function you might be varying the function itself:
```{r}
f <- c("runif", "rnorm", "rpois")
param <- list(
  list(min = -1, max = 1),
  list(sd = 5),
  list(lambda = 10)
)
```
To handle this case, you can use `invoke_map()`:
```{r}
invoke_map(f, param, n = 5)
```
The first argument is a list of functions or character vector of function names, the second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are passed on to every function.
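As a rough base R sketch of the same idea (an illustration only, not purrr's implementation), you can pair each function name with its argument list and call it with `do.call()`. The names `fn_names`, `fn_args`, and `draws5` are made up for this example:

```{r}
fn_names <- c("runif", "rnorm", "rpois")
fn_args <- list(
  list(min = -1, max = 1),
  list(sd = 5),
  list(lambda = 10)
)

# For each pair, call the named function with its own arguments
# plus the shared argument n = 5
draws5 <- Map(
  function(fn, args) do.call(fn, c(args, list(n = 5))),
  fn_names, fn_args
)
str(draws5)
```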
You can use `dplyr::frame_data()` to create these matching pairs a little more easily:
```{r}
sim <- dplyr::frame_data(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)
sim %>% dplyr::mutate(
  samples = invoke_map(f, params, n = 10)
)
```
### Models
A natural application of `map2()` is handling test-training pairs when doing model evaluation. This is an important modelling technique: you should never evaluate a model on the same data it was fit to because it's going to make you overconfident. Instead, it's better to divide the data up and use one piece to fit the model and the other piece to evaluate it. A popular family of techniques for this is cross-validation (k-fold cross-validation is the best-known variant). In the version used here, you randomly hold out x% of the data and fit the model to the rest. You need to repeat this a few times because of random variation.
Let's start by writing a function that partitions a dataset into test and training:
```{r}
partition <- function(df, p) {
  n <- nrow(df)
  # Round so the two group sizes are whole numbers that sum to n
  n_test <- round(n * p)
  groups <- rep(c(TRUE, FALSE), c(n_test, n - n_test))
  sample(groups)
}
partition(mtcars, 0.1)
```
We'll generate 200 random test-training splits, and then create lists of test-training datasets:
```{r}
partitions <- rerun(200, partition(mtcars, 0.25))
tst <- partitions %>% map(~mtcars[.x, , drop = FALSE])
trn <- partitions %>% map(~mtcars[!.x, , drop = FALSE])
```
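It can be reassuring to check a single split by hand. The sketch below builds one logical mask directly (using made-up names `is_test`, `tst_one`, and `trn_one`) and confirms that the test and training sets together cover every row exactly once:

```{r}
n_rows <- nrow(mtcars)
n_test <- round(n_rows * 0.25)

# One random mask: TRUE marks a test row, FALSE a training row
is_test <- sample(rep(c(TRUE, FALSE), c(n_test, n_rows - n_test)))

tst_one <- mtcars[is_test, , drop = FALSE]
trn_one <- mtcars[!is_test, , drop = FALSE]

c(test = nrow(tst_one), train = nrow(trn_one))
# No car appears in both sets
intersect(rownames(tst_one), rownames(trn_one))
```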
Then fit the models to each training dataset:
```{r}
mod <- trn %>% map(~lm(mpg ~ wt, data = .))
```
If we wanted, we could extract the coefficients using broom, and make a single data frame with `map_df()` and then visualise the distributions with ggplot2:
```{r}
coef <- mod %>%
  map_df(broom::tidy, .id = "i")
coef
library(ggplot2)
ggplot(coef, aes(estimate)) +
  geom_histogram(bins = 10) +
  facet_wrap(~term, scales = "free_x")
```
But we're most interested in the quality of the models, so we make predictions for each test data set and compute the root mean squared distance between predicted and actual:
```{r}
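pred <- map2(mod, tst, predict)
actl <- map(tst, "mpg")
# Root mean squared distance between predictions and actual values
msd <- function(x, y) sqrt(mean((x - y) ^ 2))
mse <- map2_dbl(pred, actl, msd)
mean(mse)
# Baseline: the same model fit to, and evaluated on, the full dataset
base_mod <- lm(mpg ~ wt, data = mtcars)
base_mse <- msd(mtcars$mpg, predict(base_mod))
base_mse
ggplot(data.frame(mse = mse), aes(mse)) +
  geom_histogram(binwidth = 0.25) +
  geom_vline(xintercept = base_mse, colour = "red")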
```
### Data frames
Why you should store related vectors (even if they're lists!) in a
data frame. Need example that has some covariates so you can (e.g.)
select all models for females, or under 30s, ...
## "Tidying" lists
I don't know how to put this stuff in words yet, but I know it
when I see it, and I have a good intuition for what operation you
should do at each step. This is where I was 5 years ago with tidy data: I
can do it, but it's so internalised that I don't know what I'm doing
and I don't know how to teach it to other people.
Two key tools:
* `flatten()`, `flatmap()`, and `lmap()`: sometimes a list doesn't have quite
the right grouping level and you need to change it
* `transpose()`: sometimes a list is "inside out"
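As a small sketch of the "inside out" case, consider a list of result/error pairs like the ones produced in the section on dealing with failure. The list `outcomes` below is hand-built with made-up values:

```{r}
library(purrr)

# Hypothetical outcomes: one list per input, each with result and error
outcomes <- list(
  list(result = 0, error = NULL),
  list(result = NULL, error = "non-numeric argument")
)

# transpose() turns a list of pairs into a pair of lists
inside_out <- transpose(outcomes)
str(inside_out)
```

After transposing, `inside_out$result` and `inside_out$error` each have one entry per input, which is the same shape you get from extracting each component separately with `map()`.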
Challenges: various weird json files?