r4ds/lists.Rmd

384 lines
13 KiB
Plaintext

# Lists
```{r setup-lists, include=FALSE}
library(purrr)
```
In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specifics. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc) solve a similar problem, but purrr is more consistent and easier to learn.
The goal of using purrr functions instead of for loops is to allow you break common list manipulation challenges into independent pieces:
1. How can you solve the problem for a single element of the list? Once
you've solved that problem, purrr takes care of generalising your
solution to every element in the list.
1. If you're solving a complex problem, how can you break it down into
bite sized pieces that allow you to advance one small step towards a
solution? With purrr, you get lots of small pieces that you can
compose together with the pipe.
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
In later chapters you'll learn how to apply these ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you'll learn in this chapter will be invaluable.
<!--
## Warm ups
* What does this for loop do?
* How is a data frame like a list?
* What does `mean()` mean? What does `mean` mean?
* How do you get help about the $ function? How do you normally write
`[[`(mtcars, 1) ?
* Argument order
-->
## List basics
You create a list with `list()`:
```{r}
x <- list(1, 2, 3)
str(x)
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
```
Unlike atomic vectors, `lists()` can contain a mix of objects:
```{r}
y <- list("a", 1L, 1.5, TRUE)
str(y)
```
Lists can even contain other lists!
```{r}
z <- list(list(1, 2), list(3, 4))
str(z)
```
`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.
### Visualising lists
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists. For example, take these three lists:
```{r}
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```
I draw them as follows:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-structure.png")
```
* Lists are rounded rectangles that contain their children.
* I draw each child a little darker than its parent to make it easier to see
the hierarchy.
* The orientation of the children (i.e. rows or columns) isn't important,
so I'll pick a row or column orientation to either save space or illustrate
an important property in the example.
### Subsetting
There are three ways to subset a list, which I'll illustrate with `a`:
```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```
* `[` extracts a sub-list. The result will always be a list.
```{r}
str(a[1:2])
str(a[4])
```
Like subsetting vectors, you can use an integer vector to select by
position, or a character vector to select by name.
* `[[` extracts a single component from a list. It removes a level of
hierarchy from the list.
```{r}
str(y[[1]])
str(y[[4]])
```
* `$` is a shorthand for extracting named elements of a list. It works
similarly to `[[` except that you don't need to use quotes.
```{r}
a$a
a[["b"]]
```
Or visually:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-subsetting.png")
```
### Lists of condiments
It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper.jpg")
```
If this pepper shaker is your list `x`, then, `x[1]` is a pepper shaker containing a single pepper packet:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-1.jpg")
```
`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.
`x[[1]]` is:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-2.jpg")
```
If you wanted to get the content of the pepper package, you'd need `x[[1]][[1]]`:
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper-3.jpg")
```
### Exercises
1. Draw the following lists as nested sets.
1. Generate the lists corresponding to these nested set diagrams.
1. What happens if you subset a data frame as if you're subsetting a list?
What are the key differences between a list and a data frame?
## Handling hierarchy {#hierarchy}
The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:
* You can extract deeply nested elements in a single call by supplying
a character vector to the map functions.
* You can remove a level of the hierarchy with the flatten functions.
* You can flip levels of the hierarchy with the transpose function.
### Extracting deeply nested elements
Some times you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:
```{r}
# From https://api.github.com/repos/hadley/r4ds/issues
issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE)
```
There are eight issues, and each issue is a nested list:
```{r}
length(issues)
str(issues[[1]])
```
To work with this sort of data, you typically want to turn it into a data frame by extracting the related vectors that you're most interested in:
```{r}
issues %>% map_int("id")
issues %>% map_lgl("locked")
issues %>% map_chr("state")
```
You can use the same technique to extract more deeply nested structure. For example, imagine you want to extract the name and id of the user. You could do that in two steps:
```{r}
users <- issues %>% map("user")
users %>% map_chr("login")
users %>% map_int("id")
```
But by supplying a character _vector_ to `map_*`, you can do it in one:
```{r}
issues %>% map_chr(c("user", "login"))
issues %>% map_int(c("user", "id"))
```
### Removing a level of hierarchy
As well as indexing deeply into hierarchy, it's sometimes useful to flatten it. That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`. In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.
```{r}
x <- list(list(a = 1, b = 2), list(c = 3, d = 4))
str(x)
y <- flatten(x)
str(y)
flatten_dbl(y)
```
Graphically, that sequence of operations looks like:
```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-flatten.png")
```
Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.
Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds. Even if your data structure accidentally changes, `unlist()` will continue to work silently the wrong type of output. This tends to create problems that are frustrating to debug.
### Switching levels in the hierarchy
Other times the hierarchy feels "inside out". You can use `transpose()` to flip the first and second levels of a list:
```{r}
x <- list(
x = list(a = 1, b = 3, c = 5),
y = list(a = 2, b = 4, c = 6)
)
x %>% str()
x %>% transpose() %>% str()
```
Graphically, this looks like:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-transpose.png")
```
You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`.
It's called transpose by analogy to matrices. When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`. It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`.
Transpose is also useful when working with JSON APIs. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:
```{r}
df <- dplyr::data_frame(x = 1:3, y = c("a", "b", "c"))
df %>% transpose() %>% str()
```
### Turning lists into data frames
* Have a deeply nested list with missing pieces
* Need a tidy data frame so you can visualise, transform, model etc.
* What do you do?
* By hand with purrr, talk about `fromJSON` and `tidyJSON`
### Exercises
## Predicates
Imagine we want to summarise each numeric column of a data frame. We could do it in two steps:
1. Find all numeric columns.
1. Summarise each column.
In code, that would look like:
```{r}
col_sum <- function(df, f) {
is_num <- df %>% map_lgl(is_numeric)
df[is_num] %>% map_dbl(f)
}
```
`is_numeric()` is a __predicate__: a function that returns either `TRUE` or `FALSE`. There are a number of of purrr functions designed to work specifically with predicates:
* `keep()` and `discard()` keeps/discards list elements where the predicate is
true.
* `head_while()` and `tail_while()` keep the first/last elements of a list until
you get the first element where the predicate is true.
* `some()` and `every()` determine if the predicate is true for any or all of
the elements.
* `detect()` and `detect_index()`
We could use `keep()` to simplify the summary function to:
```{r}
col_sum <- function(df, f) {
df %>%
keep(is.numeric) %>%
map_dbl(f)
}
```
I like this formulation because you can easily read the sequence of steps.
### Built-in predicates
Purrr comes with a number of predicate functions built-in:
| | lgl | int | dbl | chr | list | null |
|------------------|-----|-----|-----|-----|------|------|
| `is_logical()` | x | | | | | |
| `is_integer()` | | x | | | | |
| `is_double()` | | | x | | | |
| `is_numeric()` | | x | x | | | |
| `is_character()` | | | | x | | |
| `is_atomic()` | x | x | x | x | | |
| `is_list()` | | | | | x | |
| `is_vector()` | x | x | x | x | x | |
| `is_null()` | | | | | | x |
Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less surprising:
```{r}
is.atomic(NULL)
is_atomic(NULL)
is.vector(factor("a"))
is_vector(factor("a"))
```
Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.
```{r}
y <- factor(c("a", "b", "c"))
is_integer(y)
is_scalar_integer(y)
is_bare_integer(y)
```
### Exercises
1. A possible base R equivalent of `col_sum()` is:
```{r}
col_sum3 <- function(df, f) {
is_num <- sapply(df, is.numeric)
df_num <- df[, is_num]
sapply(df_num, f)
}
```
But it has a number of bugs as illustrated with the following inputs:
```{r, eval = FALSE}
df <- data.frame(z = c("a", "b", "c"), x = 1:3, y = 3:1)
# OK
col_sum3(df, mean)
# Has problems: don't always return numeric vector
col_sum3(df[1:2], mean)
col_sum3(df[1], mean)
col_sum3(df[0], mean)
```
What causes the bugs?
1. Carefully read the documentation of `is.vector()`. What does it actually
test for?