Polish list, loops, & map() intro.

hadley 2015-11-19 07:03:51 +13:00
parent d30bded405
commit 269867d60c
6 changed files with 212 additions and 78 deletions

28
images/embed_jpg.R Normal file

@ -0,0 +1,28 @@
is_latex <- function() {
  identical(knitr::opts_knit$get('rmarkdown.pandoc.to'), "latex")
}
embed_jpg <- function(path, dpi) {
  dim <- jpg_dim(path)
  if (is_latex()) {
    width <- round(dim[2] / dpi, 2)
    knitr::asis_output(paste0(
      "\\includegraphics[",
      "width=", width, "in",
      "]{", path, "}"
    ))
  } else {
    knitr::asis_output(paste0(
      "<img src='", path, "'",
      " width='", round(dim[2] / (dpi / 96)), "'",
      " height='", round(dim[1] / (dpi / 96)), "'",
      " />"
    ))
  }
}
jpg_dim <- function(path) {
  dim(jpeg::readJPEG(path, native = TRUE))
}

BIN
images/pepper-1.jpg Normal file

Binary file not shown.

Size: 186 KiB

BIN
images/pepper-2.jpg Normal file

Binary file not shown.

Size: 102 KiB

BIN
images/pepper-3.jpg Normal file

Binary file not shown.

Size: 67 KiB

BIN
images/pepper.jpg Normal file

Binary file not shown.

Size: 176 KiB

262
lists.Rmd

@ -1,6 +1,6 @@
---
layout: default
title: String manipulation
title: List manipulation
output: bookdown::html_chapter
---
@ -8,27 +8,27 @@ output: bookdown::html_chapter
library(purrr)
set.seed(1014)
options(digits = 3)
source("images/embed_jpg.R")
```
# Lists
In this chapter, you'll learn how to handle lists, R's primary hierarchical data structure. Lists are sometimes called recursive data structures, because they're one of the few data structures in R that can contain themselves; a list can have a list as a child.
In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. Lists allow you to do this because, unlike vectors, a list can contain other lists.
If you've worked with list-like objects in other environments, you're probably familiar with the for-loop. We'll discuss for loops a little here, but we'll mostly focus on a number of functions from the __purrr__ package. The purrr package is designed to make it easy to work with lists by taking care of the details and allowing you to focus on the specific transformation, not the generic boilerplate.
If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specific details. This is the same idea as the apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc), but purrr is more consistent and easier to learn.
The goal is to allow you to think only about:
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
1. Each element of the list in isolate. You need to figure out how to
manipulate a single element of the list; purrr takes care of generalising
that to every element in the list.
1. How can you solve the problem for a single element of the list? Once
you've solved that problem, purrr takes care of generalising your
solution to every element in the list.
1. How do you move that element a small step towards your final goal.
Purrr provides lots of small pieces that you compose together to
solve complex problems.
1. If you're solving a complex problem, how can you break it down into
bite sized pieces that allow you to advance one small step towards a
solution? With purrr, you get lots of small pieces that you can
compose together with the pipe.
Together, these features allow you to tackle complex problems by dividing them up into bite size pieces. The resulting code is easy to understand when you re-read it in the future.
Many of the functions in purrr have equivalent in base R. We'll provide you with a few guideposts into base R, but we'll focus on purrr because its functions are more consistent and have fewer surprises.
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
<!--
## Warm ups
@ -43,23 +43,93 @@ Many of the functions in purrr have equivalent in base R. We'll provide you with
## List basics
* Creating
* `[` vs `[[`
* `str()`
## A common pattern of for loops
Let's start by creating a stereotypical list: a 10 element list where each element contains some random values:
To create a list, you use the `list()` function:
```{r}
x <- rerun(10, runif(sample(10, 1)))
x <- list(1, 2, 3)
str(x)
```
Imagine we want to compute the length of each element in this list. We might use a for loop:
Unlike atomic vectors, lists can contain a mix of objects:
```{r}
results <- vector("numeric", length(x))
y <- list("a", 1L, 1.5, TRUE)
str(y)
```
`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.
Lists can even contain other lists!
```{r}
z <- list(list(1, 2), list(3, 4))
str(z)
```
There are three ways to subset a list:
* `[` extracts a sub-list. The result will always be a list.
```{r}
str(y[1:3])
str(y[1])
```
* `[[` extracts a single component from a list.
```{r}
str(y[[1]])
str(y[[3]])
```
* `$` is a shorthand for extracting named elements of a list. It works
very similarly to `[[` except that you don't need to use quotes.
```{r}
a <- list(x = 1:2, y = 3:4)
a$x
a[["y"]]
```
It's easy to get confused between `[` and `[[`, but understanding the difference is critical when working with lists. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember:
```{r, echo = FALSE}
embed_jpg("images/pepper.jpg", 300)
```
If this pepper shaker is your list `x`, then `x[1]` is a pepper shaker containing a single pepper packet:
```{r, echo = FALSE}
embed_jpg("images/pepper-1.jpg", 300)
```
`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.
`x[[1]]` is:
```{r, echo = FALSE}
embed_jpg("images/pepper-2.jpg", 300)
```
If you wanted to get the content of the pepper packet, you'd need `x[[1]][[1]]`:
```{r, echo = FALSE}
embed_jpg("images/pepper-3.jpg", 300)
```
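The pepper shaker analogy can also be checked directly in code. The following is a minimal sketch, assuming a hypothetical `shaker` list (not part of the chapter) in which each packet is itself a list holding its contents:

```{r}
# Hypothetical stand-in for the pepper shaker: a list of packets,
# where each packet is itself a list holding its contents.
shaker <- list(
  list("pepper a"),
  list("pepper b"),
  list("pepper c")
)

one_packet_shaker <- shaker[1]   # `[` keeps the shaker: a list of length 1
packet <- shaker[[1]]            # `[[` pulls out the packet (itself a list)
contents <- shaker[[1]][[1]]     # a second `[[` opens the packet

str(one_packet_shaker)
str(packet)
str(contents)
```

Comparing the three `str()` outputs shows how each level of `[[` removes one layer of list structure, while `[` never does.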
## A common pattern of for loops
Let's start by creating a stereotypical list: an eight element list where each element contains a random vector of random length. (You'll learn `rerun()` later.)
```{r}
x <- rerun(8, runif(sample(5, 1)))
str(x)
```
Imagine we want to compute the length of each element in this list. One way to do that is with a for loop:
```{r}
results <- vector("integer", length(x))
for (i in seq_along(x)) {
  results[i] <- length(x[[i]])
}
@ -68,29 +138,29 @@ results
There are three parts to a for loop:
1. We start by creating a place to store the results of the for loop. We use
`vector()` to create an integer vector that's the same length as the input.
It's important to make sure we allocate enough space for all the results
up front, otherwise we'll need to grow the results multiple times which
is slow.
1. The __results__: `results <- vector("integer", length(x))`.
This creates an integer vector the same length as the input. It's important
to allocate enough space for all the results up front, otherwise you have to grow the
results vector at each iteration, which is very slow for large loops.
1. We determine what to loop over: `i in seq_along(l)`. Each run of the for
loop will assign `i` to a different value from `seq_along(l)`.
`seq_along(l)` is equivalent to the more familiar `1:length(l)`
with one important difference.
1. The __sequence__: `i in seq_along(x)`. This determines what to loop over:
each run of the for loop will assign `i` to a different value from
`seq_along(x)`, shorthand for `1:length(x)`.
What happens if `l` is length zero? Well, `length(l)` will be 0 so we
get `1:0` which yields `c(1, 0)`. That's likely to cause problems! You
may be sceptical that such a problem would ever occur to you in practice,
but once you start writing production code which is run unattended, its
easy for inputs to not be what you expect. I recommend taking some common
safety measures to avoid problems in future.
1. The __body__: `results[i] <- length(x[[i]])`. This code is run repeatedly,
each time with a different value in `i`. The first iteration will run
`results[1] <- length(x[[1]])`, the second `results[2] <- length(x[[2]])`,
and so on.
1. The body of the loop - this does two things. It calculates what we're
really interested (`length()`) and then it stores it in the output
vector.
This loop used a function you might not be familiar with: `seq_along()`. This is a safe version of the more familiar `1:length(x)`. There's one important difference in behaviour: if you have a zero-length vector, `seq_along()` does the right thing:
Because we're likely to use this operation a lot, it makes sense to turn it into a function:
```{r}
y <- numeric(0)
seq_along(y)
1:length(y)
```
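To see why this matters inside a loop, here is a small sketch that counts how many times a loop body runs for an empty input. The helper `count_iterations()` and the `empty` list are hypothetical names used only for illustration:

```{r}
# Hypothetical helper: count how many times a for loop body runs
# for a given index vector.
count_iterations <- function(index) {
  n <- 0L
  for (i in index) {
    n <- n + 1L
  }
  n
}

empty <- list()

count_iterations(seq_along(empty))   # the body never runs
count_iterations(1:length(empty))    # the body runs for i = 1 and i = 0
```

With `1:length(empty)` the loop would try to process elements that don't exist, which is exactly the kind of problem that tends to show up only when code runs unattended.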
Figuring out the length of the elements of a list is a common operation, so it makes sense to turn it into a function so we can reuse it again and again:
```{r}
compute_length <- function(x) {
@ -103,7 +173,9 @@ compute_length <- function(x) {
compute_length(x)
```
Now imagine we want to compute the `mean()` of each element. How would our function change? What if we wanted to compute the `median()`?
(And in fact base R has this already: it's called `lengths()`.)
Now imagine we want to compute the `mean()` of each element. How would our function change? What if we wanted to compute the `median()`? You could create variations of `compute_length()` like this:
```{r}
compute_mean <- function(x) {
@ -125,7 +197,7 @@ compute_median <- function(x) {
compute_median(x)
```
There is a lot of duplication in these functions! Most of the code is for-loop boilerplate and it's hard to see the one function (`mean()` or `median()`) that's actually important.
But these are only two of the functions we might want to apply to every element of a list, and there's already a lot of duplication. Most of the code is for-loop boilerplate and it's hard to see the one function (`length()`, `mean()`, or `median()`) that's actually important.
What would you do if you saw a set of functions like this:
@ -141,9 +213,7 @@ You'd notice that there's a lot of duplication, and extract it in to an addition
f <- function(x, i) abs(x - mean(x)) ^ i
```
You've reduce the chance of bugs (because you now have 1/3 less code), and made it easy to generalise to new situations.
We can do exactly the same thing with `compute_length()`, `compute_median()` and `compute_mean()`:
You've reduced the chance of bugs (because you now have 1/3 less code), and made it easy to generalise to new situations. We can do exactly the same thing with `compute_length()`, `compute_median()` and `compute_mean()`:
```{r}
compute_summary <- function(x, f) {
@ -156,45 +226,83 @@ compute_summary <- function(x, f) {
compute_summary(x, mean)
```
Instead of hard coding the summary function, we allow it to vary. This is an incredibly powerful technique is is why R is known as a "function" programming language: the arguments to a function can be other functions.
Instead of hardcoding the summary function, we allow it to vary by adding an additional argument that is a function. It can take a while to wrap your head around this, but it's a very powerful technique. This is one of the reasons that R is known as a "functional" programming language.
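The body of `compute_summary()` is not visible in this diff, so the following is only a sketch of what such a function could look like, modelled on the for loop shown earlier. The names `compute_summary_sketch()` and `example_list` are hypothetical and chosen so they don't clash with the chapter's own objects:

```{r}
# A sketch of a function that takes another function, `f`, as an argument
# and applies it to each element of the list `x`.
compute_summary_sketch <- function(x, f) {
  results <- vector("numeric", length(x))   # allocate the output up front
  for (i in seq_along(x)) {
    results[i] <- f(x[[i]])                 # only this call varies with `f`
  }
  results
}

example_list <- list(c(1, 2, 3), c(10, 20), 5)

compute_summary_sketch(example_list, length)
compute_summary_sketch(example_list, mean)
compute_summary_sketch(example_list, median)
```

The loop boilerplate is written once, and the only thing that changes between the three calls is the function passed as `f`.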
This is such a common use of for loops that the purrr package has six functions that do exactly that. There's one function for each type of output:
## The map functions
* `map()`: list
* `map_lgl()`: logical vector
* `map_int()`: integer vector
* `map_dbl()`: double vector
* `map_chr()`: character vector
* `map_df()`: a data frame
This pattern of looping over a list and doing something to each element is so common that the purrr package provides a family of functions to do it for you. Each function always returns the same type of output, so there are seven variations based on what sort of result you want:
Each of these functions take a list as input, apply a function to each piece and then return a new vector that's the same length as the input. Because the first element is the list to transform, it also makes them particularly suitable for piping:
* `map()`: a list.
* `map_lgl()`: a logical vector.
* `map_int()`: an integer vector.
* `map_dbl()`: a double vector.
* `map_chr()`: a character vector.
* `map_df()`: a data frame.
* `walk()`: nothing (called exclusively for side effects).
If none of the specialised versions return exactly what you want, you can always use `map()` because a list can contain any other object.
Each of these functions takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The following code uses purrr to do the same computations we did above:
```{r}
map_int(x, length)
map_dbl(x, mean)
map_dbl(x, median)
```
Note that additional arguments to the map function are passed on to the functions being mapped. That means these two calls are equivalent:
There are a few differences between `map_*()` and `compute_summary()`:
```{r}
map_dbl(x, mean, trim = 0.5)
map_dbl(x, function(x) mean(x, trim = 0.5))
```
* They are implemented in C code. This means you can't easily understand their
implementation, but it reduces a little overhead so they run even faster
than for loops.
* The second argument, `.f`, the function to apply to each element, can be
a formula, a character vector, or an integer vector. You'll learn about
those handy shortcuts in the next section.
* You can pass on additional arguments to `.f`:
### Base equivalents
```{r}
map_dbl(x, mean, trim = 0.5)
map_dbl(x, function(x) mean(x, trim = 0.5))
```
* `lapply()` is effectively identical to `map()`. The advantage to using
`map()` is that it shares a consistent naming scheme with the other functions
in purrr. As you'll learn in the next section, `map()` functions also work
with things other than functions to save you typing.
* They preserve names:
* `sapply()` is like a box of chocolates: you'll never know what you're going
to get.
* `vapply()` is a safe alternative to `sapply()` because you supply an additional
argument that defines the type. But it's long: `vapply(df, is.numeric, logical(1))`
is equivalent to `map_lgl(df, is.numeric)`. Can also produce matrices, but
that's rarely useful.
```{r}
z <- list(x = 1:3, y = 4:5)
map_int(z, length)
```
If you're familiar with the apply family of functions in base R, you might have noticed some similarities with the purrr functions:
* `lapply()` is basically identical to `map()`. There's no advantage to using
`map()` over `lapply()` except that it's consistent with all the other
functions in purrr.
* The base `sapply()` is a wrapper around `lapply()` that automatically tries
to simplify the results. This is useful for interactive work but is
problematic in a function because you never know what sort of output
you'll get:
```{r}
df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)
str(sapply(df[1:4], class))
str(sapply(df[1:2], class))
str(sapply(df[3:4], class))
```
* `vapply()` is a safe alternative to `sapply()` because you supply an additional
argument that defines the type. The only problem with `vapply()` is that
it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to
`map_lgl(df, is.numeric)`. One advantage to `vapply()` over the map
functions is that it can also produce matrices.
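The difference between `sapply()` and `vapply()` described above can be sketched with base R alone. The `inputs` and `empty` lists below are hypothetical examples, not objects from the chapter:

```{r}
inputs <- list(a = c(1, 5, 3), b = c(2, 2))
empty <- list()

# sapply() changes its return type depending on the input:
class(sapply(inputs, length))   # an integer vector
class(sapply(empty, length))    # a list, because there was nothing to simplify

# vapply() always returns the type described by its third argument:
lens_ok <- vapply(inputs, length, integer(1))
lens_empty <- vapply(empty, length, integer(1))

# A template of length two asks for one column per element,
# so vapply() returns a matrix:
ranges <- vapply(inputs, range, numeric(2))

# A result that doesn't match the template is an error, not a silent surprise:
mismatch <- tryCatch(
  vapply(inputs, length, character(1)),
  error = function(e) "error"
)
```

Because the template fixes the output type, code that uses `vapply()` behaves the same way for empty and non-empty inputs, which is the property the `map_*()` functions share.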
## Pipelines
@ -496,8 +604,7 @@ If we wanted, we could extract the coefficients using broom, and make a single d
```{r}
coef <- mod %>%
map(broom::tidy) %>%
map_df(.id = "i")
map_df(broom::tidy, .id = "i")
coef
library(ggplot2)
@ -513,8 +620,7 @@ pred <- map2(mod, tst, predict)
actl <- map(tst, "mpg")
msd <- function(x, y) sqrt(mean((x - y) ^ 2))
# TODO: use map2_dbl when available.
mse <- map2(pred, actl, msd) %>% flatten
mse <- map2_dbl(pred, actl, msd)
mean(mse)
mod <- lm(mpg ~ wt, data = mtcars)
@ -545,6 +651,6 @@ Two key tools:
* flatten(), flatmap(), and lmap(): sometimes list doesn't have quite
the right grouping level and you need to change
* zip_n(): sometimes list is "inside out"
* transpose(): sometimes list is "inside out"
Challenges: various weird json files?