Merge pull request #44 from radugrosu/patch-7

Update lists.Rmd
Hadley Wickham 2016-02-13 09:52:05 -06:00
commit b45e701434
1 changed file with 23 additions and 23 deletions


@ -10,7 +10,7 @@ library(purrr)
source("common.R")
```
In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You've already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
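For example, here is a small nested list of my own (not from the chapter) that shows the tree-like structure:
```{r}
# a list can contain other lists, so the structure forms a tree
z <- list(a = 1:3, b = list(b1 = "x", b2 = list(b3 = TRUE)))
str(z)
```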
If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be on functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specifics. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc) solves a similar problem, but purrr is more consistent and easier to learn.
@ -23,11 +23,11 @@ The goal of using purrr functions instead of for loops is to allow you break com
1. If you're solving a complex problem, how can you break it down into
bite sized pieces that allow you to advance one small step towards a
solution? With purrr, you get lots of small pieces that you can
combose together with the pipe.
compose together with the pipe.
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
In later chapters you'll learn how to apply these ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you learn in this chapter will be invaluable.
In later chapters you'll learn how to apply these ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you'll learn in this chapter will be invaluable.
<!--
## Warm ups
@ -90,7 +90,7 @@ knitr::include_graphics("diagrams/lists-structure.png")
the hierarchy.
* The orientation of the children (i.e. rows or columns) isn't important,
so I pick a row or column orientation to either save space or illustrate
so I'll pick a row or column orientation to either save space or illustrate
an important property in the example.
### Subsetting
@ -135,7 +135,7 @@ knitr::include_graphics("diagrams/lists-subsetting.png")
### Lists of condiments
It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help remember these differences:
It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:
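In code, with a toy list of my own (the pepper image below makes the same point visually):
```{r}
y <- list(a = 1:3, b = "a string", c = pi)
str(y["a"])   # `[` always returns a smaller list
str(y[["a"]]) # `[[` pulls out the component itself
```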
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper.jpg")
@ -252,7 +252,7 @@ col_summary(df, median)
col_summary(df, min)
```
The idea of using a function as an argument to another function is extremely powerful. It might take you a while to wrap your head around it, but it's worth the investment. In the rest of the chapter, you'll learn about and use the purrr package which provides a set of functions that eliminate the need for for-loops for many comon scenarios.
The idea of using a function as an argument to another function is extremely powerful. It might take you a while to wrap your head around it, but it's worth the investment. In the rest of the chapter, you'll learn about and use the purrr package which provides a set of functions that eliminate the need for for-loops for many common scenarios.
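The `col_summary()` called above is defined earlier in the chapter; a sketch of what such a function looks like (not necessarily the chapter's exact code) is:
```{r}
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])  # the supplied function is applied to each column
  }
  out
}
```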
### Exercises
@ -277,7 +277,7 @@ This pattern of looping over a list and doing something to each element is so co
it's called exclusively for its side effects, so it's described in more detail
later in [walk](#walk).
Each functions takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The type of the vector is determine by the specific map function. Usually you want to use the most specific avaiable; using `map()` only as a fallback when there is no specialised equivalent available.
Each function takes a list as input, applies a function to each piece, and then returns a new vector that's the same length as the input. The type of the vector is determined by the specific map function. Usually you want to use the most specific variant available, falling back to `map()` only when there is no specialised equivalent.
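As a quick illustration of how the output type follows the suffix (my own toy input, not the chapter's):
```{r}
vals <- list(1:5, c(0.2, 0.5), 10)
map_dbl(vals, mean)   # always returns a double vector
map_int(vals, length) # always returns an integer vector
map(vals, range)      # plain map() falls back to returning a list
```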
We can use these functions to perform the same computations as the previous for loops:
@ -298,7 +298,7 @@ There are a few differences between `map_*()` and `compute_summary()`:
character vector, or an integer vector. You'll learn about those handy
shortcuts in the next section.
* Any arguments after `.f` will be passed on to it each time its called:
* Any arguments after `.f` will be passed on to it each time it's called:
```{r}
map_dbl(x, mean, trim = 0.5)
@ -313,7 +313,7 @@ There are a few differences between `map_*()` and `compute_summary()`:
### Shortcuts
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each individual in a dataset. The following toy example splits the up the `mtcars` dataset in to three pieces (only for each value of cylinder) and fits the same linear model to each piece:
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits up the `mtcars` dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:
```{r}
models <- mtcars %>%
@ -329,9 +329,9 @@ models <- mtcars %>%
map(~lm(mpg ~ wt, data = .))
```
Here I've used `.` as a pronoun: it refers to the "current" list element (in the same way that `i` referred to the number in the for loop). You can also use `.x` and `.y` to refer to up to two arguments. If you want to create an function with more than two arguments, do it the regular way!
Here I've used `.` as a pronoun: it refers to the current list element (in the same way that `i` referred to the current index in the for loop). You can also use `.x` and `.y` to refer to up to two arguments. If you want to create a function with more than two arguments, do it the regular way!
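A minimal sketch of those pronouns with toy inputs of my own (`map2_dbl()` is covered under parallel maps later in the chapter):
```{r}
map_dbl(c(1, 2, 3), ~ . * 10)            # `.` is the current element
map2_dbl(c(1, 2), c(10, 20), ~ .x + .y)  # `.x` and `.y` name two parallel inputs
```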
When you're looking at many models, you might want to extract a summary statistic like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous funtions:
When you're looking at many models, you might want to extract a summary statistic like the $R^2$. To do that we need to first run `summary()` and then extract the component called `r.squared`. We could do that using the shorthand for anonymous functions:
```{r}
models %>%
@ -379,9 +379,9 @@ If you're familiar with the apply family of functions in base R, you might have
c(0.39, 0.01, 0.38, 0.87, 0.34)
)
threshhold <- function(x, cutoff = 0.8) x[x > cutoff]
str(sapply(x1, threshhold))
str(sapply(x2, threshhold))
threshold <- function(x, cutoff = 0.8) x[x > cutoff]
str(sapply(x1, threshold))
str(sapply(x2, threshold))
```
* `vapply()` is a safe alternative to `sapply()` because you supply an additional
@ -421,7 +421,7 @@ The map functions apply a function to every element in a list. They are the most
### Extracting deeply nested elements
Some times you get data structures that are very deeply nested. A common source of sych data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:
Sometimes you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:
```{r}
# From https://api.github.com/repos/hadley/r4ds/issues
@ -504,7 +504,7 @@ You'll see an example of this in the next section, as `transpose()` is particula
It's called transpose by analogy to matrices. When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`. It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`.
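A tiny sketch of that index swap, using a toy list of my own:
```{r}
xy <- list(list(a = 1, b = 2), list(a = 3, b = 4))
xy[[2]][["a"]]
transpose(xy)[["a"]][[2]]  # the same element, reached with the indices swapped
```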
Transpose is also useful when working with JSON apis. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:
Transpose is also useful when working with JSON APIs. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:
```{r}
df <- dplyr::data_frame(x = 1:3, y = c("a", "b", "c"))
@ -526,7 +526,7 @@ When you do many operations on a list, sometimes one will fail. When this happen
In this section you'll learn how to deal with this situation using a new function: `safely()`. `safely()` is an adverb: it takes a function (a verb) and returns a modified version. In this case, the modified function will never throw an error. Instead, it always returns a list with two elements:
1. `result` is original result. If there was an error, this will be `NULL`.
1. `result` is the original result. If there was an error, this will be `NULL`.
1. `error` is an error object. If the operation was successful this will be
`NULL`.
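The `safe_log()` used just below is presumably constructed along these lines (its definition falls outside this diff):
```{r}
safe_log <- safely(log)
```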
@ -541,7 +541,7 @@ str(safe_log(10))
str(safe_log("a"))
```
When the function succeeds the `result` element contains the result and the error element is `NULL`. When the function fails, the result element is `NULL` and the error element contains an error object.
When the function succeeds the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.
`safely()` is designed to work with map:
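One typical shape of that pairing, sketched with a toy input of mine (the chapter's own example is outside this diff):
```{r}
x <- list(1, 10, "a")
y <- x %>% map(safely(log)) %>% transpose()
str(y$result)  # the successful results, NULL where log() failed
str(y$error)   # the error objects, NULL where log() succeeded
```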
@ -598,7 +598,7 @@ Purrr provides two other useful adverbs:
## Parallel maps
So far we've mapped along a single list. But often you have mutliple related lists that you need iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
So far we've mapped along a single list. But often you have multiple related lists that you need to iterate along in parallel. That's the job of the `map2()` and `pmap()` functions. For example, imagine you want to simulate some random normals with different means. You know how to do that with `map()`:
```{r}
mu <- list(5, 10, -3)
@ -667,7 +667,7 @@ params$result <- params %>% pmap(rnorm)
params
```
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea when we explore the intersection of dplyr, purr, and model fitting.
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea when we explore the intersection of dplyr, purrr, and model fitting.
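A toy sketch of that data-frame-of-parameters pattern, with made-up values of my own:
```{r}
sim_params <- dplyr::data_frame(mean = c(5, 10), sd = c(1, 5), n = c(3, 5))
sim_params %>% pmap(rnorm) %>% str()
```
Because the column names match `rnorm()`'s argument names, `pmap()` passes them by name.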
### Invoking different functions
@ -711,7 +711,7 @@ sim %>% dplyr::mutate(
## Walk {#walk}
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or saving files to disk - the important thing is the action, not the return value. Here's a very simple example:
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value. Here's a very simple example:
```{r}
x <- list(1, "a", 3)
@ -739,7 +739,7 @@ pwalk(list(paths, plots), ggsave, path = tempdir())
Imagine we want to summarise each numeric column of a data frame. We could do it in two steps:
1. Find all numeric columns.
1. Sumarise summarise each column.
1. Summarise each column.
In code, that would look like:
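One purrr-flavoured sketch of those two steps, using a toy data frame of my own (the chapter's code may differ):
```{r}
df2 <- dplyr::data_frame(a = rnorm(5), b = rnorm(5), c = letters[1:5])
df2 %>%
  keep(is.numeric) %>%  # 1. keep only the numeric columns
  map_dbl(mean)         # 2. summarise each remaining column
```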
@ -791,7 +791,7 @@ Purrr comes with a number of predicate functions built-in:
| `is_vector()` | x | x | x | x | x | |
| `is_null()` | | | | | | x |
Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less suprising:
Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less surprising:
```{r}
is.atomic(NULL)