Drop handling hierarchy

It's just a bit too raw - and rather than polishing it, it would be better to put the time in to (e.g.) ggplot2 scales
hadley 2016-08-15 09:18:56 -05:00
parent 9d7851318d
commit 92d7665920
7 changed files with 64 additions and 16 deletions


@@ -22,7 +22,6 @@ rmd_files: [
"functions.Rmd",
"vectors.Rmd",
"iteration.Rmd",
"hierarchy.Rmd",
"model.Rmd",
"model-basics.Rmd",


@@ -1,7 +1,16 @@
# Handling hierarchy {#hierarchy}
# Hierarchical data {#hierarchy}
## Introduction
This chapter belongs in [wrangle](#wrangle-intro): it will give you a set of tools for working with hierarchical data, such as the deeply nested lists you often get when working with JSON. However, you can only learn it now because working with hierarchical structures requires some programming skills, particularly an understanding of data structures, functions, and iteration. Now that you have those tools under your belt, you can learn how to work with hierarchical data.
As well as tools to simplify iteration, purrr provides tools for handling deeply nested lists. There are three common sources of such data:
* JSON and XML
*
The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:
* You can extract deeply nested elements in a single call by supplying
@@ -19,7 +28,7 @@ This chapter focusses mostly on purrr. As well as the tools for iteration that y
library(purrr)
```
## Extracting deeply nested elements
## Initial exploration
Sometimes you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:
@@ -28,16 +37,35 @@ Sometimes you get data structures that are very deeply nested. A common source o
issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE)
```
There are eight issues, and each issue is a nested list:
You might be tempted to use `str()` on this data. Unfortunately, however, `str()` is not designed for lists that are both deep and wide, and you'll tend to get overwhelmed by the output. A better strategy is to pull the list apart piece by piece.
First, figure out how many elements are in the list, take a look at one, and then check that they all have the same structure. In this case there are eight elements, and the first element is another list.
```{r}
length(issues)
str(issues[[1]])
```
(In this case we got lucky and the structure is (just) simple enough to print out with `str()`. If you're unlucky, you may need to repeat this procedure.)
```{r}
tibble::tibble(
i = seq_along(issues),
names = issues %>% map(names)
) %>%
tidyr::unnest(names) %>%
table() %>%
t()
```
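A quicker check along the same lines (a sketch, assuming the `issues` list loaded above) is to confirm that every element has exactly the same names as the first:

```{r}
issues %>% map(names) %>% map_lgl(identical, names(issues[[1]]))
```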
Another alternative is the __listviewer__ package, <https://github.com/timelyportfolio/listviewer>.
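For example, something like the following opens an interactive tree view of the list (not run here; a sketch assuming listviewer's `jsonedit()` interface):

```{r, eval = FALSE}
# Browse the nested list interactively in the viewer pane
listviewer::jsonedit(issues)
```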
## Extracting deeply nested elements
To work with this sort of data, you typically want to turn it into a data frame by extracting the related vectors that you're most interested in:
```{r}
issues %>% map_int("id")
issues %>% map_lgl("locked")
issues %>% map_chr("state")
@@ -58,6 +86,33 @@ issues %>% map_chr(c("user", "login"))
issues %>% map_int(c("user", "id"))
```
What happens if that path is missing in some of the elements? For example, let's try to extract the HTML URL of the pull request:
```{r, error = TRUE}
issues %>% map_chr(c("pull_request", "html_url"))
```
Unfortunately that doesn't work. Whenever you see an error from purrr complaining about the "type" of the result, it's because it's trying to coerce the results into an atomic vector (here a character vector). You can diagnose the problem more easily if you use `map()`:
```{r}
issues %>% map(c("pull_request", "html_url"))
```
To get the results into a character vector, we need to tell purrr what it should change `NULL` to. You can do that with the `.null` argument. The most common value to use is `NA`:
```{r}
issues %>% map_chr(c("pull_request", "html_url"), .null = NA)
```
(You might wonder why that isn't the default value since it's so useful. Well, if it was the default, you'd never get an error message if you had a typo in the names. You'd just get a vector of missing values. That would be annoying to debug because it's a silent failure.)
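To see the danger, consider a sketch (not run) with a hypothetical typo in the path: if `NA` were the default, this would silently return eight missing values instead of an error.

```{r, eval = FALSE}
# "pull_requst" is a deliberate typo: with .null = NA you'd get all NAs,
# and nothing to tell you the path was wrong
issues %>% map_chr(c("pull_requst", "html_url"), .null = NA)
```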
It's possible to mix positional and named indexing by using a list:
```{r}
issues %>% map_chr(list("pull_request", 1), .null = NA)
```
## Removing a level of hierarchy
As well as indexing deeply into hierarchy, it's sometimes useful to flatten it. That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`. In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.
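A minimal sketch of that progression, using an invented nested list:

```{r}
x <- list(list(c(1, 2), c(3, 4)), list(c(5, 6)))
x %>% flatten() %>% str()          # a list of three double vectors
x %>% flatten() %>% flatten_dbl()  # a single double vector: 1 2 3 4 5 6
```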


@@ -628,6 +628,6 @@ To get other types of data into R, we recommend starting with the tidyverse pack
__RSQLite__, __RPostgreSQL__ etc) allows you to run SQL queries against a
database and return a data frame.
For hierarchical data: use __jsonlite__ (by Jeroen Ooms) for JSON, and __xml2__ for XML. You will need to convert them to data frames using the tools on [handling hierarchy].
For hierarchical data: use __jsonlite__ (by Jeroen Ooms) for JSON, and __xml2__ for XML.
For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [__rio__](https://github.com/leeper/rio) package.


@@ -262,7 +262,7 @@ str(out)
str(unlist(out))
```
Here I've used `unlist()` to flatten a list of vectors into a single vector. You'll learn about other options in [Removing a level of hierarchy].
Here I've used `unlist()` to flatten a list of vectors into a single vector. A stricter option is to use `purrr::flatten_dbl()` - it will throw an error if the input isn't a list of doubles.
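For example (a small sketch with invented values):

```{r, error = TRUE}
out <- list(c(1.5, 2.5), 3.5)
unlist(out)                      # happily flattens almost anything
purrr::flatten_dbl(out)          # same result here...
purrr::flatten_dbl(list(1, "a")) # ...but errors on a non-double element
```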
This pattern occurs in other places too:
@@ -657,7 +657,7 @@ y <- x %>% map(safely(log))
str(y)
```
This would be easier to work with if we had two lists: one of all the errors and one of all the output. That's easy to get with `purrr::transpose()` (you'll learn more about `transpose()` in [Switching levels in the hierarchy])
This would be easier to work with if we had two lists: one of all the errors and one of all the output. That's easy to get with `purrr::transpose()`:
```{r}
y <- y %>% transpose()
@@ -789,7 +789,7 @@ params %>%
pmap(rnorm)
```
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns. We'll come back to this idea in [Handling hierarchy], and again when we explore the intersection of dplyr, purrr, and model fitting.
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns.
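Here's a self-contained sketch of that data-frame approach, with invented parameters (`pmap()` matches the column names to `rnorm()`'s arguments):

```{r}
params <- tibble::tribble(
  ~mean, ~sd, ~n,
      5,   1,  3,
     10,   2,  5
)
params %>% purrr::pmap(rnorm)
```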
### Invoking different functions


@@ -15,7 +15,7 @@ In this chapter you're going to learn three powerful ideas that help you to work
because once you have tidy data, you can apply all of the techniques that
you've learned about earlier in the book.
These ideas are particularly powerful in conjunction with the ideas of functional programming, so make sure you've read [iteration] and [handling hierarchy] before starting this chapter.
These ideas are particularly powerful in conjunction with the ideas of functional programming, so make sure you've read [iteration] before starting this chapter.
We'll start by diving in to a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.


@@ -32,12 +32,6 @@ In the following chapters, you'll learn important programming skills:
grounding in R's data structures provided by [vectors]. You must master
the four common atomic vectors, the three important S3 classes built on
top of them, and understand the mysteries of the list and data frame.
1. One of the particularly important data structures in R is the list.
Lists are important because a list can contain other lists, so it is
__hierarchical__. Two common scenarios where hierarchical structures
arise are JSON, and fitting many models. In [handling hierarchy] you'll
learn new tools to handle these problems as easily as possible.
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is the key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely your first attempt will be clear.)


@@ -551,7 +551,7 @@ This is a common pattern for stringr functions, because working with a single ma
str_extract_all(more, colour_match)
```
You'll learn more about lists in [lists](#lists) and [handling hierarchy].
You'll learn more about lists in [lists](#lists) and [iteration].
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest: