Rectangling polish

Hadley Wickham 2022-09-01 08:27:21 -05:00
parent 279611af8a
commit fc3641a376
1 changed files with 82 additions and 53 deletions

@ -10,17 +10,17 @@ status("polishing")
## Introduction
In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns.
This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API.
This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.
To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R.
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children in rows, and `tidyr::unnest_wider()`, which converts children into columns.
We'll then show you a few case studies, applying these simple function multiple times to solve real problems.
To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible.
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()` and `tidyr::unnest_wider()`.
We'll then show you a few case studies, applying these simple functions again and again to solve real problems.
We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.
### Prerequisites
In this chapter we'll use many functions from tidyr, a core member of the tidyverse.
We'll also use repurrrsive to provide some interesting datasets rectangling practice, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists.
We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists.
```{r}
#| label: setup
@ -33,8 +33,8 @@ library(jsonlite)
## Lists
So far we've used simple vectors like integers, numbers, characters, date-times, and factors.
These vectors are simple because they're homogeneous: every element is same type.
So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors.
These vectors are simple because they're homogeneous: every element is the same type.
If you want to store elements of different types in the same vector, you'll need a **list**, which you create with `list()`:
```{r}
@ -86,16 +86,21 @@ x5 <- list(1, list(2, list(3, list(4, list(5)))))
str(x5)
```
As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1].
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
As lists get even larger and more complex, `str()` eventually starts to fail, and you'll need to switch to `View()`[^rectangling-1].
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-lists.
[^rectangling-1]: This is an RStudio feature.
```{r}
#| label: fig-view-collapsed
#| fig.cap: >
#| The RStudio allows you to interactively explore a complex list.
#| The RStudio view lets you interactively explore a complex list.
#| The viewer opens showing only the top level of the list.
#| fig.alt: >
#| A screenshot of RStudio showing the list-viewer. It shows the
#| two children of x4: the first child is a double vector and the
#| second child is a list. A rightward facing triangle indicates that the
#| second child itself has children but you can't see them.
#| echo: false
#| out-width: NULL
knitr::include_graphics("screenshots/View-1.png", dpi = 220)
@ -106,6 +111,10 @@ knitr::include_graphics("screenshots/View-1.png", dpi = 220)
#| fig.cap: >
#| Clicking on the rightward facing triangle expands that component
#| of the list so that you can also see its children.
#| fig.alt: >
#| Another screenshot of the list-viewer having expanded the second
#| child of x4. It also has two children, a double vector and another
#| list.
#| echo: false
#| out-width: NULL
knitr::include_graphics("screenshots/View-2.png", dpi = 220)
@ -115,9 +124,12 @@ knitr::include_graphics("screenshots/View-2.png", dpi = 220)
#| label: fig-view-expand-2
#| fig.cap: >
#| You can repeat this operation as many times as needed to get to the
#| data you're interested in. Note the bottom-right corner: if you click
#| data you're interested in. Note the bottom-left corner: if you click
#| an element of the list, RStudio will give you the subsetting code
#| needed to access it, in this case `x4[[2]][[2]][[2]]`.
#| fig.alt: >
#| Another screenshot, having expanded the grandchild of x4 to see its
#| two children, again a double vector and a list.
#| echo: false
#| out-width: NULL
knitr::include_graphics("screenshots/View-3.png", dpi = 220)
@ -173,11 +185,11 @@ It's possible to put a list in a column of a `data.frame`, but it's a lot fiddli
data.frame(x = list(1:3, 3:5))
```
You can force `data.frame()` to treat a list as a list of rows by wrapping it in list `I()`, but the result doesn't print particularly usefully:
You can force `data.frame()` to treat a list as a list of rows by wrapping it in `I()`, but the result doesn't print particularly well:
```{r}
data.frame(
x = I(list(1:3, 3:5)),
x = I(list(1:2, 3:5)),
y = c("1, 2", "3, 4, 5")
)
```
@ -188,14 +200,12 @@ It's easier to use list-columns with tibbles because `tibble()` treats lists lik
## Unnesting
Now that you've learned the basics of lists and list-columns, let's explore how you can turn them back into regular rows and columns.
We'll start with very simple sample data so you can get the basic idea, and then switch to more realistic examples in the next section.
Here we'll use very simple sample data so you can get the basic idea; in the next section we'll switch to real data.
List-columns tend to come in two basic forms: named and unnamed.
When the children are **named**, they tend to have the same names in every row.
When the children are **unnamed**, the number of elements tends to vary from row-to-row.
The following code creates an example of each.
In `df1`, every element of list-column `y` has two elements named `a` and `b`.
In `df2`, the elements of list-column `y` are unnamed and vary in length.
For example, in `df1`, every element of list-column `y` has two elements named `a` and `b`.
Named list-columns naturally unnest into columns: each named element becomes a new named column.
```{r}
df1 <- tribble(
@ -204,6 +214,13 @@ df1 <- tribble(
2, list(a = 21, b = 22),
3, list(a = 31, b = 32),
)
```
When the children are **unnamed**, the number of elements tends to vary from row to row.
For example, in `df2`, the elements of list-column `y` are unnamed and vary in length from one to three.
Unnamed list-columns naturally unnest into rows: you'll get one row for each child.
```{r}
df2 <- tribble(
~x, ~y,
@ -213,9 +230,7 @@ df2 <- tribble(
)
```
Named list-columns naturally unnest into columns: each named element becomes a new named column.
Unnamed list-columns naturally unnested in to rows: you'll get one row for each child.
tidyr provides two functions for these two case: `unnest_wider()` and `unnest_longer()`.
tidyr provides two functions for these two cases: `unnest_wider()` and `unnest_longer()`.
The following sections explain how they work.
### `unnest_wider()`
@ -227,7 +242,7 @@ df1 |>
unnest_wider(y)
```
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
By default, the names of the new columns come exclusively from the names of the list elements, but you can use the `names_sep` argument to request that they combine the column name and the element name.
This is useful for disambiguating repeated names.
```{r}
@ -255,7 +270,7 @@ df2 |>
```
Note how `x` is duplicated for each element inside of `y`: we get one row of output for each element inside the list-column.
But what happens if the list-column is empty, as in the following example?
But what happens if one of the elements is empty, as in the following example?
```{r}
df6 <- tribble(
@ -270,15 +285,15 @@ df6 |> unnest_longer(y)
We get zero rows in the output, so the row effectively disappears.
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
You can also unnest named list-columns, like `df1$y` into the rows.
Because the elements are named, and those names might be useful data, puts them in a new column with the suffix `_id`:
You can also unnest named list-columns, like `df1$y`, into rows.
Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix `_id`:
```{r}
df1 |>
unnest_longer(y)
```
If you don't want these `ids`, you can suppress this with `indices_include = FALSE`.
If you don't want these `ids`, you can suppress them with `indices_include = FALSE`.
On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
You can do this with `indices_include = TRUE`:
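As a minimal sketch of this option, the following uses a small hypothetical tibble, `df_demo`, which is not one of the chapter's datasets. With an unnamed list-column, the index column (named with the `_id` suffix) should hold integer positions rather than names.

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: an unnamed list-column whose elements vary in length.
df_demo <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
)

# One output row per child of `y`, plus `y_id` recording each child's position.
out <- df_demo |>
  unnest_longer(y, indices_include = TRUE)
out
```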
@ -311,7 +326,7 @@ df4 |>
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is a still a list, and each component of that list contains something different.
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite: every element is still a list, even though the contents of each element are a different type.
What happens if you find this problem in a dataset you're trying to rectangle?
There are two basic options.
@ -328,8 +343,7 @@ Another option would be to filter down to the rows that have values of a specifi
```{r}
df4 |>
unnest_longer(y) |>
rowwise() |>
filter(is.numeric(y))
filter(map_lgl(y, is.numeric))
```
Then you can call `unnest_longer()` once more:
@ -337,20 +351,21 @@ Then you can call `unnest_longer()` once more:
```{r}
df4 |>
unnest_longer(y) |>
rowwise() |>
filter(is.numeric(y)) |>
filter(map_lgl(y, is.numeric)) |>
unnest_longer(y)
```
You'll learn more about `map_lgl()` in @sec-iteration.
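To see why `map_lgl()` is needed here, it can help to try it on a plain list first. The sketch below uses a hypothetical list called `mixed`: calling `is.numeric()` on the list as a whole returns a single `FALSE`, whereas `map_lgl()` applies it to each element and returns one logical value per element.

```{r}
library(purrr)

# Hypothetical list with elements of different types.
mixed <- list(1, "a", TRUE, 2.5)

# The list itself is not numeric, so this is a single FALSE.
whole <- is.numeric(mixed)

# One TRUE/FALSE per element, suitable for use inside filter().
is_num <- map_lgl(mixed, is.numeric)
is_num

# Keep only the numeric elements.
mixed[is_num]
```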
### Other functions
tidyr has a few other useful rectangling functions that we're not going to cover in this book:
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book.
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
These are good to know about when you're other people's code and for tackling rarer rectangling challenges.
These are good to know about when you're reading other people's code or tackling rarer rectangling challenges.
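As a rough illustration of `hoist()`, the sketch below pulls two components out of a nested list-column in one step. The tibble `df_demo` and its `meta` column are hypothetical, invented for this example. Each named argument gives the new column name and the path to the component, where a path of more than one step is supplied as a list.

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: each element of `meta` is a named list that itself
# contains a nested named list, `owner`.
df_demo <- tibble(
  id = 1:2,
  meta = list(
    list(name = "a", owner = list(login = "x", id = 10)),
    list(name = "b", owner = list(login = "y", id = 20))
  )
)

# Extract `name` from the top level and `login` from inside `owner`;
# whatever is not hoisted stays behind in `meta`.
out <- df_demo |>
  hoist(meta, name = "name", owner_login = list("owner", "login"))
out
```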
### Exercises
@ -370,13 +385,12 @@ These are good to know about when you're other people's code and for tackling ra
## Case studies
So far you've learned about the simplest case of list-columns, where rectangling only requires a single call to `unnest_longer()` or `unnest_wider()`.
The main difference between real data and these simple examples is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and `unnest_wider()`.
This section will work through four real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and/or `unnest_wider()`.
This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we've encountered in the wild.
### Very wide data
We'll start by exploring `gh_repos`.
We'll start with `gh_repos`.
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
@ -389,7 +403,7 @@ repos
This tibble contains 6 rows, one row for each child of `gh_repos`.
Each row contains an unnamed list with either 26 or 30 elements.
Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row:
Since these are unnamed, we'll start with `unnest_longer()` to put each child in its own row:
```{r}
repos |>
@ -437,6 +451,8 @@ repos |>
unnest_wider(owner)
```
<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
Rather than following the advice to use `names_repair` (which would also work), we'll instead use `names_sep`:
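The same name clash can be reproduced on a much smaller scale. The sketch below assumes a hypothetical tibble, `df_demo`, with an `id` column and an `owner` list-column that also contains an `id`; `names_sep = "_"` prefixes the new columns with the list-column's name, so the two ids no longer collide.

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: `owner` contains its own `id`, which would clash
# with the top-level `id` column if unnested without a prefix.
df_demo <- tibble(
  id = 1:2,
  owner = list(
    list(id = 10, login = "x"),
    list(id = 20, login = "y")
  )
)

# New columns are named owner_id and owner_login.
out <- df_demo |>
  unnest_wider(owner, names_sep = "_")
out
```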
@ -461,14 +477,14 @@ chars <- tibble(json = got_chars)
chars
```
The `json` column contains named values, so we'll start by widening it:
The `json` column contains named elements, so we'll start by widening it:
```{r}
chars |>
unnest_wider(json)
```
And selecting a few columns just to make it easier to read:
And selecting a few columns to make it easier to read:
```{r}
characters <- chars |>
@ -508,16 +524,15 @@ titles <- chars |>
titles
```
Now, for example, we could use this table to all the characters that are captains and see all their titles:
Now, for example, we could use this table to find all the characters that are captains and see all their titles:
```{r}
captains <- titles |> filter(str_detect(title, "Captain"))
captains
characters |>
semi_join(captains, by = "id") |>
select(id, name) |>
left_join(titles, by = "id", multiple = "all")
inner_join(titles, by = "id", multiple = "all")
```
You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
@ -540,7 +555,7 @@ titles |>
unnest_longer(word)
```
And then we can count that column to find the most common:
And then we can count that column to find the most common words:
```{r}
titles |>
@ -680,6 +695,7 @@ If these case studies have whetted your appetite for more real-life rectangling,
Why does it work for `got_chars` but might not work in general?
```{r}
#| results: false
tibble(json = got_chars) |>
unnest_wider(json) |>
select(id, where(is.list)) %>%
@ -699,7 +715,7 @@ If these case studies have whetted your appetite for more real-life rectangling,
## JSON
All of the case studies in the previous section were sourced from wild-caught JSON files.
All of the case studies in the previous section were sourced from wild-caught JSON.
JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data.
It's important to understand because, while JSON and R's data types are pretty similar, there isn't a perfect 1-to-1 mapping, and knowing a bit about JSON helps when things go wrong.
@ -709,27 +725,28 @@ JSON is a simple format designed to be easily read and written by machines, not
It has six key data types.
Four of them are scalars:
- The simplest type is a null, which is written `null`, which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
- A **string** is much like a string in R, but must use double quotes, not single quotes.
- A **number** is similar to R's numbers: they can be integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
- A **boolean** is similar to R's `TRUE` and `FALSE`, but use lower case `true` and `false`.
- The simplest type is a null (`null`), which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
- A **string** is much like a string in R, but must always use double quotes.
- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
- A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.
JSON's strings, numbers, and booleans are pretty similar to R's character, numeric, and logical vectors.
The main difference is that JSON's scalars can only represent a single value.
To represent multiple values you need to use one of the two remaining types, arrays and objects.
To represent multiple values you need to use one of the two remaining types: arrays and objects.
Both arrays and objects are similar to lists in R; the difference is whether or not they're named.
An **array** is like an unnamed list, and is written with `[]`.
For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
An **object** is like a named list, and it's written with `{}`.
An **object** is like a named list, and is written with `{}`.
The names (keys in JSON terminology) are strings, so must be surrounded by quotes.
For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
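To make the mapping between these types and R a little more concrete, here is a small sketch that parses a hypothetical JSON string, `json_demo`, with `jsonlite::parse_json()`. The string is made up for illustration and uses each of the types described above; `str()` shows how the object becomes a named list and the array becomes an unnamed list.

```{r}
library(jsonlite)

# Hypothetical JSON: an object containing a string, a number, a boolean,
# a null, and an array.
json_demo <- '{
  "name": "example",
  "count": 3,
  "active": true,
  "missing": null,
  "values": [1, 2, 3]
}'

# The object becomes a named list; the array becomes an unnamed list.
parsed <- parse_json(json_demo)
str(parsed)
```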
### jsonlite
To convert JSON into R data structures, we recommend that you use the jsonlite package, by Jeroen Oooms.
To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms.
We'll use only two jsonlite functions: `read_json()` and `parse_json()`.
In real life, you'll use `read_json()` to read a JSON file from disk.
For example, the repurrsive package also provides the source for `gh_user` as a JSON file:
For example, the repurrrsive package also provides the source for `gh_user` as a JSON file and you can read it with `read_json()`:
```{r}
# A path to a json file inside the package:
@ -767,6 +784,7 @@ json <- '[
]'
df <- tibble(json = parse_json(json))
df
df |>
unnest_wider(json)
```
@ -785,6 +803,7 @@ json <- '{
'
df <- tibble(json = list(parse_json(json)))
df
df |>
unnest_wider(json) |>
unnest_longer(results) |>
@ -828,3 +847,13 @@ Apply `readr::parse_double()` as needed to the get correct variable type.
df_col <- tibble(json = list(json_col))
df_row <- tibble(json = json_row)
```
## Summary
In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames.
Surprisingly, we only need two new functions: `unnest_longer()` to put list elements into rows and `unnest_wider()` to put list elements into columns.
It doesn't matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.
JSON is the most common data format returned by web APIs.
What happens if the website doesn't have an API, but you can see data you want on the website?
That's the topic of the next chapter: web scraping, extracting data from HTML webpages.