638 lines
20 KiB
Plaintext
638 lines
20 KiB
Plaintext
# Data rectangling {#sec-rectangle-data}
|
||
|
||
```{r}
|
||
#| results: "asis"
|
||
#| echo: false
|
||
source("_common.R")
|
||
status("drafting")
|
||
```
|
||
|
||
## Introduction
|
||
|
||
Often you have to deal with data that is fundamentally tree-like --- rather than a rectangular structure of rows and columns, you have items that with one or more children.
|
||
In this chapter, you'll learn the art of "rectangling", taking complex hierarchical data and turning it into a data frame that you can easily work with using the tools you learned earlier in the book.
|
||
|
||
We'll start by talking about lists, an new type of vector that makes hierarchical data possible.
|
||
Then you'll learn about three key functions for rectangling from tidyr: `tidyr::unnest_longer()`, `tidyr::unnest_wider()`, and `tidyr::hoist()`.
|
||
Then see how these ideas apply to some real data from the repurrrsive package.
|
||
Finish off by talking about JSON, source of many hierarchical datasets.
|
||
|
||
### Prerequisites
|
||
|
||
In this chapter we'll continue using tidyr, which also provides a bunch of tools to rectangle your datasets.
|
||
tidyr is a member of the core tidyverse.
|
||
We'll also use repurrrsive to supply some interesting datasets to practice your rectangling skills.
|
||
We'll finish up with a little jsonlite, since JSON is a typical source of deeply nested data.
|
||
|
||
```{r}
|
||
#| label: setup
|
||
#| message: false
|
||
|
||
library(tidyverse)
|
||
library(repurrrsive)
|
||
library(jsonlite)
|
||
|
||
```
|
||
|
||
## Lists
|
||
|
||
So far we've focused on the simple vectors like integers, numbers, characters, date-times, and factors.
|
||
These all share the property that they're flat and homogeneous: every element is of the same type.
|
||
The next step up in complexity are lists, which can contain any vector.
|
||
You create a list with `list()`:
|
||
|
||
```{r}
|
||
x1 <- list(1:4, "a", TRUE)
|
||
x1
|
||
```
|
||
|
||
It's also common to name the components of a list, which works much like naming the columns of a tibble:
|
||
|
||
```{r}
|
||
x2 <- list(a = 1:2, b = 1:3, c = 1:4)
|
||
x2
|
||
```
|
||
|
||
Even for these very simple lists, printing takes up quite a lot of space, and it gets even worse as the lists get more complex.
|
||
A very useful alternative is `str()`, short for structure, because it focuses on a compact display of **str**ucture, de-emphasizing the contents:
|
||
|
||
```{r}
|
||
str(x1)
|
||
str(x2)
|
||
```
|
||
|
||
`str()` display each element (or **child**) of a list on its own line.
|
||
It displays the name if present, then an abbreviation of the type, then a sample of the values.
|
||
|
||
### Hierarchy
|
||
|
||
Lists can even contain other lists!
|
||
This makes them suitable for representing hierarchical or tree-like structures.
|
||
|
||
```{r}
|
||
x3 <- list(list(1, 2), list(3, 4))
|
||
str(x3)
|
||
```
|
||
|
||
You can see how `str()` starts to get even more useful as the lists get more complex, and you can easily see the multiple layers at a glance.
|
||
|
||
```{r}
|
||
x4 <- list(1, list(2, list(3, list(4, list(5)))))
|
||
str(x4)
|
||
```
|
||
|
||
However, at some point, even `str()` starts to fail, if you're working with deeply nested lists in RStudio, you may need to switch to `View()`.
|
||
@fig-view-collapsed shows the result of calling `View(x4)`.
|
||
The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1.
|
||
You can do this as many times as needed and RStudio will also show you the subsetting code you need to access that element, as in @fig-view-expand-2.
|
||
We'll come back to how this code works in @sec-vector-subsetting.
|
||
|
||
```{r}
|
||
#| label: fig-view-collapsed
|
||
#| fig.cap: >
|
||
#| The RStudio allows you to interactively explore a complex list.
|
||
#| The viewer opens showing only the top level of the list.
|
||
#| echo: false
|
||
#| out-width: NULL
|
||
knitr::include_graphics("screenshots/View-1.png", dpi = 220)
|
||
```
|
||
|
||
```{r}
|
||
#| label: fig-view-expand-1
|
||
#| fig.cap: >
|
||
#| Clicking on the rightward facing triangle expands that component
|
||
#| of the list so that you can also see its children.
|
||
#| echo: false
|
||
#| out-width: NULL
|
||
knitr::include_graphics("screenshots/View-2.png", dpi = 220)
|
||
```
|
||
|
||
```{r}
|
||
#| label: fig-view-expand-2
|
||
#| fig.cap: >
|
||
#| You can repeat this operation as many times as needed to get to the
|
||
#| data you're interested in. Note the bottom-right corner: if you click
|
||
#| an element of the list, RStudio will give you the subsetting code needed
|
||
#| to access it.
|
||
#| echo: false
|
||
#| out-width: NULL
|
||
knitr::include_graphics("screenshots/View-3.png", dpi = 220)
|
||
```
|
||
|
||
### List columns
|
||
|
||
You can even put lists in the column of a tibble:
|
||
|
||
```{r}
|
||
df <- tibble(
|
||
x = 1:2,
|
||
y = c("a", "b"),
|
||
z = list(1:3, 4:5)
|
||
)
|
||
df
|
||
```
|
||
|
||
This is a powerful idea because it allows you to store arbitrarily complex objects in a data frame; even things that wouldn't typically belong there.
|
||
This idea is used a lot in tidymodels, because it allows you to store things like models or resamples in a data frame.
|
||
|
||
And those things are carried along like any other column:
|
||
|
||
```{r}
|
||
df |>
|
||
filter(x == 1)
|
||
```
|
||
|
||
The default print method just displays a rough summary of the contents.
|
||
The list column could be arbitrarily complex, so there's no good way to print it.
|
||
If you want to see it, you'll need to pull the list-column out and apply of the techniques that you learned above:
|
||
|
||
```{r}
|
||
df |>
|
||
filter(x == 1) |>
|
||
pull(z) |>
|
||
str()
|
||
```
|
||
|
||
Similarly, if you `View()` a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns.
|
||
To explore those fields you'll need to `pull()` and view, e.g.
|
||
`View(pull(df, z))`
|
||
|
||
::: callout-note
|
||
## Base R
|
||
|
||
It's possible to put a list in a column of a `data.frame`, but it's a lot fiddlier.
|
||
List-columns are implicit in the definition of the data frame: a data frame is a named list of equal length vectors.
|
||
A list is a vector, so it's always been legitimate to use a list as a column of a data frame.
|
||
However, base R doesn't make it easy to create list-columns because `data.frame()` treats a list as a list of columns:
|
||
|
||
```{r}
|
||
data.frame(x = list(1:3, 3:5))
|
||
```
|
||
|
||
You can prevent `data.frame()` from doing this with `I()`, but the result doesn't print particularly well:
|
||
|
||
```{r}
|
||
data.frame(
|
||
x = I(list(1:3, 3:5)),
|
||
y = c("1, 2", "3, 4, 5")
|
||
)
|
||
```
|
||
|
||
Tibbles make it easier to work with list-columns because `tibble()` doesn't modify its inputs and the print method is designed with lists in mind.
|
||
:::
|
||
|
||
## Unnesting
|
||
|
||
Now that you've learned the basics of lists and how you can use them as a column of a data frame, lets start to see how you can turn them back into regular columns and rows so you can use them with the tidyverse functions you've already learned about.
|
||
We'll start with very simple sample data so you can get the idea of how things work, and then in the next section switch to more realistic examples.
|
||
|
||
Lists tend to come in two basic forms:
|
||
|
||
- A named list where every row has the same number of children with the same names.
|
||
- An unnamed list where the number of children varies from row to row.
|
||
|
||
The following code creates an example of each.
|
||
In `df1`, every element of list-column `y` has two elements named `a` and `b`.
|
||
If `df2`, the elements of list-column `y` are unnamed and vary in length.
|
||
|
||
```{r}
|
||
df1 <- tribble(
|
||
~x, ~y,
|
||
1, list(a = 11, b = 12),
|
||
2, list(a = 21, b = 22),
|
||
3, list(a = 31, b = 32),
|
||
)
|
||
|
||
df2 <- tribble(
|
||
~x, ~y,
|
||
1, list(11, 12, 13),
|
||
2, list(21),
|
||
3, list(31, 32),
|
||
)
|
||
```
|
||
|
||
These two cases correspond to two tools from tidyr: `unnest_wider()` and `unnest_longer()`.
|
||
Their suffixes have the same meaning as `pivot_wider()` and `pivot_longer()`: `_wider()` adds more columns and `_longer()` adds more rows.
|
||
If your situation isn't as clear cut as these cases, you'll still need to use one of `unnest_longer()` and `unnest_wider()`; you'll just need to do a bit more thinking and experimentation to figure out which one is best.
|
||
|
||
The main difference between these simple examples and real data is that there's only one level of nesting here.
|
||
In real-life, there will often be many, and you'll need to use multiple calls to `unnest_wider()` and `unnest_longer()` to handle it.
|
||
|
||
### `unnest_wider()`
|
||
|
||
When each row has the same number of elements with the same names, like `df1`, it's natural to put each component into its own column with `unnest_wider()`:
|
||
|
||
```{r}
|
||
df1 |>
|
||
unnest_wider(y)
|
||
```
|
||
|
||
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the original column with the new column.
|
||
As you'll learn in the next section, this is useful for disambiguating repeated names.
|
||
|
||
```{r}
|
||
df1 |>
|
||
unnest_wider(y, names_sep = "_")
|
||
```
|
||
|
||
We can also use `unnest_wider()` with unnamed list-columns, as in `df2`.
|
||
It's not as naturally well suited, because it's not clear what the columns should be named.
|
||
So `unnest_wider()` gives them numbers:
|
||
|
||
```{r}
|
||
df2 |>
|
||
unnest_wider(y, names_sep = "_")
|
||
```
|
||
|
||
You'll notice that `unnested_wider()`, much like `pivot_wider()`, turns implicit missing values in to explicit missing values.
|
||
Another challenge is that if you're working with live data, you won't know exactly how many columns you'll end up with.
|
||
|
||
### `unnest_longer()`
|
||
|
||
When each row contains an unnamed list, it's most natural to put each element into its own row with `unnest_longer()`:
|
||
|
||
```{r}
|
||
df2 |>
|
||
unnest_longer(y)
|
||
```
|
||
|
||
You can also apply the same operation to named list-columns, like `df1$y`:
|
||
|
||
```{r}
|
||
df1 |>
|
||
unnest_longer(y)
|
||
```
|
||
|
||
Note the new `y_id` column.
|
||
Because the elements are named, and those names might be useful data, tidyr keeps them in the result data in a new column with the `_id` suffix.
|
||
You can suppress this with `indices_include = FALSE`.
|
||
|
||
You might also use `indices_include = TRUE` if the position of the elements is important in the unnamed case:
|
||
|
||
```{r}
|
||
df2 |>
|
||
unnest_longer(y, indices_include = TRUE)
|
||
```
|
||
|
||
### Other functions
|
||
|
||
There are few other useful rectangling functions that we're not going to talk about here:
|
||
|
||
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()`based on the structure of the list-column. It's a great for rapid exploration, but I think it's ultimately a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
|
||
- `unnest()` modifies rows and columns simultaneously. It's useful when you have a list-column that contains a 2d structure like a data frame (which we often call a nested data frame), which we don't otherwise use in this book.
|
||
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so you should read up on it if there's just a couple of important variables that you want to pull out, embedded in a bunch of data that you don't care about.
|
||
|
||
### Exercises
|
||
|
||
1. From time-to-time you encounter data frames with multiple list-columns with aligned values.
|
||
For example, in the following data frame, the values of `y` and `z` are aligned (i.e. `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`).
|
||
What happens if you apply two `unnest_longer()` calls to this data frame?
|
||
How can you preserve the relationship between `x` and `y`?
|
||
(Hint: carefully read the docs).
|
||
|
||
```{r}
|
||
df4 <- tribble(
|
||
~x, ~y, ~z,
|
||
"a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
|
||
"b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
|
||
)
|
||
```
|
||
|
||
## Case studies
|
||
|
||
Now that you understand the basics of `unnest_wider()` and `unnest_longer()` lets use them to tackle some real rectangling challenges.
|
||
These challenges share the common feature that they're mostly just a sequence of multiple `unnest_wider()` and/or `unnest_longer()` calls, with a little dash of dplyr where needed.
|
||
See `vignette("rectangling", package = "tidyr")` for more.
|
||
|
||
### Very wide data
|
||
|
||
We'll start with `gh_repos` --- this is some data about GitHub repositories retrived from GitHub API. It's a very deeply nested list so it's hard for me to display in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
|
||
To make it more manageable I'm going to put it in a tibble in a column called `json` (for reasons we'll get to later)
|
||
|
||
```{r}
|
||
repos <- tibble(json = gh_repos)
|
||
repos
|
||
```
|
||
|
||
There are row rows, and each row contains a unnamed list with either 26 or 30 rows.
|
||
Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row:
|
||
|
||
```{r}
|
||
repos |>
|
||
unnest_longer(json)
|
||
```
|
||
|
||
At first glance, it might seem like we haven't improved the situation --- while we have more rows now (176 instead of 6) it seems like each element of `json` is still a list.
|
||
However, there's an important difference: now each element is a **named** list so we can use `unnamed_wider()` to put each element into its own column:
|
||
|
||
```{r}
|
||
repos |>
|
||
unnest_longer(json) |>
|
||
unnest_wider(json)
|
||
```
|
||
|
||
This is a bit overwhelming --- there are so many columns that tibble doesn't even print all of them!
|
||
We can see them all with `names()`:
|
||
|
||
```{r}
|
||
repos |>
|
||
unnest_longer(json) |>
|
||
unnest_wider(json) |>
|
||
names()
|
||
```
|
||
|
||
Let's select a few that look interesting:
|
||
|
||
```{r}
|
||
repos |>
|
||
unnest_longer(json) |>
|
||
unnest_wider(json) |>
|
||
select(id, full_name, owner, description)
|
||
```
|
||
|
||
`owner` is another list-column, and since it contains named list, we can use `unnest_wider()` to get at the values:
|
||
|
||
```{r, error = TRUE}
|
||
repos |>
|
||
unnest_longer(json) |>
|
||
unnest_wider(json) |>
|
||
select(id, full_name, owner, description) |>
|
||
unnest_wider(owner)
|
||
```
|
||
|
||
Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
|
||
Rather than following the advice to use `names_repair` (which would also work), I'll instead use `names_sep`:
|
||
|
||
```{r}
|
||
repos |>
|
||
unnest_longer(json) |>
|
||
unnest_wider(json) |>
|
||
select(id, full_name, owner, description) |>
|
||
unnest_wider(owner, names_sep = "_")
|
||
```
|
||
|
||
### Relational data
|
||
|
||
When you get nested data, it's not uncommon for it to contain data that we'd normally spread out into multiple data frames.
|
||
Take `got_chars`
|
||
|
||
```{r}
|
||
chars <- tibble(json = got_chars)
|
||
chars
|
||
```
|
||
|
||
The `json` column contains named values, so we'll start by widening it:
|
||
|
||
```{r}
|
||
chars |>
|
||
unnest_wider(json)
|
||
```
|
||
|
||
And selecting a few columns just to make it easier to read:
|
||
|
||
```{r}
|
||
characters <- chars |>
|
||
unnest_wider(json) |>
|
||
select(id, name, gender, culture, born, died, alive)
|
||
characters
|
||
```
|
||
|
||
There are also many list-columns:
|
||
|
||
```{r}
|
||
chars |>
|
||
unnest_wider(json) |>
|
||
select(id, where(is.list))
|
||
```
|
||
|
||
Lets explore a couple, starting with `titles`:
|
||
|
||
```{r}
|
||
chars |>
|
||
unnest_wider(json) |>
|
||
select(id, titles) |>
|
||
unnest_longer(titles)
|
||
```
|
||
|
||
You might expect to see this in its own table:
|
||
|
||
```{r}
|
||
titles <- chars |>
|
||
unnest_wider(json) |>
|
||
select(id, titles) |>
|
||
unnest_longer(titles) |>
|
||
filter(titles != "") |>
|
||
rename(title = titles)
|
||
titles
|
||
```
|
||
|
||
Because you could then join it on as needed.
|
||
For example, we find all the characters that are captains:
|
||
|
||
```{r}
|
||
captains <- titles |> filter(str_detect(title, "Captain"))
|
||
captains
|
||
|
||
characters |>
|
||
semi_join(captains)
|
||
```
|
||
|
||
You could imagine creating a table like this for each of the list-columns, and then using joins to combine when needed.
|
||
|
||
### Text analysis
|
||
|
||
```{r}
|
||
titles |>
|
||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||
unnest_longer(word) |>
|
||
count(word, sort = TRUE)
|
||
```
|
||
|
||
The tidytext package uses this idea.
|
||
Learn more at <https://www.tidytextmining.com>.
|
||
|
||
### Deeply nested
|
||
|
||
We'll finish off with an that is very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
|
||
This is a two column tibble containing five cities names and the results of using Google's [geocoding API](https://developers.google.com/maps/documentation/geocoding) to determine their location:
|
||
|
||
```{r}
|
||
gmaps_cities
|
||
```
|
||
|
||
`json` is list-column with internal names, so we start with an `unnest_wider()`:
|
||
|
||
```{r}
|
||
gmaps_cities |>
|
||
unnest_wider(json)
|
||
```
|
||
|
||
This gives us a status column and the actual results.
|
||
We'll drop the status column since they're all `OK`.
|
||
In a real analysis, you'd also want separately capture all the rows where `status != "OK"` so you could figure out what went wrong.
|
||
`results` is an unnamed list, with either one or two elements.
|
||
We'll figure to out why shortly.
|
||
|
||
```{r}
|
||
gmaps_cities |>
|
||
unnest_wider(json) |>
|
||
select(-status) |>
|
||
unnest_longer(results)
|
||
```
|
||
|
||
Now results is a named list, so we'll `unnest_wider()`:
|
||
|
||
```{r}
|
||
locations <- gmaps_cities |>
|
||
unnest_wider(json) |>
|
||
select(-status) |>
|
||
unnest_longer(results) |>
|
||
unnest_wider(results)
|
||
locations
|
||
```
|
||
|
||
Now we can see why Washington and Arlington got two results: Washington matched both the state and the city (DC), and Arlington matched Arlington Virginia and Arlington Texas.
|
||
|
||
There are few different places we could go from here.
|
||
We might want to determine the exact location of the match stored in the `geometry` list-column:
|
||
|
||
```{r}
|
||
locations |>
|
||
select(city, formatted_address, geometry) |>
|
||
unnest_wider(geometry)
|
||
```
|
||
|
||
That gives us new `bounds` (which gives a rectangular region) and the midpoint in `location`, which we can unnest to get latitude (`lat`) and longitude (`lng`):
|
||
|
||
```{r}
|
||
locations |>
|
||
select(city, formatted_address, geometry) |>
|
||
unnest_wider(geometry) |>
|
||
unnest_wider(location)
|
||
```
|
||
|
||
Extracting the bounds requires a few more steps
|
||
|
||
```{r}
|
||
locations |>
|
||
select(city, formatted_address, geometry) |>
|
||
unnest_wider(geometry) |>
|
||
# focus on the variables of interest
|
||
select(!location:viewport) |>
|
||
unnest_wider(bounds)
|
||
```
|
||
|
||
I then rename `southwest` and `northeast` (the corners of the rectangle) so I can use `names_sep` to create short but evocative names:
|
||
|
||
```{r}
|
||
locations |>
|
||
select(city, formatted_address, geometry) |>
|
||
unnest_wider(geometry) |>
|
||
select(!location:viewport) |>
|
||
unnest_wider(bounds) |>
|
||
rename(ne = northeast, sw = southwest) |>
|
||
unnest_wider(c(ne, sw), names_sep = "_")
|
||
```
|
||
|
||
Note that I take advantage of the fact that you can unnest multiple columns at a time by supplying a vector of variable names to `unnest_wider()`.
|
||
|
||
This one place where `hoist()`, which we mentioned briefly above can be useful.
|
||
Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
|
||
|
||
```{r}
|
||
locations |>
|
||
select(city, formatted_address, geometry) |>
|
||
hoist(
|
||
geometry,
|
||
ne_lat = c("bounds", "northeast", "lat"),
|
||
sw_lat = c("bounds", "southwest", "lat"),
|
||
ne_lng = c("bounds", "northeast", "lng"),
|
||
sw_lng = c("bounds", "southwest", "lng"),
|
||
)
|
||
```
|
||
|
||
### Exercises
|
||
|
||
1. The `owner` column of `gh_repo` contains a lot of duplicated information because each owner can have many repos. Can you construct a `owners` data frame that contains one row for each owner? (Hint: does `distinct()` work with `list-cols`?)
|
||
|
||
2. Explain the following code. Why is it interesting? Why does it work for this dataset but might not work in general?
|
||
|
||
```{r}
|
||
tibble(json = got_chars) |>
|
||
unnest_wider(json) |>
|
||
select(id, where(is.list)) %>%
|
||
pivot_longer(where(is.list), names_to = "media", values_to = "value") %>%
|
||
unnest_longer(value)
|
||
```
|
||
|
||
## JSON
|
||
|
||
All of the case studies in the previous section came originally as JSON, one of the most common sources of hierarchical data.
|
||
In this section, you'll learn more about JSON and some common problems you might have.
|
||
JSON, short for javascript object notation, is a data format that grew out of the javascript programming language and has become an extremely common way of representing data.
|
||
|
||
``` json
|
||
{
|
||
"name1": "value1",
|
||
"name2": "value2"
|
||
}
|
||
```
|
||
|
||
Which in R you might represent as:
|
||
|
||
```{r}
|
||
list(
|
||
name1 = "value1",
|
||
name2 = "value2"
|
||
)
|
||
```
|
||
|
||
There are five types of things that JSON can represent
|
||
|
||
``` json
|
||
{
|
||
"strings": "are surrounded by double doubles",
|
||
"numbers": 123456,
|
||
"boolean": [false, true],
|
||
"arrays": [1, 2, 3, 4, 5],
|
||
"objects": {
|
||
"name1": "value1",
|
||
"name2": "value2"
|
||
},
|
||
"null": null
|
||
}
|
||
```
|
||
|
||
You'll notice that these types don't embrace many of the types you've learned earlier in the book like factors, dates, date-times, and tibbles.
|
||
This is important and we'll come back to it later.
|
||
|
||
Most of the time you won't deal with JSON directly, instead you'll use the jsonlite package, by Jeroen Oooms, to load it into R as a nested list.
|
||
|
||
### Data frames
|
||
|
||
JSON doesn't have any 2-dimension data structures, so how would you represent a data frame?
|
||
|
||
```{r}
|
||
df <- tribble(
|
||
~x, ~y,
|
||
"a", 10,
|
||
"x", 3
|
||
)
|
||
```
|
||
|
||
There are two ways: you can either make an struct of arrays, or an array of structs.
|
||
|
||
``` json
|
||
{
|
||
"x": ["a", "x"],
|
||
"y": [10, 3]
|
||
}
|
||
```
|
||
|
||
``` {.json .josn}
|
||
[
|
||
{"x": "a", "y": 10},
|
||
{"x": "x", "y": 3}
|
||
]
|
||
```
|