783 lines
28 KiB
Plaintext
783 lines
28 KiB
Plaintext
# Data rectangling {#sec-rectangle-data}
|
|
|
|
```{r}
|
|
#| results: "asis"
|
|
#| echo: false
|
|
source("_common.R")
|
|
status("polishing")
|
|
```
|
|
|
|
## Introduction
|
|
|
|
In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frames made up of rows and columns.
|
|
This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API.
|
|
|
|
To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R.
|
|
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children in rows, and `tidyr::unnest_wider()`, which converts children into columns.
|
|
We'll then show you a few case studies, applying these simple function multiple times to solve real problems.
|
|
We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.
|
|
|
|
### Prerequisites
|
|
|
|
In this chapter we'll use many functions from tidyr, a core member of the tidyverse.
|
|
We'll also use repurrrsive to provide some interesting datasets rectangling practice, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists.
|
|
|
|
```{r}
|
|
#| label: setup
|
|
#| message: false
|
|
|
|
library(tidyverse)
|
|
library(repurrrsive)
|
|
library(jsonlite)
|
|
```
|
|
|
|
## Lists
|
|
|
|
So far we've used simple vectors like integers, numbers, characters, date-times, and factors.
|
|
These vectors are simple because they're homogeneous: every element is same type.
|
|
If you want to store element of different types in the same vector, you'll need a **list**, which you create with `list()`:
|
|
|
|
```{r}
|
|
x1 <- list(1:4, "a", TRUE)
|
|
x1
|
|
```
|
|
|
|
It's often convenient to name the components, or **children**, of a list, which you can do in the same way as naming the columns of a tibble:
|
|
|
|
```{r}
|
|
x2 <- list(a = 1:2, b = 1:3, c = 1:4)
|
|
x2
|
|
```
|
|
|
|
Even for these very simple lists, printing takes up quite a lot of space.
|
|
A useful alternative is `str()`, which generates a compact display of the **str**ucture, de-emphasizing the contents:
|
|
|
|
```{r}
|
|
str(x1)
|
|
str(x2)
|
|
```
|
|
|
|
As you can see, `str()` displays each child of the list on its own line.
|
|
It displays the name, if present, then an abbreviation of the type, then the first few values.
|
|
|
|
### Hierarchy
|
|
|
|
Lists can contain any type of object, including other lists.
|
|
This makes them suitable for representing hierarchical (tree-like) structures:
|
|
|
|
```{r}
|
|
x3 <- list(list(1, 2), list(3, 4))
|
|
str(x3)
|
|
```
|
|
|
|
This is notably different to `c()`, which generates a flat vector:
|
|
|
|
```{r}
|
|
c(c(1, 2), c(3, 4))
|
|
|
|
x4 <- c(list(1, 2), list(3, 4))
|
|
str(x4)
|
|
```
|
|
|
|
As lists get more complex, `str()` gets more useful, as it lets you see the hierarchy at a glance:
|
|
|
|
```{r}
|
|
x5 <- list(1, list(2, list(3, list(4, list(5)))))
|
|
str(x5)
|
|
```
|
|
|
|
As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1].
|
|
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
|
|
|
|
[^rectangling-1]: This is an RStudio feature.
|
|
|
|
```{r}
|
|
#| label: fig-view-collapsed
|
|
#| fig.cap: >
|
|
#| The RStudio allows you to interactively explore a complex list.
|
|
#| The viewer opens showing only the top level of the list.
|
|
#| echo: false
|
|
#| out-width: NULL
|
|
knitr::include_graphics("screenshots/View-1.png", dpi = 220)
|
|
```
|
|
|
|
```{r}
|
|
#| label: fig-view-expand-1
|
|
#| fig.cap: >
|
|
#| Clicking on the rightward facing triangle expands that component
|
|
#| of the list so that you can also see its children.
|
|
#| echo: false
|
|
#| out-width: NULL
|
|
knitr::include_graphics("screenshots/View-2.png", dpi = 220)
|
|
```
|
|
|
|
```{r}
|
|
#| label: fig-view-expand-2
|
|
#| fig.cap: >
|
|
#| You can repeat this operation as many times as needed to get to the
|
|
#| data you're interested in. Note the bottom-right corner: if you click
|
|
#| an element of the list, RStudio will give you the subsetting code
|
|
#| needed to access it, in this case `x4[[2]][[2]][[2]]`.
|
|
#| echo: false
|
|
#| out-width: NULL
|
|
knitr::include_graphics("screenshots/View-3.png", dpi = 220)
|
|
```
|
|
|
|
### List-columns
|
|
|
|
Lists can also live inside a tibble, where we call them list-columns.
|
|
List-columns are useful because they allow you to shoehorn in objects that wouldn't wouldn't usually belong in a tibble.
|
|
In particular, list-columns are are used a lot in the [tidymodels](https://www.tidymodels.org) ecosystem, because they allows you to store things like models or resamples in a data frame.
|
|
|
|
Here's a simple example of a list-column:
|
|
|
|
```{r}
|
|
df <- tibble(
|
|
x = 1:2,
|
|
y = c("a", "b"),
|
|
z = list(list(1, 2), list(3, 4, 5))
|
|
)
|
|
df
|
|
```
|
|
|
|
There's nothing special about lists in a tibble; they behave like any other column:
|
|
|
|
```{r}
|
|
df |>
|
|
filter(x == 1)
|
|
```
|
|
|
|
Computing with list-columns is harder, but that's because computing with lists is harder in general; we'll come back to that in @sec-iteration.
|
|
In this chapter, we'll focus on unnesting list-columns out into regular variables so you can use your existing tools on them.
|
|
|
|
The default print method just displays a rough summary of the contents.
|
|
The list column could be arbitrarily complex, so there's no good way to print it.
|
|
If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you learned above:
|
|
|
|
```{r}
|
|
df |>
|
|
filter(x == 1) |>
|
|
pull(z) |>
|
|
str()
|
|
```
|
|
|
|
Similarly, if you `View()` a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns.
|
|
To explore those fields you'll need to `pull()` and view, e.g. `df |> pull(z) |> View()`.
|
|
|
|
::: callout-note
|
|
## Base R
|
|
|
|
It's possible to put a list in a column of a `data.frame`, but it's a lot fiddlier because `data.frame()` treats a list as a list of columns:
|
|
|
|
```{r}
|
|
data.frame(x = list(1:3, 3:5))
|
|
```
|
|
|
|
You can force `data.frame()` to treat a list as a list of rows by wrapping it in list `I()`, but the result doesn't print particularly usefully:
|
|
|
|
```{r}
|
|
data.frame(
|
|
x = I(list(1:3, 3:5)),
|
|
y = c("1, 2", "3, 4, 5")
|
|
)
|
|
```
|
|
|
|
It's easier to use list-columns with tibbles because `tibble()` treats lists like either vectors and the print method has been designed with lists in mind.
|
|
:::
|
|
|
|
## Unnesting
|
|
|
|
Now that you've learned the basics of lists and list-columns, lets explore how you can turn them back into regular rows and columns.
|
|
We'll start with very simple sample data so you can get the basic idea, and then switch to more realistic examples in the next section.
|
|
|
|
List-columns tend to come in two basic forms: named and unnamed.
|
|
When the children are **named**, they tend to have the same names in every row.
|
|
When the children are **unnamed**, the number of elements tends to vary from row-to-row.
|
|
The following code creates an example of each.
|
|
In `df1`, every element of list-column `y` has two elements named `a` and `b`.
|
|
If `df2`, the elements of list-column `y` are unnamed and vary in length.
|
|
|
|
```{r}
|
|
df1 <- tribble(
|
|
~x, ~y,
|
|
1, list(a = 11, b = 12),
|
|
2, list(a = 21, b = 22),
|
|
3, list(a = 31, b = 32),
|
|
)
|
|
|
|
df2 <- tribble(
|
|
~x, ~y,
|
|
1, list(11, 12, 13),
|
|
2, list(21),
|
|
3, list(31, 32),
|
|
)
|
|
```
|
|
|
|
Named list-columns naturally unnest into columns: each named element becomes a new named column.
|
|
Unnamed list-columns naturally unnested in to rows: you'll get one row for each child.
|
|
tidyr provides two functions for these two case: `unnest_wider()` and `unnest_longer()`.
|
|
The following sections explain how they work.
|
|
|
|
### `unnest_wider()`
|
|
|
|
When each row has the same number of elements with the same names, like `df1`, it's natural to put each component into its own column with `unnest_wider()`:
|
|
|
|
```{r}
|
|
df1 |>
|
|
unnest_wider(y)
|
|
```
|
|
|
|
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
|
|
This is useful for disambiguating repeated names.
|
|
|
|
```{r}
|
|
df1 |>
|
|
unnest_wider(y, names_sep = "_")
|
|
```
|
|
|
|
We can also use `unnest_wider()` with unnamed list-columns, as in `df2`.
|
|
Since columns require names but the list lacks them, `unnest_wider()` will label them with consecutive integers:
|
|
|
|
```{r}
|
|
df2 |>
|
|
unnest_wider(y, names_sep = "_")
|
|
```
|
|
|
|
You'll notice that `unnested_wider()`, much like `pivot_wider()`, turns implicit missing values in to explicit missing values.
|
|
|
|
### `unnest_longer()`
|
|
|
|
When each row contains an unnamed list, it's most natural to put each element into its own row with `unnest_longer()`:
|
|
|
|
```{r}
|
|
df2 |>
|
|
unnest_longer(y)
|
|
```
|
|
|
|
Note how `x` is duplicated for each element inside of `y`: we get one row of output for each element inside the list-column.
|
|
But what happens if the list-column is empty, as in the following example?
|
|
|
|
```{r}
|
|
df6 <- tribble(
|
|
~x, ~y,
|
|
"a", list(1, 2),
|
|
"b", list(3),
|
|
"c", list()
|
|
)
|
|
df6 |> unnest_longer(y)
|
|
```
|
|
|
|
We get zero rows in the output, so the row effectively disappears.
|
|
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
|
|
|
|
You can also unnest named list-columns, like `df1$y` into the rows.
|
|
Because the elements are named, and those names might be useful data, puts them in a new column with the suffix`_id`:
|
|
|
|
```{r}
|
|
df1 |>
|
|
unnest_longer(y)
|
|
```
|
|
|
|
If you don't want these `ids`, you can suppress this with `indices_include = FALSE`.
|
|
On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
|
|
You can do this with `indices_include = TRUE`:
|
|
|
|
```{r}
|
|
df2 |>
|
|
unnest_longer(y, indices_include = TRUE)
|
|
```
|
|
|
|
### Inconsistent types
|
|
|
|
What happens if you unnest a list-column contains different types of vector?
|
|
For example, take the following dataset where the list-column `y` contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.
|
|
|
|
```{r}
|
|
df4 <- tribble(
|
|
~x, ~y,
|
|
"a", list(1, "a"),
|
|
"b", list(TRUE, factor("a"), 5)
|
|
)
|
|
```
|
|
|
|
`unnest_longer()` always keeps the set of columns change, while changing the number of rows.
|
|
So what happens?
|
|
How does `unnest_longer()` produce five rows while keeping everything in `y`?
|
|
|
|
```{r}
|
|
df4 |>
|
|
unnest_longer(y)
|
|
```
|
|
|
|
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
|
|
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
|
|
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is a still a list, and each component of that list contains something different.
|
|
|
|
What happens if you find this problem in a dataset you're trying to rectangle?
|
|
I think there are two basic options.
|
|
You could use the `transform` argument to coerce all inputs to a common type.
|
|
It's not particularly useful here because there's only really one class that these five class can be converted to: character.
|
|
|
|
```{r}
|
|
df4 |>
|
|
unnest_longer(y, transform = as.character)
|
|
```
|
|
|
|
Another option would be to filter down to the rows that have values of a specific type:
|
|
|
|
```{r}
|
|
df4 |>
|
|
unnest_longer(y) |>
|
|
rowwise() |>
|
|
filter(is.numeric(y))
|
|
```
|
|
|
|
Then you can call `unnest_longer()` once more:
|
|
|
|
```{r}
|
|
df4 |>
|
|
unnest_longer(y) |>
|
|
rowwise() |>
|
|
filter(is.numeric(y)) |>
|
|
unnest_longer(y)
|
|
```
|
|
|
|
### Other functions
|
|
|
|
tidyr has a few other useful rectangling functions that we're not going to cover in this book:
|
|
|
|
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's a great for rapid exploration, but I think it's ultimately a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
|
|
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
|
|
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
|
|
|
|
These are good to know about when you're other people's code and for tackling rarer rectangling challenges.
|
|
|
|
### Exercises
|
|
|
|
1. From time-to-time you encounter data frames with multiple list-columns with aligned values.
|
|
For example, in the following data frame, the values of `y` and `z` are aligned (i.e. `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`).
|
|
What happens if you apply two `unnest_longer()` calls to this data frame?
|
|
How can you preserve the relationship between `x` and `y`?
|
|
(Hint: carefully read the docs).
|
|
|
|
```{r}
|
|
df4 <- tribble(
|
|
~x, ~y, ~z,
|
|
"a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
|
|
"b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
|
|
)
|
|
```
|
|
|
|
## Case studies
|
|
|
|
So far you've learned about the simplest case of list-columns, where rectangling only requires a single call to `unnest_longer()` or `unnest_wider()`.
|
|
The main difference between real data and these simple examples is that real data typically containsmultiple levels of nesting that requires multiple calls to `unnest_longer()` and `unnest_wider()`.
|
|
This section will work through four real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
|
|
|
|
### Very wide data
|
|
|
|
We'll start by exploring `gh_repos`.
|
|
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
|
|
|
|
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
|
|
I call the column `json` for reasons we'll get to later.
|
|
|
|
```{r}
|
|
repos <- tibble(json = gh_repos)
|
|
repos
|
|
```
|
|
|
|
This tibble contains 6 rows, one row for each child of `gh_repos`.
|
|
Each row contains a unnamed list with either 26 or 30 rows.
|
|
Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row:
|
|
|
|
```{r}
|
|
repos |>
|
|
unnest_longer(json)
|
|
```
|
|
|
|
At first glance, it might seem like we haven't improved the situation: while we have more rows (176 instead of 6) each element of `json` is still a list.
|
|
However, there's an important difference: now each element is a **named** list so we can use `unnamed_wider()` to put each element into its own column:
|
|
|
|
```{r}
|
|
repos |>
|
|
unnest_longer(json) |>
|
|
unnest_wider(json)
|
|
```
|
|
|
|
This has worked but the result is a little overwhelming: there are so many columns that tibble doesn't even print all of them!
|
|
We can see them all with `names()`:
|
|
|
|
```{r}
|
|
repos |>
|
|
unnest_longer(json) |>
|
|
unnest_wider(json) |>
|
|
names()
|
|
```
|
|
|
|
Let's select a few that look interesting:
|
|
|
|
```{r}
|
|
repos |>
|
|
unnest_longer(json) |>
|
|
unnest_wider(json) |>
|
|
select(id, full_name, owner, description)
|
|
```
|
|
|
|
You can use this to work back to understand how `gh_repos` was strucured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.
|
|
|
|
`owner` is another list-column, and since it a contains a named list, we can use `unnest_wider()` to get at the values:
|
|
|
|
```{r}
|
|
#| error: true
|
|
repos |>
|
|
unnest_longer(json) |>
|
|
unnest_wider(json) |>
|
|
select(id, full_name, owner, description) |>
|
|
unnest_wider(owner)
|
|
```
|
|
|
|
Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
|
|
Rather than following the advice to use `names_repair` (which would also work), I'll instead use `names_sep`:
|
|
|
|
```{r}
|
|
repos |>
|
|
unnest_longer(json) |>
|
|
unnest_wider(json) |>
|
|
select(id, full_name, owner, description) |>
|
|
unnest_wider(owner, names_sep = "_")
|
|
```
|
|
|
|
This gives another wide dataset, but you can see that `owner` appears to contain a lot of additional data about the person who "owns" the repository.
|
|
|
|
### Relational data
|
|
|
|
Nested data is sometimes used to represent data that we'd usually spread out into multiple data frames.
|
|
For example, take `got_chars`.
|
|
Like `gh_repos` it's a list, so we start by turning it into a list-column of a tibble:
|
|
|
|
```{r}
|
|
chars <- tibble(json = got_chars)
|
|
chars
|
|
```
|
|
|
|
The `json` column contains named values, so we'll start by widening it:
|
|
|
|
```{r}
|
|
chars |>
|
|
unnest_wider(json)
|
|
```
|
|
|
|
And selecting a few columns just to make it easier to read:
|
|
|
|
```{r}
|
|
characters <- chars |>
|
|
unnest_wider(json) |>
|
|
select(id, name, gender, culture, born, died, alive)
|
|
characters
|
|
```
|
|
|
|
There are also many list-columns:
|
|
|
|
```{r}
|
|
chars |>
|
|
unnest_wider(json) |>
|
|
select(id, where(is.list))
|
|
```
|
|
|
|
Lets explore the `titles` column.
|
|
It's an unnamed list-column, so we'll unnest it into rows:
|
|
|
|
```{r}
|
|
chars |>
|
|
unnest_wider(json) |>
|
|
select(id, titles) |>
|
|
unnest_longer(titles)
|
|
```
|
|
|
|
You might expect to see this data in its own table because it would be easy to join to the characters data as needed.
|
|
To do so, we'll do a little cleaning: removing the rows containing empty strings and renaming `titles` to `title` since each row now only contains a single title.
|
|
|
|
```{r}
|
|
titles <- chars |>
|
|
unnest_wider(json) |>
|
|
select(id, titles) |>
|
|
unnest_longer(titles) |>
|
|
filter(titles != "") |>
|
|
rename(title = titles)
|
|
titles
|
|
```
|
|
|
|
Now, for example, we could use this table to all the characters that are captains and see all their titles:
|
|
|
|
```{r}
|
|
captains <- titles |> filter(str_detect(title, "Captain"))
|
|
captains
|
|
|
|
characters |>
|
|
semi_join(captains, by = "id") |>
|
|
select(id, name) |>
|
|
left_join(titles, by = "id", multiple = "all")
|
|
```
|
|
|
|
You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
|
|
|
|
### A dash of text analysis
|
|
|
|
What if we wanted to find the most common words in the title?
|
|
One simple approach starts by using `str_split()` to break each element of `title` up into words by spitting on `" "`:
|
|
|
|
```{r}
|
|
titles |>
|
|
mutate(word = str_split(title, " "), .keep = "unused")
|
|
```
|
|
|
|
This creates a unnamed variable length list-column, so we can use `unnest_longer()`:
|
|
|
|
```{r}
|
|
titles |>
|
|
mutate(word = str_split(title, " "), .keep = "unused") |>
|
|
unnest_longer(word)
|
|
```
|
|
|
|
And then we can count that column to find the most common:
|
|
|
|
```{r}
|
|
titles |>
|
|
mutate(word = str_split(title, " "), .keep = "unused") |>
|
|
unnest_longer(word) |>
|
|
count(word, sort = TRUE)
|
|
```
|
|
|
|
Some of those words are not very interesting so we could create a list of common words to drop.
|
|
In text analysis these is commonly called stop words.
|
|
|
|
```{r}
|
|
stop_words <- tibble(word = c("of", "the"))
|
|
|
|
titles |>
|
|
mutate(word = str_split(title, " "), .keep = "unused") |>
|
|
unnest_longer(word) |>
|
|
anti_join(stop_words) |>
|
|
count(word, sort = TRUE)
|
|
```
|
|
|
|
Breaking up text into individual fragments is a powerful idea that underlies much of text analysis.
|
|
If this sounds interesting, a good place to learn more is [Text Mining with R](https://www.tidytextmining.com) by Julia Silge and David Robinson.
|
|
|
|
### Deeply nested
|
|
|
|
We'll finish off these case studies with a list-column that's very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
|
|
This is a two column tibble containing five city names and the results of using Google's [geocoding API](https://developers.google.com/maps/documentation/geocoding) to determine their location:
|
|
|
|
```{r}
|
|
gmaps_cities
|
|
```
|
|
|
|
`json` is a list-column with internal names, so we start with an `unnest_wider()`:
|
|
|
|
```{r}
|
|
gmaps_cities |>
|
|
unnest_wider(json)
|
|
```
|
|
|
|
This gives us the `status` and the `results`.
|
|
We'll drop the status column since they're all `OK`; in a real analysis, you'd also want capture all the rows where `status != "OK"` and figure out what went wrong.
|
|
`results` is an unnamed list, with either one or two elements (we'll see why shortly) so we'll unnest it into rows:
|
|
|
|
```{r}
|
|
gmaps_cities |>
|
|
unnest_wider(json) |>
|
|
select(-status) |>
|
|
unnest_longer(results)
|
|
```
|
|
|
|
Now `results` is a named list, so we'll use `unnest_wider()`:
|
|
|
|
```{r}
|
|
locations <- gmaps_cities |>
|
|
unnest_wider(json) |>
|
|
select(-status) |>
|
|
unnest_longer(results) |>
|
|
unnest_wider(results)
|
|
locations
|
|
```
|
|
|
|
Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.
|
|
|
|
There are few different places we could go from here.
|
|
We might want to determine the exact location of the match, which is stored in the `geometry` list-column:
|
|
|
|
```{r}
|
|
locations |>
|
|
select(city, formatted_address, geometry) |>
|
|
unnest_wider(geometry)
|
|
```
|
|
|
|
That gives us new `bounds` (a rectangular region) and `location` (a point).
|
|
We can unnest `location` to see the latitude (`lat`) and longitude (`lng`):
|
|
|
|
```{r}
|
|
locations |>
|
|
select(city, formatted_address, geometry) |>
|
|
unnest_wider(geometry) |>
|
|
unnest_wider(location)
|
|
```
|
|
|
|
Extracting the bounds requires a few more steps
|
|
|
|
```{r}
|
|
locations |>
|
|
select(city, formatted_address, geometry) |>
|
|
unnest_wider(geometry) |>
|
|
# focus on the variables of interest
|
|
select(!location:viewport) |>
|
|
unnest_wider(bounds)
|
|
```
|
|
|
|
I then rename `southwest` and `northeast` (the corners of the rectangle) so I can use `names_sep` to create short but evocative names:
|
|
|
|
```{r}
|
|
locations |>
|
|
select(city, formatted_address, geometry) |>
|
|
unnest_wider(geometry) |>
|
|
select(!location:viewport) |>
|
|
unnest_wider(bounds) |>
|
|
rename(ne = northeast, sw = southwest) |>
|
|
unnest_wider(c(ne, sw), names_sep = "_")
|
|
```
|
|
|
|
Note how we unnest two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.
|
|
|
|
This somewhere that `hoist()`, mentioned briefly above, can be useful.
|
|
Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
|
|
|
|
```{r}
|
|
locations |>
|
|
select(city, formatted_address, geometry) |>
|
|
hoist(
|
|
geometry,
|
|
ne_lat = c("bounds", "northeast", "lat"),
|
|
sw_lat = c("bounds", "southwest", "lat"),
|
|
ne_lng = c("bounds", "northeast", "lng"),
|
|
sw_lng = c("bounds", "southwest", "lng"),
|
|
)
|
|
```
|
|
|
|
If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in `vignette("rectangling", package = "tidyr")`.
|
|
|
|
### Exercises
|
|
|
|
1. Roughly estimate when `gh_repos` was created.
|
|
Why can you only roughly estimate the date?
|
|
|
|
2. The `owner` column of `gh_repo` contains a lot of duplicated information because each owner can have many repos.
|
|
Can you construct a `owners` data frame that contains one row for each owner?
|
|
(Hint: does `distinct()` work with `list-cols`?)
|
|
|
|
3. Explain the following code line-by-line.
|
|
Why is it interesting?
|
|
Why does it work for `got_chars` but might not work in general?
|
|
|
|
```{r}
|
|
tibble(json = got_chars) |>
|
|
unnest_wider(json) |>
|
|
select(id, where(is.list)) %>%
|
|
pivot_longer(
|
|
where(is.list),
|
|
names_to = "name",
|
|
values_to = "value"
|
|
) %>%
|
|
unnest_longer(value)
|
|
```
|
|
|
|
4. In `gmaps_cities`, what does `address_components` contain?
|
|
Why does the length vary between rows?
|
|
Unnest it appropriately to figure it out.
|
|
(Hint: `types` always appears to contain two elements. Does `unnest_wider()` make it easier to work with than `unnest_longer()`?)
|
|
.
|
|
|
|
## JSON
|
|
|
|
All of the case studies in the previous section came from data stored in JSON format.
|
|
JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data.
|
|
It's important to understand it because while JSON and R are pretty similar, there isn't a perfect 1-to-1 mapping between JSON and R data types.
|
|
In this section, you'll learn a little more about JSON and how to read it into R; once you've done that you can use the rectangling tools described above to get it into a data frame for further analysis.
|
|
|
|
### Data types
|
|
|
|
JSON is a simple format designed to be easily read and written by machines, not humans.
|
|
It has six key data types.
|
|
Four of them are scalars, which are similar to atomic vectors in R: there's no way to break them down further.
|
|
|
|
- The simplest type is `null`, which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
|
|
- **Strings** are written much like in R, but can only use double quotes, not single quotes.
|
|
- **Numbers** are similar to R's numbers: they can be integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3). JSON doesn't support Inf, -Inf, or NaN.
|
|
- **Booleans**, are similar to R's logical vectors, but use `true` and `false` instead of `TRUE` and `FALSE`.
|
|
|
|
The biggest different between JSON's scalars and atomic vectors is that scalars only represent a single item.
|
|
To create a vector of multiple items you need to use of the two remaining two types, **arrays** and **objects**.
|
|
An array is like an unnamed list in R, and is written with `[]`.
|
|
For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
|
|
Objects are like a named list in R, and are written with `{}`.
|
|
For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
|
|
|
|
You might already be starting to imagine some of the challenges converting JSON to R data structures.
|
|
|
|
### jsonlite
|
|
|
|
Most of the time you won't deal with JSON directly, instead you'll use the jsonlite package, by Jeroen Oooms, to load it into R as a nested list.
|
|
We'll focus on two functions from jsonlite.
|
|
Most of the time you'll use `read_json()` to read a json file from disk, but sometimes you'll also need `parse_json()` which takes json stored in a string in R.
|
|
|
|
```{r}
|
|
str(parse_json('[1, 2, 3]'))
|
|
str(parse_json('{"x": [1, 2, 3]}'))
|
|
```
|
|
|
|
Note that the rectangling approach described above is designed around the most common case where the API returns multiple "things", e.g. multiple pages, or multiple records, or multiple results.
|
|
In this case, you just do `tibble(json)` and each element becomes a row.
|
|
If the JSON returns a single "thing", then you'll need to do `tibble(json = list(json))` so you start with a data frame containing a single row.
|
|
|
|
Note that jsonlite has another important function called `fromJSON()`.
|
|
We don't use it here because it uses `simplifyVector = TRUE` which attempts to automatically unnest the JSON in a data frame.
|
|
This often works well, particularly in simple cases.
|
|
But we think you're better off doing the rectangling yourself so you know exactly what's happening and can more easily handle the most complicated nested structures.
|
|
Doing it yourself also means you'll use the standard tidyverse rules for recycling and vector coercion: there's nothing wrong with jsonlite's rules, but they're different and we don't want to get in to the details here.
|
|
|
|
### Translation challenges
|
|
|
|
There isn't a perfect match between json's data types and R's data types.
|
|
So when reading a json file into R, we have to make some assumptions:
|
|
|
|
```{r}
|
|
str(parse_json('[1, 2, 3]'))
|
|
str(parse_json('[true, false, true]'))
|
|
```
|
|
|
|
JSON doesn't have any way to represent dates or date-times, so they're normally stored as ISO8601 date times in strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
|
|
|
|
### Exercises
|
|
|
|
1. Rectangle the `df_col` and `df_row` below.
|
|
They represent the two ways of encoding a data frame in JSON.
|
|
|
|
```{r}
|
|
json_col <- parse_json('
|
|
{
|
|
"x": ["a", "x", "z"],
|
|
"y": [10, null, 3]
|
|
}
|
|
')
|
|
json_row <- parse_json('
|
|
[
|
|
{"x": "a", "y": 10},
|
|
{"x": "x", "y": null},
|
|
{"x": "z", "y": 3}
|
|
]
|
|
')
|
|
|
|
df_col <- tibble(json = list(json_col))
|
|
df_row <- tibble(json = list(json_row))
|
|
```
|