Rectangling polish
parent
279611af8a
commit
fc3641a376
135
rectangling.qmd
|
@ -10,17 +10,17 @@ status("polishing")
|
||||||
## Introduction
|
## Introduction
|
||||||
|
|
||||||
In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns.
|
In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns.
|
||||||
This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API.
|
This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.
|
||||||
|
|
||||||
To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R.
|
To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible.
|
||||||
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children in rows, and `tidyr::unnest_wider()`, which converts children into columns.
|
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()` and `tidyr::unnest_wider()`.
|
||||||
We'll then show you a few case studies, applying these simple function multiple times to solve real problems.
|
We'll then show you a few case studies, applying these simple functions again and again to solve real problems.
|
||||||
We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.
|
We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
In this chapter we'll use many functions from tidyr, a core member of the tidyverse.
|
In this chapter we'll use many functions from tidyr, a core member of the tidyverse.
|
||||||
We'll also use repurrrsive to provide some interesting datasets rectangling practice, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists.
|
We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| label: setup
|
#| label: setup
|
||||||
|
@ -33,8 +33,8 @@ library(jsonlite)
|
||||||
|
|
||||||
## Lists
|
## Lists
|
||||||
|
|
||||||
So far we've used simple vectors like integers, numbers, characters, date-times, and factors.
|
So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors.
|
||||||
These vectors are simple because they're homogeneous: every element is same type.
|
These vectors are simple because they're homogeneous: every element is the same type.
|
||||||
If you want to store elements of different types in the same vector, you'll need a **list**, which you create with `list()`:
|
If you want to store elements of different types in the same vector, you'll need a **list**, which you create with `list()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -86,16 +86,21 @@ x5 <- list(1, list(2, list(3, list(4, list(5)))))
|
||||||
str(x5)
|
str(x5)
|
||||||
```
|
```
|
||||||
|
|
||||||
As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1].
|
As lists get even larger and more complex, `str()` eventually starts to fail, and you'll need to switch to `View()`[^rectangling-1].
|
||||||
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
|
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-lists.
|
||||||
|
|
||||||
[^rectangling-1]: This is an RStudio feature.
|
[^rectangling-1]: This is an RStudio feature.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| label: fig-view-collapsed
|
#| label: fig-view-collapsed
|
||||||
#| fig.cap: >
|
#| fig.cap: >
|
||||||
#| The RStudio allows you to interactively explore a complex list.
|
#| The RStudio view lets you interactively explore a complex list.
|
||||||
#| The viewer opens showing only the top level of the list.
|
#| The viewer opens showing only the top level of the list.
|
||||||
|
#| fig.alt: >
|
||||||
|
#| A screenshot of RStudio showing the list-viewer. It shows the
|
||||||
|
#| two children of x4: the first child is a double vector and the
|
||||||
|
#| second child is a list. A rightward facing triangle indicates that the
|
||||||
|
#| second child itself has children but you can't see them.
|
||||||
#| echo: false
|
#| echo: false
|
||||||
#| out-width: NULL
|
#| out-width: NULL
|
||||||
knitr::include_graphics("screenshots/View-1.png", dpi = 220)
|
knitr::include_graphics("screenshots/View-1.png", dpi = 220)
|
||||||
|
@ -106,6 +111,10 @@ knitr::include_graphics("screenshots/View-1.png", dpi = 220)
|
||||||
#| fig.cap: >
|
#| fig.cap: >
|
||||||
#| Clicking on the rightward facing triangle expands that component
|
#| Clicking on the rightward facing triangle expands that component
|
||||||
#| of the list so that you can also see its children.
|
#| of the list so that you can also see its children.
|
||||||
|
#| fig.alt: >
|
||||||
|
#| Another screenshot of the list-viewer having expanded the second
|
||||||
|
#| child of x4. It also has two children, a double vector and another
|
||||||
|
#| list.
|
||||||
#| echo: false
|
#| echo: false
|
||||||
#| out-width: NULL
|
#| out-width: NULL
|
||||||
knitr::include_graphics("screenshots/View-2.png", dpi = 220)
|
knitr::include_graphics("screenshots/View-2.png", dpi = 220)
|
||||||
|
@ -115,9 +124,12 @@ knitr::include_graphics("screenshots/View-2.png", dpi = 220)
|
||||||
#| label: fig-view-expand-2
|
#| label: fig-view-expand-2
|
||||||
#| fig.cap: >
|
#| fig.cap: >
|
||||||
#| You can repeat this operation as many times as needed to get to the
|
#| You can repeat this operation as many times as needed to get to the
|
||||||
#| data you're interested in. Note the bottom-right corner: if you click
|
#| data you're interested in. Note the bottom-left corner: if you click
|
||||||
#| an element of the list, RStudio will give you the subsetting code
|
#| an element of the list, RStudio will give you the subsetting code
|
||||||
#| needed to access it, in this case `x4[[2]][[2]][[2]]`.
|
#| needed to access it, in this case `x4[[2]][[2]][[2]]`.
|
||||||
|
#| fig.alt: >
|
||||||
|
#| Another screenshot, having expanded the grandchild of x4 to see its
|
||||||
|
#| two children, again a double vector and a list.
|
||||||
#| echo: false
|
#| echo: false
|
||||||
#| out-width: NULL
|
#| out-width: NULL
|
||||||
knitr::include_graphics("screenshots/View-3.png", dpi = 220)
|
knitr::include_graphics("screenshots/View-3.png", dpi = 220)
|
||||||
|
@ -173,11 +185,11 @@ It's possible to put a list in a column of a `data.frame`, but it's a lot fiddli
|
||||||
data.frame(x = list(1:3, 3:5))
|
data.frame(x = list(1:3, 3:5))
|
||||||
```
|
```
|
||||||
|
|
||||||
You can force `data.frame()` to treat a list as a list of rows by wrapping it in list `I()`, but the result doesn't print particularly usefully:
|
You can force `data.frame()` to treat a list as a list of rows by wrapping it in `I()`, but the result doesn't print particularly well:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
data.frame(
|
data.frame(
|
||||||
x = I(list(1:3, 3:5)),
|
x = I(list(1:2, 3:5)),
|
||||||
y = c("1, 2", "3, 4, 5")
|
y = c("1, 2", "3, 4, 5")
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
@ -188,14 +200,12 @@ It's easier to use list-columns with tibbles because `tibble()` treats lists lik
|
||||||
## Unnesting
|
## Unnesting
|
||||||
|
|
||||||
Now that you've learned the basics of lists and list-columns, let's explore how you can turn them back into regular rows and columns.
|
Now that you've learned the basics of lists and list-columns, let's explore how you can turn them back into regular rows and columns.
|
||||||
We'll start with very simple sample data so you can get the basic idea, and then switch to more realistic examples in the next section.
|
Here we'll use very simple sample data so you can get the basic idea; in the next section we'll switch to real data.
|
||||||
|
|
||||||
List-columns tend to come in two basic forms: named and unnamed.
|
List-columns tend to come in two basic forms: named and unnamed.
|
||||||
When the children are **named**, they tend to have the same names in every row.
|
When the children are **named**, they tend to have the same names in every row.
|
||||||
When the children are **unnamed**, the number of elements tends to vary from row-to-row.
|
For example, in `df1`, every element of list-column `y` has two elements named `a` and `b`.
|
||||||
The following code creates an example of each.
|
Named list-columns naturally unnest into columns: each named element becomes a new named column.
|
||||||
In `df1`, every element of list-column `y` has two elements named `a` and `b`.
|
|
||||||
In `df2`, the elements of list-column `y` are unnamed and vary in length.
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df1 <- tribble(
|
df1 <- tribble(
|
||||||
|
@ -204,6 +214,13 @@ df1 <- tribble(
|
||||||
2, list(a = 21, b = 22),
|
2, list(a = 21, b = 22),
|
||||||
3, list(a = 31, b = 32),
|
3, list(a = 31, b = 32),
|
||||||
)
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
When the children are **unnamed**, the number of elements tends to vary from row-to-row.
|
||||||
|
For example, in `df2`, the elements of list-column `y` are unnamed and vary in length from one to three.
|
||||||
|
Unnamed list-columns naturally unnest into rows: you'll get one row for each child.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
|
||||||
df2 <- tribble(
|
df2 <- tribble(
|
||||||
~x, ~y,
|
~x, ~y,
|
||||||
|
@ -213,9 +230,7 @@ df2 <- tribble(
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
Named list-columns naturally unnest into columns: each named element becomes a new named column.
|
tidyr provides two functions for these two cases: `unnest_wider()` and `unnest_longer()`.
|
||||||
Unnamed list-columns naturally unnested in to rows: you'll get one row for each child.
|
|
||||||
tidyr provides two functions for these two case: `unnest_wider()` and `unnest_longer()`.
|
|
||||||
The following sections explain how they work.
|
The following sections explain how they work.
|
||||||
|
|
||||||
### `unnest_wider()`
|
### `unnest_wider()`
|
||||||
|
@ -227,7 +242,7 @@ df1 |>
|
||||||
unnest_wider(y)
|
unnest_wider(y)
|
||||||
```
|
```
|
||||||
|
|
||||||
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
|
By default, the names of the new columns come exclusively from the names of the list elements, but you can use the `names_sep` argument to request that they combine the column name and the element name.
|
||||||
This is useful for disambiguating repeated names.
|
This is useful for disambiguating repeated names.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -255,7 +270,7 @@ df2 |>
|
||||||
```
|
```
|
||||||
|
|
||||||
Note how `x` is duplicated for each element inside of `y`: we get one row of output for each element inside the list-column.
|
Note how `x` is duplicated for each element inside of `y`: we get one row of output for each element inside the list-column.
|
||||||
But what happens if the list-column is empty, as in the following example?
|
But what happens if one of the elements is empty, as in the following example?
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df6 <- tribble(
|
df6 <- tribble(
|
||||||
|
@ -270,15 +285,15 @@ df6 |> unnest_longer(y)
|
||||||
We get zero rows in the output, so the row effectively disappears.
|
We get zero rows in the output, so the row effectively disappears.
|
||||||
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
|
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
|
||||||
|
|
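As a minimal sketch of the `keep_empty` behaviour described above: the code assumes a tidyr release in which the linked issue has been resolved (we believe 1.3.0 or later), and `df_empty` is a hypothetical tibble built only for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical data: the last row holds an empty list.
df_empty <- tribble(
  ~x, ~y,
  "a", list(1, 2),
  "b", list(3),
  "c", list()
)

# Assumption: this tidyr version supports keep_empty in unnest_longer().
# The row for "c" should be kept, with y filled in as NA.
out <- df_empty |>
  unnest_longer(y, keep_empty = TRUE)
out
```

On an older tidyr this call is expected to fail with an unused-argument error, in which case the default behaviour of dropping the row still applies.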
||||||
You can also unnest named list-columns, like `df1$y` into the rows.
|
You can also unnest named list-columns, like `df1$y`, into rows.
|
||||||
Because the elements are named, and those names might be useful data, puts them in a new column with the suffix `_id`:
|
Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix `_id`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df1 |>
|
df1 |>
|
||||||
unnest_longer(y)
|
unnest_longer(y)
|
||||||
```
|
```
|
||||||
|
|
||||||
If you don't want these `ids`, you can suppress this with `indices_include = FALSE`.
|
If you don't want these `ids`, you can suppress them with `indices_include = FALSE`.
|
||||||
On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
|
On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
|
||||||
You can do this with `indices_include = TRUE`:
|
You can do this with `indices_include = TRUE`:
|
||||||
|
|
||||||
|
@ -311,7 +326,7 @@ df4 |>
|
||||||
|
|
||||||
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
|
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
|
||||||
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
|
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
|
||||||
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is a still a list, and each component of that list contains something different.
|
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite: every element is still a list, even though the contents of each element are a different type.
|
||||||
|
|
||||||
What happens if you find this problem in a dataset you're trying to rectangle?
|
What happens if you find this problem in a dataset you're trying to rectangle?
|
||||||
There are two basic options.
|
There are two basic options.
|
||||||
|
@ -328,8 +343,7 @@ Another option would be to filter down to the rows that have values of a specifi
|
||||||
```{r}
|
```{r}
|
||||||
df4 |>
|
df4 |>
|
||||||
unnest_longer(y) |>
|
unnest_longer(y) |>
|
||||||
rowwise() |>
|
filter(map_lgl(y, is.numeric))
|
||||||
filter(is.numeric(y))
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Then you can call `unnest_longer()` once more:
|
Then you can call `unnest_longer()` once more:
|
||||||
|
@ -337,20 +351,21 @@ Then you can call `unnest_longer()` once more:
|
||||||
```{r}
|
```{r}
|
||||||
df4 |>
|
df4 |>
|
||||||
unnest_longer(y) |>
|
unnest_longer(y) |>
|
||||||
rowwise() |>
|
filter(map_lgl(y, is.numeric)) |>
|
||||||
filter(is.numeric(y)) |>
|
|
||||||
unnest_longer(y)
|
unnest_longer(y)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
You'll learn more about `map_lgl()` in @sec-iteration.
|
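To see why the filter is written with `map_lgl()`, here is a small sketch on made-up data (`mixed` is hypothetical and is not one of the chapter's datasets). Called on the whole list-column, `is.numeric()` answers a single question about the list itself; `map_lgl()` asks it once per element.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical data: a list-column whose elements have different types.
mixed <- tibble(
  x = c("a", "b", "c"),
  y = list("text", 1.5, TRUE)
)

# One answer for the whole column: a list is not numeric.
whole_column <- is.numeric(mixed$y)

# One answer per element, so it can be used as a row filter.
is_num <- map_lgl(mixed$y, is.numeric)

# Keep the numeric rows, then unnest the now-homogeneous list-column.
numeric_only <- mixed |>
  filter(map_lgl(y, is.numeric)) |>
  unnest_longer(y)
numeric_only
```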
||||||
|
|
||||||
### Other functions
|
### Other functions
|
||||||
|
|
||||||
tidyr has a few other useful rectangling functions that we're not going to cover in this book:
|
tidyr has a few other useful rectangling functions that we're not going to cover in this book:
|
||||||
|
|
||||||
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
|
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
|
||||||
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
|
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book.
|
||||||
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
|
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
|
||||||
|
|
||||||
These are good to know about when you're other people's code and for tackling rarer rectangling challenges.
|
These are good to know about when you're reading other people's code or tackling rarer rectangling challenges.
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
|
@ -370,13 +385,12 @@ These are good to know about when you're other people's code and for tackling ra
|
||||||
|
|
||||||
## Case studies
|
## Case studies
|
||||||
|
|
||||||
So far you've learned about the simplest case of list-columns, where rectangling only requires a single call to `unnest_longer()` or `unnest_wider()`.
|
The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and/or `unnest_wider()`.
|
||||||
The main difference between real data and these simple examples is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and `unnest_wider()`.
|
This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we've encountered in the wild.
|
||||||
This section will work through four real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
|
|
||||||
|
|
||||||
### Very wide data
|
### Very wide data
|
||||||
|
|
||||||
We'll start by exploring `gh_repos`.
|
We'll start with `gh_repos`.
|
||||||
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
|
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
|
||||||
|
|
||||||
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
|
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
|
||||||
|
@ -389,7 +403,7 @@ repos
|
||||||
|
|
||||||
This tibble contains 6 rows, one row for each child of `gh_repos`.
|
This tibble contains 6 rows, one row for each child of `gh_repos`.
|
||||||
Each row contains an unnamed list with either 26 or 30 rows.
|
Each row contains an unnamed list with either 26 or 30 rows.
|
||||||
Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row:
|
Since these are unnamed, we'll start with `unnest_longer()` to put each child in its own row:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
repos |>
|
repos |>
|
||||||
|
@ -437,6 +451,8 @@ repos |>
|
||||||
unnest_wider(owner)
|
unnest_wider(owner)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
|
||||||
|
|
||||||
Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
|
Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
|
||||||
Rather than following the advice to use `names_repair` (which would also work), we'll instead use `names_sep`:
|
Rather than following the advice to use `names_repair` (which would also work), we'll instead use `names_sep`:
|
||||||
|
|
||||||
|
@ -461,14 +477,14 @@ chars <- tibble(json = got_chars)
|
||||||
chars
|
chars
|
||||||
```
|
```
|
||||||
|
|
||||||
The `json` column contains named values, so we'll start by widening it:
|
The `json` column contains named elements, so we'll start by widening it:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
chars |>
|
chars |>
|
||||||
unnest_wider(json)
|
unnest_wider(json)
|
||||||
```
|
```
|
||||||
|
|
||||||
And selecting a few columns just to make it easier to read:
|
And selecting a few columns to make it easier to read:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
characters <- chars |>
|
characters <- chars |>
|
||||||
|
@ -508,16 +524,15 @@ titles <- chars |>
|
||||||
titles
|
titles
|
||||||
```
|
```
|
||||||
|
|
||||||
Now, for example, we could use this table to all the characters that are captains and see all their titles:
|
Now, for example, we could use this table to find all the characters that are captains and see all their titles:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
captains <- titles |> filter(str_detect(title, "Captain"))
|
captains <- titles |> filter(str_detect(title, "Captain"))
|
||||||
captains
|
captains
|
||||||
|
|
||||||
characters |>
|
characters |>
|
||||||
semi_join(captains, by = "id") |>
|
|
||||||
select(id, name) |>
|
select(id, name) |>
|
||||||
left_join(titles, by = "id", multiple = "all")
|
inner_join(titles, by = "id", multiple = "all")
|
||||||
```
|
```
|
||||||
|
|
||||||
You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
|
You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
|
||||||
|
@ -540,7 +555,7 @@ titles |>
|
||||||
unnest_longer(word)
|
unnest_longer(word)
|
||||||
```
|
```
|
||||||
|
|
||||||
And then we can count that column to find the most common:
|
And then we can count that column to find the most common words:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
titles |>
|
titles |>
|
||||||
|
@ -680,6 +695,7 @@ If these case studies have whetted your appetite for more real-life rectangling,
|
||||||
Why does it work for `got_chars` but might not work in general?
|
Why does it work for `got_chars` but might not work in general?
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
#| results: false
|
||||||
tibble(json = got_chars) |>
|
tibble(json = got_chars) |>
|
||||||
unnest_wider(json) |>
|
unnest_wider(json) |>
|
||||||
select(id, where(is.list)) %>%
|
select(id, where(is.list)) %>%
|
||||||
|
@ -699,7 +715,7 @@ If these case studies have whetted your appetite for more real-life rectangling,
|
||||||
|
|
||||||
## JSON
|
## JSON
|
||||||
|
|
||||||
All of the case studies in the previous section were sourced from wild-caught JSON files.
|
All of the case studies in the previous section were sourced from wild-caught JSON.
|
||||||
JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data.
|
JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data.
|
||||||
It's important to understand it because while JSON and R's data types are pretty similar, there isn't a perfect 1-to-1 mapping, so it's good to understand a bit about JSON if things go wrong.
|
It's important to understand it because while JSON and R's data types are pretty similar, there isn't a perfect 1-to-1 mapping, so it's good to understand a bit about JSON if things go wrong.
|
||||||
|
|
||||||
|
@ -709,27 +725,28 @@ JSON is a simple format designed to be easily read and written by machines, not
|
||||||
It has six key data types.
|
It has six key data types.
|
||||||
Four of them are scalars:
|
Four of them are scalars:
|
||||||
|
|
||||||
- The simplest type is a null, which is written `null`, which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
|
- The simplest type is a null (`null`) which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
|
||||||
- A **string** is much like a string in R, but must use double quotes, not single quotes.
|
- A **string** is much like a string in R, but must always use double quotes.
|
||||||
- A **number** is similar to R's numbers: they can be integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
|
- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
|
||||||
- A **boolean** is similar to R's `TRUE` and `FALSE`, but use lower case `true` and `false`.
|
- A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.
|
||||||
|
|
||||||
JSON's strings, numbers, and booleans are pretty similar to R's character, numeric, and logical vectors.
|
JSON's strings, numbers, and booleans are pretty similar to R's character, numeric, and logical vectors.
|
||||||
The main difference is that JSON's scalars can only represent a single value.
|
The main difference is that JSON's scalars can only represent a single value.
|
||||||
To represent multiple values you need to use one of the two remaining types, arrays and objects.
|
To represent multiple values you need to use one of the two remaining types: arrays and objects.
|
||||||
|
|
||||||
Both arrays and objects are similar to lists in R; the difference is whether or not they're named.
|
Both arrays and objects are similar to lists in R; the difference is whether or not they're named.
|
||||||
An **array** is like an unnamed list, and is written with `[]`.
|
An **array** is like an unnamed list, and is written with `[]`.
|
||||||
For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
|
For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
|
||||||
An **object** is like a named list, and it's written with `{}`.
|
An **object** is like a named list, and is written with `{}`.
|
||||||
|
The names (keys in JSON terminology) are strings, so must be surrounded by quotes.
|
||||||
For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
|
For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
|
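The mapping between these six types and R can be checked directly. The sketch below uses a made-up JSON string (all keys and values are hypothetical) and `jsonlite::parse_json()` with its default settings, which return nested lists rather than simplified vectors.

```r
library(jsonlite)

# Hypothetical JSON covering a string, number, boolean, null, array, and object.
json <- '{
  "name": "example",
  "count": 3,
  "active": true,
  "missing": null,
  "scores": [1, 2, 3],
  "owner": {"id": 1, "login": "someone"}
}'

parsed <- parse_json(json)

# The array arrives as an unnamed list, the object as a named list,
# and the null as NULL.
str(parsed)
```

Because arrays come back as lists rather than atomic vectors, the result slots straight into a list-column for the unnesting functions.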
||||||
|
|
||||||
### jsonlite
|
### jsonlite
|
||||||
|
|
||||||
To convert JSON into R data structures, we recommend that you use the jsonlite package, by Jeroen Oooms.
|
To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms.
|
||||||
We'll use only two jsonlite functions: `read_json()` and `parse_json()`.
|
We'll use only two jsonlite functions: `read_json()` and `parse_json()`.
|
||||||
In real life, you'll use `read_json()` to read a JSON file from disk.
|
In real life, you'll use `read_json()` to read a JSON file from disk.
|
||||||
For example, the repurrsive package also provides the source for `gh_user` as a JSON file:
|
For example, the repurrrsive package also provides the source for `gh_user` as a JSON file and you can read it with `read_json()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
# A path to a json file inside the package:
|
# A path to a json file inside the package:
|
||||||
|
@ -767,6 +784,7 @@ json <- '[
|
||||||
]'
|
]'
|
||||||
df <- tibble(json = parse_json(json))
|
df <- tibble(json = parse_json(json))
|
||||||
df
|
df
|
||||||
|
|
||||||
df |>
|
df |>
|
||||||
unnest_wider(json)
|
unnest_wider(json)
|
||||||
```
|
```
|
||||||
|
@ -785,6 +803,7 @@ json <- '{
|
||||||
'
|
'
|
||||||
df <- tibble(json = list(parse_json(json)))
|
df <- tibble(json = list(parse_json(json)))
|
||||||
df
|
df
|
||||||
|
|
||||||
df |>
|
df |>
|
||||||
unnest_wider(json) |>
|
unnest_wider(json) |>
|
||||||
unnest_longer(results) |>
|
unnest_longer(results) |>
|
||||||
|
@ -828,3 +847,13 @@ Apply `readr::parse_double()` as needed to the get correct variable type.
|
||||||
df_col <- tibble(json = list(json_col))
|
df_col <- tibble(json = list(json_col))
|
||||||
df_row <- tibble(json = json_row)
|
df_row <- tibble(json = json_row)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames.
|
||||||
|
Surprisingly we only need two new functions: `unnest_longer()` to put list elements into rows and `unnest_wider()` to put list elements into columns.
|
||||||
|
It doesn't matter how deeply nested the list-column is, all you need to do is repeatedly call these two functions.
|
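As a compact illustration of that repeated pattern, the sketch below rectangles a small two-level list. The `records` object is hypothetical and much simpler than the repurrrsive datasets, but it follows the same recipe: widen the named level, then lengthen the unnamed one.

```r
library(tidyr)
library(tibble)

# Hypothetical two-level structure: an unnamed list of named records,
# each holding an unnamed list of tags.
records <- list(
  list(id = 1, tags = list("a", "b")),
  list(id = 2, tags = list("c"))
)

flat <- tibble(json = records) |>
  unnest_wider(json) |>  # named children become columns: id, tags
  unnest_longer(tags)    # unnamed children become rows
flat
```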
||||||
|
|
||||||
|
JSON is the most common data format returned by web APIs.
|
||||||
|
What happens if the website doesn't have an API, but you can see data you want on the website?
|
||||||
|
That's the topic of the next chapter: web scraping, extracting data from HTML webpages.
|
||||||
|
|