Polishing, up to end of case studies
parent
a8a3abe706
commit
fe270b927b
277
rectangle.qmd
|
@ -33,17 +33,16 @@ library(jsonlite)
|
||||||
|
|
||||||
## Lists
|
## Lists
|
||||||
|
|
||||||
So far we've used simple vectors, like integers, numbers, characters, date-times, and factors.
|
So far we've used simple vectors like integers, numbers, characters, date-times, and factors.
|
||||||
These vectors are all homogeneous: every element must be the same type.
|
These vectors are simple because they're homogeneous: every element is the same type.
|
||||||
If you want to store element of different types, you need a **list**.
|
If you want to store elements of different types, you need a **list**, which you create with `list()`:
|
||||||
You can create a list with `list()`:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x1 <- list(1:4, "a", TRUE)
|
x1 <- list(1:4, "a", TRUE)
|
||||||
x1
|
x1
|
||||||
```
|
```
|
||||||
|
|
||||||
It's often convenient to name the components of a list, which you can do in the same way as naming the columns of a tibble:
|
It's often convenient to name the components, or **children**, of a list, which you can do in the same way as naming the columns of a tibble:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x2 <- list(a = 1:2, b = 1:3, c = 1:4)
|
x2 <- list(a = 1:2, b = 1:3, c = 1:4)
|
||||||
|
@ -51,15 +50,15 @@ x2
|
||||||
```
|
```
|
||||||
|
|
||||||
Even for these very simple lists, printing takes up quite a lot of space.
|
Even for these very simple lists, printing takes up quite a lot of space.
|
||||||
A very useful alternative is `str()`, short for structure, which generates a compact display of the **str**ucture, de-emphasizing the contents:
|
A useful alternative is `str()`, which generates a compact display of the **str**ucture, de-emphasizing the contents:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str(x1)
|
str(x1)
|
||||||
str(x2)
|
str(x2)
|
||||||
```
|
```
|
||||||
|
|
||||||
`str()` display each element (or **child**) of a list on its own line.
|
As you can see, `str()` displays each child on its own line.
|
||||||
It displays the name if present, then an abbreviation of the type, then the first few values.
|
It displays the name, if present, then an abbreviation of the type, then the first few values.
|
||||||
|
|
||||||
### Hierarchy
|
### Hierarchy
|
||||||
|
|
||||||
|
@ -71,24 +70,26 @@ x3 <- list(list(1, 2), list(3, 4))
|
||||||
str(x3)
|
str(x3)
|
||||||
```
|
```
|
||||||
|
|
||||||
This is different to `c()`, which generates a flat vector:
|
This is notably different to `c()`, which generates a flat vector:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
c(c(1, 2), c(3, 4))
|
c(c(1, 2), c(3, 4))
|
||||||
```
|
|
||||||
|
|
||||||
You can see how `str()` starts to get even more useful as the lists get more complex, and how it allows you to see the hierarchy at a glance.
|
x4 <- c(list(1, 2), list(3, 4))
|
||||||
|
|
||||||
```{r}
|
|
||||||
x4 <- list(1, list(2, list(3, list(4, list(5)))))
|
|
||||||
str(x4)
|
str(x4)
|
||||||
```
|
```
|
||||||
|
|
||||||
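To make the contrast concrete, here is a small sketch in base R (the names `nested` and `flat` are only illustrative): `list()` keeps each inner list as a single child, while `c()` splices the children together into one flat list.

```{r}
# list() nests: the result has two children, each of which is itself a list
nested <- list(list(1, 2), list(3, 4))
length(nested)

# c() flattens: the result has four children and no inner lists
flat <- c(list(1, 2), list(3, 4))
length(flat)

# The flattened version is the same as writing the list out by hand
identical(flat, list(1, 2, 3, 4))
```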
At some point, however, even `str()` starts to fail, and if you're working with deeply nested lists in RStudio, I highly recommend using `View()`.
|
As lists get more complex, `str()` gets more useful, as it lets you see the hierarchy at a glance:
|
||||||
@fig-view-collapsed shows the result of calling `View(x4)`.
|
|
||||||
The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1.
|
```{r}
|
||||||
RStudio will also show you the code you need to access that element, as in @fig-view-expand-2.
|
x5 <- list(1, list(2, list(3, list(4, list(5)))))
|
||||||
We'll come back to how this code works in @sec-vector-subsetting.
|
str(x5)
|
||||||
|
```
|
||||||
|
|
||||||
|
As lists get even larger and more complex, even `str()` starts to fail, and you'll need to switch to `View()`[^rectangle-1].
|
||||||
|
@fig-view-collapsed shows the result of calling `View(x5)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
|
||||||
|
|
||||||
|
[^rectangle-1]: This is an RStudio feature.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| label: fig-view-collapsed
|
#| label: fig-view-collapsed
|
||||||
|
@ -122,9 +123,13 @@ knitr::include_graphics("screenshots/View-2.png", dpi = 220)
|
||||||
knitr::include_graphics("screenshots/View-3.png", dpi = 220)
|
knitr::include_graphics("screenshots/View-3.png", dpi = 220)
|
||||||
```
|
```
|
||||||
|
|
||||||
### List columns
|
### List-columns
|
||||||
|
|
||||||
You can put lists in the column of a tibble:
|
Lists can also live inside a tibble, where we call them list-columns.
|
||||||
|
List-columns are useful because they allow you to shoehorn in objects that wouldn't usually belong in a data frame.
|
||||||
|
List-columns are used a lot in the tidymodels ecosystem, because they allow you to store things like models or resamples in a data frame.
|
||||||
|
|
||||||
|
Here's a simple example of a list-column:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df <- tibble(
|
df <- tibble(
|
||||||
|
@ -135,16 +140,15 @@ df <- tibble(
|
||||||
df
|
df
|
||||||
```
|
```
|
||||||
|
|
||||||
This is a powerful idea because it allows you to store arbitrarily complex objects in a data frame; even things that wouldn't typically belong there.
|
There's nothing special about lists in a tibble; they behave like any other column:
|
||||||
This idea is used a lot in tidymodels, because it allows you to store things like models or resamples in a data frame.
|
|
||||||
|
|
||||||
And those things are carried along like any other column:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df |>
|
df |>
|
||||||
filter(x == 1)
|
filter(x == 1)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Computing with them is harder, but that's because computing with lists is harder; we'll come back to that in @sec-iteration.
|
||||||
|
|
||||||
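As a rough sketch of the "models in a data frame" idea, the following stores two fitted linear models in a list-column. The tibble `models` and its column names are made up for illustration and use the built-in `mtcars` data; this is not how tidymodels itself structures its results.

```{r}
library(tibble)

# Each row pairs a description of a model with the fitted model object itself
models <- tibble(
  formula = c("mpg ~ wt", "mpg ~ hp"),
  fit = list(
    lm(mpg ~ wt, data = mtcars),
    lm(mpg ~ hp, data = mtcars)
  )
)
models

# Individual models are extracted like any other list element
class(models$fit[[1]])
```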
The default print method just displays a rough summary of the contents.
|
The default print method just displays a rough summary of the contents.
|
||||||
The list column could be arbitrarily complex, so there's no good way to print it.
|
The list column could be arbitrarily complex, so there's no good way to print it.
|
||||||
If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you learned above:
|
If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you learned above:
|
||||||
|
@ -158,21 +162,19 @@ df |>
|
||||||
|
|
||||||
Similarly, if you `View()` a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns.
|
Similarly, if you `View()` a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns.
|
||||||
To explore those fields you'll need to `pull()` and view, e.g.
|
To explore those fields you'll need to `pull()` and view, e.g.
|
||||||
`View(pull(df, z))`
|
`View(pull(df, z))`.
|
||||||
|
|
||||||
::: callout-note
|
::: callout-note
|
||||||
## Base R
|
## Base R
|
||||||
|
|
||||||
It's possible to put a list in a column of a `data.frame`, but it's a lot fiddlier.
|
It's possible to put a list in a column of a `data.frame`, but it's a lot fiddlier.
|
||||||
List-columns are implicit in the definition of the data frame: a data frame is a named list of equal length vectors.
|
|
||||||
A list is a vector, so it's always been legitimate to use a list as a column of a data frame.
|
|
||||||
However, base R doesn't make it easy to create list-columns because `data.frame()` treats a list as a list of columns:
|
However, base R doesn't make it easy to create list-columns because `data.frame()` treats a list as a list of columns:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
data.frame(x = list(1:3, 3:5))
|
data.frame(x = list(1:3, 3:5))
|
||||||
```
|
```
|
||||||
|
|
||||||
You can prevent `data.frame()` from doing this with `I()`, but the result doesn't print particularly well:
|
You can prevent `data.frame()` from doing this with `I()`, but the result doesn't print particularly informatively:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
data.frame(
|
data.frame(
|
||||||
|
@ -186,16 +188,12 @@ Tibbles make it easier to work with list-columns because `tibble()` doesn't modi
|
||||||
|
|
||||||
## Unnesting
|
## Unnesting
|
||||||
|
|
||||||
Now that you've learned the basics of lists and how you can use them as a column of a data frame, lets start to see how you can turn them back into regular columns and rows so you can use them with the tidyverse functions you've already learned about.
|
Now that you've learned the basics of lists and list-columns, let's explore how you can turn them back into regular rows and columns.
|
||||||
We'll start with very simple sample data so you can get the idea of how things work, and then in the next section switch to more realistic examples.
|
We'll start with very simple sample data so you can get the basic idea, and then in the next section switch to more realistic examples.
|
||||||
|
|
||||||
Lists tend to come in two basic forms:
|
|
||||||
|
|
||||||
- A named list where every row has the same number of children with the same names. Every name has the same type.
|
|
||||||
- An unnamed list where the number of children varies from row to row, and all the types are the same.
|
|
||||||
|
|
||||||
More complicated examples just combine these in multiple ways.
|
|
||||||
|
|
||||||
|
List-columns tend to come in two basic forms: named and unnamed.
|
||||||
|
When the children are **named**, they tend to have the same names in every row.
|
||||||
|
When the children are **unnamed**, the number of elements tends to vary from row to row.
|
||||||
The following code creates an example of each.
|
The following code creates an example of each.
|
||||||
In `df1`, every element of list-column `y` has two elements named `a` and `b`.
|
In `df1`, every element of list-column `y` has two elements named `a` and `b`.
|
||||||
In `df2`, the elements of list-column `y` are unnamed and vary in length.
|
In `df2`, the elements of list-column `y` are unnamed and vary in length.
|
||||||
|
@ -216,12 +214,10 @@ df2 <- tribble(
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
These two cases correspond to two tools from tidyr: `unnest_wider()` and `unnest_longer()`.
|
Named list-columns naturally unnest into columns: each named element becomes a new named column.
|
||||||
Their suffixes have the same meaning as `pivot_wider()` and `pivot_longer()`: `_wider()` adds more columns and `_longer()` adds more rows.
|
Unnamed list-columns naturally unnest into rows: you'll get one row for each child.
|
||||||
If your situation isn't as clear cut as these cases, you'll still need to use one of `unnest_longer()` and `unnest_wider()`; you'll just need to do a bit more thinking and experimentation to figure out which one is best.
|
tidyr provides two functions for these two cases: `unnest_wider()` and `unnest_longer()`.
|
||||||
|
The following sections explain how they work.
|
||||||
The main difference between these simple examples and real data is that there's only one level of nesting here.
|
|
||||||
In real-life, there will often be many, and you'll need to use multiple calls to `unnest_wider()` and `unnest_longer()` to handle it.
|
|
||||||
|
|
||||||
### `unnest_wider()`
|
### `unnest_wider()`
|
||||||
|
|
||||||
|
@ -232,8 +228,8 @@ df1 |>
|
||||||
unnest_wider(y)
|
unnest_wider(y)
|
||||||
```
|
```
|
||||||
|
|
||||||
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the original column with the new column.
|
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
|
||||||
As you'll learn in the next section, this is useful for disambiguating repeated names.
|
This is useful for disambiguating repeated names.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df1 |>
|
df1 |>
|
||||||
|
@ -241,8 +237,7 @@ df1 |>
|
||||||
```
|
```
|
||||||
|
|
||||||
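To see why `names_sep` helps with repeated names, consider this minimal sketch with a hypothetical tibble `df_dup` whose two list-columns share the child names `a` and `b`. Without `names_sep`, unnesting both columns would try to create two columns called `a` and two called `b`; with it, each new column is prefixed by the column it came from.

```{r}
library(tibble)
library(tidyr)

# Hypothetical data: `y` and `z` both have children named `a` and `b`
df_dup <- tibble(
  x = 1,
  y = list(list(a = 1, b = 2)),
  z = list(list(a = 3, b = 4))
)

out <- df_dup |>
  unnest_wider(y, names_sep = "_") |>
  unnest_wider(z, names_sep = "_")

names(out)
```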
We can also use `unnest_wider()` with unnamed list-columns, as in `df2`.
|
We can also use `unnest_wider()` with unnamed list-columns, as in `df2`.
|
||||||
It's not as naturally well suited, because it's not clear what the columns should be named.
|
Since columns require names but the list lacks them, `unnest_wider()` will label them with consecutive integers:
|
||||||
So `unnest_wider()` gives them numbers:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df2 |>
|
df2 |>
|
||||||
|
@ -250,7 +245,6 @@ df2 |>
|
||||||
```
|
```
|
||||||
|
|
||||||
You'll notice that `unnest_wider()`, much like `pivot_wider()`, turns implicit missing values into explicit missing values.
|
You'll notice that `unnest_wider()`, much like `pivot_wider()`, turns implicit missing values into explicit missing values.
|
||||||
Another challenge is that if you're working with live data, you won't know exactly how many columns you'll end up with.
|
|
||||||
|
|
||||||
### `unnest_longer()`
|
### `unnest_longer()`
|
||||||
|
|
||||||
|
@ -261,26 +255,8 @@ df2 |>
|
||||||
unnest_longer(y)
|
unnest_longer(y)
|
||||||
```
|
```
|
||||||
|
|
||||||
You can also apply the same operation to named list-columns, like `df1$y`:
|
Note how `x` is duplicated for each element inside of `y`: we get one row of output for each element inside the list-column.
|
||||||
|
But what happens if the list-column is empty, as in the following example?
|
||||||
```{r}
|
|
||||||
df1 |>
|
|
||||||
unnest_longer(y)
|
|
||||||
```
|
|
||||||
|
|
||||||
Note the new `y_id` column.
|
|
||||||
Because the elements are named, and those names might be useful data, tidyr keeps them in the result data in a new column with the `_id` suffix.
|
|
||||||
You can suppress this with `indices_include = FALSE`.
|
|
||||||
|
|
||||||
You might also use `indices_include = TRUE` if the position of the elements is important in the unnamed case:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
df2 |>
|
|
||||||
unnest_longer(y, indices_include = TRUE)
|
|
||||||
```
|
|
||||||
|
|
||||||
The output contains one row for each element inside the list-column.
|
|
||||||
So what happens if the list-column is empty?
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df6 <- tribble(
|
df6 <- tribble(
|
||||||
|
@ -292,14 +268,30 @@ df6 <- tribble(
|
||||||
df6 |> unnest_longer(y)
|
df6 |> unnest_longer(y)
|
||||||
```
|
```
|
||||||
|
|
||||||
The row goes away!
|
We get zero rows in the output, so the row effectively disappears.
|
||||||
--- <https://github.com/tidyverse/tidyr/issues/1339>.
|
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
|
||||||
|
|
||||||
|
You can also unnest named list-columns, like `df1$y`, into rows.
|
||||||
|
Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix `_id`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
df1 |>
|
||||||
|
unnest_longer(y)
|
||||||
|
```
|
||||||
|
|
||||||
|
If you don't want these ids, you can suppress them with `indices_include = FALSE`.
|
||||||
|
On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
|
||||||
|
You can do this with `indices_include = TRUE`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
df2 |>
|
||||||
|
unnest_longer(y, indices_include = TRUE)
|
||||||
|
```
|
||||||
|
|
||||||
### Inconsistent types
|
### Inconsistent types
|
||||||
|
|
||||||
What happens if you attempt to unnest a column that doesn't contain only one type of thing.
|
What happens if you unnest a list-column that contains different types of vector?
|
||||||
For example, what happens if we take this data set and unnest into rows?
|
For example, take the following dataset where the list-column `y` contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.
|
||||||
`y` will contain two numbers, a factor, a logical, which can't normally be mixed in a single column:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df4 <- tribble(
|
df4 <- tribble(
|
||||||
|
@ -309,25 +301,27 @@ df4 <- tribble(
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
An important invariant for `unnest_longer()` is that the columns say the same but the number of rows change.
|
`unnest_longer()` always keeps the set of columns unchanged, while changing the number of rows.
|
||||||
So what happens?
|
So what happens?
|
||||||
How does `unnest_longer()` produce five rows while keeping everything in `y`?
|
How does `unnest_longer()` produce five rows while keeping everything in `y`?
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df4 |> unnest_longer(y)
|
df4 |>
|
||||||
|
unnest_longer(y)
|
||||||
```
|
```
|
||||||
|
|
||||||
We still get a list-column, but every element of the list-column contains a single element.
|
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
|
||||||
When `unnest_longer()` can't find a common type, it keeps the original types by using a list-column.
|
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
|
||||||
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is a still a list, but each component of a list can contain something different.
|
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is still a list, and each component of that list contains something different.
|
||||||
|
|
||||||
What happens if you find this problem in a dataset you're trying to rectangle?
|
What happens if you find this problem in a dataset you're trying to rectangle?
|
||||||
I think there are two basic options.
|
I think there are two basic options.
|
||||||
You could try and coerce to a class that is meaningful for all the rows using the `transform` argument.
|
You could use the `transform` argument to coerce all inputs to a common type.
|
||||||
It's not particularly useful here because there's only really one class that these five classes can be converted to: character.
|
It's not particularly useful here because there's only really one class that these five classes can be converted to: character.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df4 |> unnest_longer(y, transform = as.character)
|
df4 |>
|
||||||
|
unnest_longer(y, transform = as.character)
|
||||||
```
|
```
|
||||||
|
|
||||||
Another option would be to filter down to the rows that have values of a specific type:
|
Another option would be to filter down to the rows that have values of a specific type:
|
||||||
|
@ -351,11 +345,11 @@ df4 |>
|
||||||
|
|
||||||
### Other functions
|
### Other functions
|
||||||
|
|
||||||
There are few other useful rectangling functions that we're not going to talk about here:
|
tidyr has a few other useful rectangling functions that we're not going to cover in this book:
|
||||||
|
|
||||||
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()`based on the structure of the list-column. It's a great for rapid exploration, but I think it's ultimately a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
|
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but I think it's ultimately a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
|
||||||
- `unnest()` modifies rows and columns simultaneously. It's useful when you have a list-column that contains a 2d structure like a data frame (which we often call a nested data frame), which we don't otherwise use in this book.
|
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
|
||||||
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so you should read up on it if there's just a couple of important variables that you want to pull out, embedded in a bunch of data that you don't care about.
|
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so you should read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
|
||||||
|
|
||||||
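Although we won't cover `hoist()` in detail, a small sketch may help you decide whether it's worth reading up on. The data below is invented for illustration (`repos_toy` and its fields are not a real dataset): each row holds a named list, and we pull out just `id` and the `login` nested inside `owner`, leaving everything else in `json`.

```{r}
library(tibble)
library(tidyr)

# Hypothetical data: one named list per row, with a nested `owner` list
repos_toy <- tibble(json = list(
  list(id = 1, owner = list(login = "alice", type = "User"), fork = FALSE),
  list(id = 2, owner = list(login = "bob", type = "User"), fork = TRUE)
))

# Each argument is new_column = path; a list() path reaches through several levels
picked <- repos_toy |>
  hoist(json, id = "id", owner_login = list("owner", "login"))

picked
```

The components you don't ask for stay behind in the `json` list-column, so you can come back for them later.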
### Exercises
|
### Exercises
|
||||||
|
|
||||||
|
@ -375,21 +369,28 @@ There are few other useful rectangling functions that we're not going to talk ab
|
||||||
|
|
||||||
## Case studies
|
## Case studies
|
||||||
|
|
||||||
Now that you understand the basics of `unnest_wider()` and `unnest_longer()` lets use them to tackle some real rectangling challenges.
|
So far you've learned about the simplest case of list-columns, where you need only a single call to `unnest_longer()` or `unnest_wider()`.
|
||||||
These challenges share the common feature that they're mostly just a sequence of multiple `unnest_wider()` and/or `unnest_longer()` calls, with a little dash of dplyr where needed.
|
The main difference between real data and these simple examples is that with real data you'll see multiple levels of nesting.
|
||||||
See `vignette("rectangling", package = "tidyr")` for more.
|
For example, you might see a named list nested inside an unnamed list, or an unnamed list nested inside another unnamed list nested inside a named list.
|
||||||
|
To handle these cases you'll need to chain together multiple calls to `unnest_wider()` and/or `unnest_longer()`.
|
||||||
|
|
||||||
|
This section will work through some real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
|
||||||
|
These challenges share the common feature that they're mostly just a sequence of multiple `unnest_wider()` and/or `unnest_longer()` calls, with a dash of dplyr where needed.
|
||||||
|
|
||||||
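Before turning to real datasets, here is the basic pattern on a toy example. The list `users` is made up for illustration: an unnamed list of "users", each of which is an unnamed list of named "repos". One `unnest_longer()` handles the unnamed level and one `unnest_wider()` handles the named level.

```{r}
library(tibble)
library(tidyr)

# Hypothetical toy data: two users, with two repos and one repo respectively
users <- list(
  list(list(name = "a", stars = 1), list(name = "b", stars = 2)),
  list(list(name = "c", stars = 3))
)

flat <- tibble(json = users) |>
  unnest_longer(json) |> # unnamed level: one row per repo
  unnest_wider(json)     # named level: one column per field

flat
```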
### Very wide data
|
### Very wide data
|
||||||
|
|
||||||
We'll start with `gh_repos` --- this is some data about GitHub repositories retrived from GitHub API. It's a very deeply nested list so it's hard for me to display in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
|
We'll start by exploring `gh_repos`, which contains data about some GitHub repositories retrieved from the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
|
||||||
To make it more manageable I'm going to put it in a tibble in a column called `json` (for reasons we'll get to later)
|
|
||||||
|
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it in a tibble.
|
||||||
|
I call the column `json` for reasons we'll get to later.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
repos <- tibble(json = gh_repos)
|
repos <- tibble(json = gh_repos)
|
||||||
repos
|
repos
|
||||||
```
|
```
|
||||||
|
|
||||||
There are row rows, and each row contains a unnamed list with either 26 or 30 rows.
|
This tibble contains 6 rows, one row for each child of `gh_repos`.
|
||||||
|
Each row contains an unnamed list with either 26 or 30 elements.
|
||||||
Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row:
|
Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -397,7 +398,7 @@ repos |>
|
||||||
unnest_longer(json)
|
unnest_longer(json)
|
||||||
```
|
```
|
||||||
|
|
||||||
At first glance, it might seem like we haven't improved the situation --- while we have more rows now (176 instead of 6) it seems like each element of `json` is still a list.
|
At first glance, it might seem like we haven't improved the situation: while we have more rows (176 instead of 6) each element of `json` is still a list.
|
||||||
However, there's an important difference: now each element is a **named** list so we can use `unnest_wider()` to put each element into its own column:
|
However, there's an important difference: now each element is a **named** list so we can use `unnest_wider()` to put each element into its own column:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -406,7 +407,7 @@ repos |>
|
||||||
unnest_wider(json)
|
unnest_wider(json)
|
||||||
```
|
```
|
||||||
|
|
||||||
This is a bit overwhelming --- there are so many columns that tibble doesn't even print all of them!
|
This has worked but the result is a little overwhelming: there are so many columns that tibble doesn't even print all of them!
|
||||||
We can see them all with `names()`:
|
We can see them all with `names()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -425,7 +426,9 @@ repos |>
|
||||||
select(id, full_name, owner, description)
|
select(id, full_name, owner, description)
|
||||||
```
|
```
|
||||||
|
|
||||||
`owner` is another list-column, and since it contains named list, we can use `unnest_wider()` to get at the values:
|
You can use this to work back to understand `gh_repos`: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.
|
||||||
|
|
||||||
|
`owner` is another list-column, and since it contains a named list, we can use `unnest_wider()` to get at the values:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| error: true
|
#| error: true
|
||||||
|
@ -447,10 +450,13 @@ repos |>
|
||||||
unnest_wider(owner, names_sep = "_")
|
unnest_wider(owner, names_sep = "_")
|
||||||
```
|
```
|
||||||
|
|
||||||
|
This gives another wide dataset, but you can see that `owner` appears to contain a lot of additional data about the person who "owns" the repository.
|
||||||
|
|
||||||
### Relational data
|
### Relational data
|
||||||
|
|
||||||
When you get nested data, it's not uncommon for it to contain data that we'd normally spread out into multiple data frames.
|
When you get nested data, it's not uncommon for it to contain data that we'd normally spread out into multiple data frames.
|
||||||
Take `got_chars`
|
Take `got_chars`, for example.
|
||||||
|
Like `gh_repos` it's a list, so we start by turning it into a list-column of a tibble:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
chars <- tibble(json = got_chars)
|
chars <- tibble(json = got_chars)
|
||||||
|
@ -481,7 +487,8 @@ chars |>
|
||||||
select(id, where(is.list))
|
select(id, where(is.list))
|
||||||
```
|
```
|
||||||
|
|
||||||
Lets explore a couple, starting with `titles`:
|
Let's explore the `titles` column.
|
||||||
|
It's an unnamed list-column, so we'll unnest it into rows:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
chars |>
|
chars |>
|
||||||
|
@ -490,7 +497,8 @@ chars |>
|
||||||
unnest_longer(titles)
|
unnest_longer(titles)
|
||||||
```
|
```
|
||||||
|
|
||||||
You might expect to see this in its own table:
|
You might expect to see this data in its own table because you could then join back to the characters data as needed.
|
||||||
|
To make this table I'll do a little cleaning: removing the rows that contain empty strings and renaming `titles` to `title` since each row now only contains a single title.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
titles <- chars |>
|
titles <- chars |>
|
||||||
|
@ -502,23 +510,24 @@ titles <- chars |>
|
||||||
titles
|
titles
|
||||||
```
|
```
|
||||||
|
|
||||||
Because you could then join it on as needed.
|
Now, for example, we could use this table to find all the characters that are captains and see all their titles:
|
||||||
For example, we find all the characters that are captains:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
captains <- titles |> filter(str_detect(title, "Captain"))
|
captains <- titles |> filter(str_detect(title, "Captain"))
|
||||||
captains
|
captains
|
||||||
|
|
||||||
characters |>
|
characters |>
|
||||||
semi_join(captains)
|
semi_join(captains) |>
|
||||||
|
select(id, name) |>
|
||||||
|
left_join(titles)
|
||||||
```
|
```
|
||||||
|
|
||||||
You could imagine creating a table like this for each of the list-columns, and then using joins to combine when needed.
|
You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
|
||||||
|
|
||||||
### A dash of text analysis
|
### A dash of text analysis
|
||||||
|
|
||||||
What if we wanted to find the most common words in the title?
|
What if we wanted to find the most common words in the title?
|
||||||
There are plenty of sophisticated ways to do this, but one simple way starts by breaking each element of `title` up into words by spitting on `" "`:
|
There are plenty of sophisticated ways to do this, but one simple way starts by using `str_split()` to break each element of `title` up into words by splitting on `" "`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
titles |>
|
titles |>
|
||||||
|
@ -530,7 +539,7 @@ This creates a unnamed variable length list-column, so we can use `unnest_longer
|
||||||
```{r}
|
```{r}
|
||||||
titles |>
|
titles |>
|
||||||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||||
unnest_longer(word)
|
unnest_longer(word)
|
||||||
```
|
```
|
||||||
|
|
||||||
And then we can count that column to find the most common:
|
And then we can count that column to find the most common:
|
||||||
|
@ -542,13 +551,30 @@ titles |>
|
||||||
count(word, sort = TRUE)
|
count(word, sort = TRUE)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Some of those words are not very interesting so we could create a list of common words to drop.
|
||||||
|
In text analysis these are commonly called stop words.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
stop_words <- tribble(
|
||||||
|
~ word,
|
||||||
|
"of",
|
||||||
|
"the"
|
||||||
|
)
|
||||||
|
|
||||||
|
titles |>
|
||||||
|
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||||
|
unnest_longer(word) |>
|
||||||
|
anti_join(stop_words) |>
|
||||||
|
count(word, sort = TRUE)
|
||||||
|
```
|
||||||
|
|
||||||
Breaking up text into individual fragments is a powerful idea that underlies much of text analysis.
|
Breaking up text into individual fragments is a powerful idea that underlies much of text analysis.
|
||||||
For more, I'd recommend reading [Text Mining with R](https://www.tidytextmining.com).
|
If this sounds interesting, I'd recommend reading [Text Mining with R](https://www.tidytextmining.com) by Julia Silge and David Robinson.
|
||||||
|
|
||||||
### Deeply nested
|
### Deeply nested
|
||||||
|
|
||||||
We'll finish off with an that is very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
|
We'll finish off these case studies with a list-column that's very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
|
||||||
This is a two column tibble containing five cities names and the results of using Google's [geocoding API](https://developers.google.com/maps/documentation/geocoding) to determine their location:
|
This is a two-column tibble containing five city names and the results of using Google's [geocoding API](https://developers.google.com/maps/documentation/geocoding) to determine their location:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
gmaps_cities
|
gmaps_cities
|
||||||
|
@ -561,11 +587,9 @@ gmaps_cities |>
|
||||||
unnest_wider(json)
|
unnest_wider(json)
|
||||||
```
|
```
|
||||||
|
|
||||||
This gives us a status column and the actual results.
|
This gives us the `status` and the `results`.
|
||||||
We'll drop the status column since they're all `OK`.
|
We'll drop the status column since they're all `OK`; in a real analysis, you'd also want to capture all the rows where `status != "OK"` and figure out what went wrong.
|
||||||
In a real analysis, you'd also want separately capture all the rows where `status != "OK"` so you could figure out what went wrong.
|
`results` is an unnamed list, with either one or two elements (we'll see why shortly) so we'll unnest it into rows:
|
||||||
`results` is an unnamed list, with either one or two elements.
|
|
||||||
We'll figure to out why shortly.
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
gmaps_cities |>
|
gmaps_cities |>
|
||||||
|
@ -574,7 +598,7 @@ gmaps_cities |>
|
||||||
unnest_longer(results)
|
unnest_longer(results)
|
||||||
```
|
```
|
||||||
|
|
||||||
Now results is a named list, so we'll `unnest_wider()`:
|
Now `results` is a named list, so we'll use `unnest_wider()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
locations <- gmaps_cities |>
|
locations <- gmaps_cities |>
|
||||||
|
@ -585,10 +609,10 @@ locations <- gmaps_cities |>
|
||||||
locations
|
locations
|
||||||
```
|
```
|
||||||
|
|
||||||
Now we can see why Washington and Arlington got two results: Washington matched both the state and the city (DC), and Arlington matched Arlington Virginia and Arlington Texas.
|
Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.
|
||||||
|
|
||||||
There are a few different places we could go from here.
|
There are a few different places we could go from here.
|
||||||
We might want to determine the exact location of the match stored in the `geometry` list-column:
|
We might want to determine the exact location of the match, which is stored in the `geometry` list-column:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
locations |>
|
locations |>
|
||||||
|
@ -628,9 +652,9 @@ locations |>
|
||||||
unnest_wider(c(ne, sw), names_sep = "_")
|
unnest_wider(c(ne, sw), names_sep = "_")
|
||||||
```
|
```
|
||||||
|
|
||||||
Note that I take advantage of the fact that you can unnest multiple columns at a time by supplying a vector of variable names to `unnest_wider()`.
|
Note that I unnest the two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.
|
||||||
|
|
||||||
This one place where `hoist()`, which we mentioned briefly above can be useful.
|
This is one place where `hoist()`, mentioned briefly above, can be useful.
|
||||||
Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
|
Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -645,13 +669,18 @@ locations |>
|
||||||
)
|
)
|
||||||
```
|
```
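
To make the `hoist()` idea concrete without relying on `gmaps_cities`, here is a minimal sketch on a made-up tibble. The `toy` data, its column names, and the coordinate values are all hypothetical; only the `hoist()` call itself reflects the tidyr function discussed above, with each path given as a list of names to follow down into the list-column.

```r
library(tidyr)
library(tibble)

# Hypothetical data: each row's `geometry` is a named list
# that holds a nested `location` list with `lat` and `lng`.
toy <- tibble(
  city = c("A", "B"),
  geometry = list(
    list(location = list(lat = 1.5, lng = 2.5)),
    list(location = list(lat = 3.5, lng = 4.5))
  )
)

# Pull the two deeply nested values straight out into
# top-level columns, one path per new column.
out <- toy |>
  hoist(
    geometry,
    lat = list("location", "lat"),
    lng = list("location", "lng")
  )

out
```

Compared with repeated calls to `unnest_wider()`, this only extracts the components you name, which is why it helps to explore the structure first and switch to `hoist()` once you know the paths.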
|
||||||
|
|
||||||
|
If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in `vignette("rectangling", package = "tidyr")`.
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. The `owner` column of `gh_repo` contains a lot of duplicated information because each owner can have many repos.
|
1. Roughly estimate when `gh_repos` was created.
|
||||||
|
Why can you only roughly estimate the date?
|
||||||
|
|
||||||
|
2. The `owner` column of `gh_repos` contains a lot of duplicated information because each owner can have many repos.
|
||||||
Can you construct an `owners` data frame that contains one row for each owner?
|
Can you construct an `owners` data frame that contains one row for each owner?
|
||||||
(Hint: does `distinct()` work with `list-cols`?)
|
(Hint: does `distinct()` work with `list-cols`?)
|
||||||
|
|
||||||
2. Explain the following code.
|
3. Explain the following code line-by-line.
|
||||||
Why is it interesting?
|
Why is it interesting?
|
||||||
Why does it work for this dataset but might not work in general?
|
Why does it work for this dataset but might not work in general?
|
||||||
|
|
||||||
|
@ -659,10 +688,20 @@ locations |>
|
||||||
tibble(json = got_chars) |>
|
tibble(json = got_chars) |>
|
||||||
unnest_wider(json) |>
|
unnest_wider(json) |>
|
||||||
select(id, where(is.list)) %>%
|
select(id, where(is.list)) %>%
|
||||||
pivot_longer(where(is.list), names_to = "media", values_to = "value") %>%
|
pivot_longer(
|
||||||
|
where(is.list),
|
||||||
|
names_to = "name",
|
||||||
|
values_to = "value"
|
||||||
|
) %>%
|
||||||
unnest_longer(value)
|
unnest_longer(value)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
4. In `gmaps_cities`, what does `address_components` contain?
|
||||||
|
Why does the length vary between rows?
|
||||||
|
Unnest it appropriately to figure it out.
|
||||||
|
(Hint: `types` always appears to contain two elements. Does `unnest_wider()` make it easier to work with than `unnest_longer()`?)
|
||||||
|
.
|
||||||
|
|
||||||
## JSON
|
## JSON
|
||||||
|
|
||||||
All of the case studies in the previous section came originally as JSON, one of the most common sources of hierarchical data.
|
All of the case studies in the previous section came originally as JSON, one of the most common sources of hierarchical data.
|
||||||
|
|