Title change + feedback from O'Reilly

Hadley Wickham 2022-11-23 14:57:55 -06:00
parent 19c89ebf64
commit 31363dc23a
1 changed files with 11 additions and 10 deletions

@@ -1,4 +1,4 @@
# Data rectangling {#sec-rectangling}
# Hierarchical data {#sec-rectangling}
```{r}
#| results: "asis"
@@ -9,7 +9,7 @@ status("polishing")
## Introduction
In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frames made up of rows and columns.
In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns.
This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.
To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible.
@@ -294,7 +294,7 @@ df1 |>
```
If you don't want these `ids`, you can suppress them with `indices_include = FALSE`.
On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
On the other hand, sometimes the positions of the elements are meaningful, and even if the elements are unnamed, you might still want to track their indices.
You can do this with `indices_include = TRUE`:
```{r}
@@ -304,7 +304,7 @@ df2 |>
### Inconsistent types
What happens if you unnest a list-column contains different types of vector?
What happens if you unnest a list-column that contains different types of vector?
For example, take the following dataset where the list-column `y` contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.
```{r}
@@ -346,7 +346,8 @@ df4 |>
filter(map_lgl(y, is.numeric))
```
Then you can call `unnest_longer()` once more:
Then you can call `unnest_longer()` once more.
This gives us a rectangular dataset of just the numeric values.
```{r}
df4 |>
@@ -390,8 +391,8 @@ This section will work through four real rectangling challenges using datasets f
### Very wide data
We'll with `gh_repos`.
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
We'll start with `gh_repos`.
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; we recommend exploring a little on your own with `View(gh_repos)` before we continue.
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
We call the column `json` for reasons we'll get to later.
@@ -469,7 +470,7 @@ This gives another wide dataset, but you can see that `owner` appears to contain
### Relational data
Nested data is sometimes used to represent data that we'd usually spread out into multiple data frames.
For example, take `got_chars`.
For example, take `got_chars`, which contains data about characters that appear in Game of Thrones.
Like `gh_repos` it's a list, so we start by turning it into a list-column of a tibble:
```{r}
@@ -539,7 +540,7 @@ You could imagine creating a table like this for each of the list-columns, then
### A dash of text analysis
What if we wanted to find the most common words in the title?
Sticking with the same data, what if we wanted to find the most common words in the title?
One simple approach starts by using `str_split()` to break each element of `title` up into words by splitting on `" "`:
```{r}
@@ -664,7 +665,7 @@ locations |>
Note how we unnest two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.
This is somewhere that `hoist()`, mentioned briefly above, can be useful.
This is somewhere that `hoist()`, mentioned earlier in the chapter, can be useful.
Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
```{r}