From 31363dc23ad76e0b4aa6f5800ca5243049bd19f6 Mon Sep 17 00:00:00 2001
From: Hadley Wickham
Date: Wed, 23 Nov 2022 14:57:55 -0600
Subject: [PATCH] Title change + feedback from O'Reilly

---
 rectangling.qmd | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/rectangling.qmd b/rectangling.qmd
index 02624d3..caebd9a 100644
--- a/rectangling.qmd
+++ b/rectangling.qmd
@@ -1,4 +1,4 @@
-# Data rectangling {#sec-rectangling}
+# Hierarchical data {#sec-rectangling}
 
 ```{r}
 #| results: "asis"
@@ -9,7 +9,7 @@ status("polishing")
 
 ## Introduction
 
-In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frames made up of rows and columns.
+In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frames made up of rows and columns.
 This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.
 
 To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible.
@@ -294,7 +294,7 @@ df1 |>
 ```
 
 If you don't want these `ids`, you can suppress them with `indices_include = FALSE`.
-On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
+On the other hand, sometimes the positions of the elements is meaningful, and even if the elements are unnamed, you might still want to track their indices.
 You can do this with `indices_include = TRUE`:
 
 ```{r}
@@ -304,7 +304,7 @@ df2 |>
 
 ### Inconsistent types
 
-What happens if you unnest a list-column contains different types of vector?
+What happens if you unnest a list-column that contains different types of vector?
 For example, take the following dataset where the list-column `y` contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.
 
 ```{r}
@@ -346,7 +346,8 @@ df4 |>
   filter(map_lgl(y, is.numeric))
 ```
 
-Then you can call `unnest_longer()` once more:
+Then you can call `unnest_longer()` once more.
+This gives us a rectangular dataset of just the numeric values.
 
 ```{r}
 df4 |>
@@ -390,8 +391,8 @@ This section will work through four real rectangling challenges using datasets f
 
 ### Very wide data
 
-We'll with `gh_repos`.
-This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
+We'll start with `gh_repos`.
+This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; we recommend exploring a little on your own with `View(gh_repos)` before we continue.
 
 `gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
 We call the column `json` for reasons we'll get to later.
@@ -469,7 +470,7 @@ This gives another wide dataset, but you can see that `owner` appears to contain
 ### Relational data
 
 Nested data is sometimes used to represent data that we'd usually spread out into multiple data frames.
-For example, take `got_chars`.
+For example, take `got_chars` which contains data about characters that appear in Game of Thrones.
 Like `gh_repos` it's a list, so we start by turning it into a list-column of a tibble:
 
 ```{r}
@@ -539,7 +540,7 @@ You could imagine creating a table like this for each of the list-columns, then
 
 ### A dash of text analysis
 
-What if we wanted to find the most common words in the title?
+Sticking with the same data, what if we wanted to find the most common words in the title?
 One simple approach starts by using `str_split()` to break each element of `title` up into words by spitting on `" "`:
 
 ```{r}
@@ -664,7 +665,7 @@ locations |>
 
 Note how we unnest two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.
 
-This is somewhere that `hoist()`, mentioned briefly above, can be useful.
+This is somewhere that `hoist()`, mentioned earlier in the chapter, can be useful.
 Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
 
 ```{r}