From bd67dc7a626b7ea298946baa5f11a09ff73e2c72 Mon Sep 17 00:00:00 2001
From: Hadley Wickham <h.wickham@gmail.com>
Date: Thu, 26 Jan 2023 11:11:55 -0600
Subject: [PATCH] Streamline hierarchical data

---
 rectangling.qmd | 165 ++++++++++--------------------------------------
 1 file changed, 33 insertions(+), 132 deletions(-)

diff --git a/rectangling.qmd b/rectangling.qmd
index 28552f2..9a2fa92 100644
--- a/rectangling.qmd
+++ b/rectangling.qmd
@@ -164,17 +164,7 @@ In this chapter, we'll focus on unnesting list-columns out into regular variable
 
 The default print method just displays a rough summary of the contents.
 The list column could be arbitrarily complex, so there's no good way to print it.
-If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you've learned above:
-
-```{r}
-df |> 
-  filter(x == 1) |> 
-  pull(z) |> 
-  str()
-```
-
-Similarly, if you `View()` a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns.
-To explore those fields you'll need to `pull()` and view, e.g. `df |> pull(z) |> View()`.
+If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you've learned above, like `df |> pull(z) |> str()` or `df |> pull(z) |> View()`.
 
 ::: callout-note
 ## Base R
@@ -250,14 +240,6 @@ df1 |>
   unnest_wider(y, names_sep = "_")
 ```
 
-We can also use `unnest_wider()` with unnamed list-columns, as in `df2`.
-Since columns require names but the list lacks them, `unnest_wider()` will label them with consecutive integers:
-
-```{r}
-df2 |> 
-  unnest_wider(y, names_sep = "_")
-```
-
 You'll notice that `unnest_wider()`, much like `pivot_wider()`, turns implicit missing values in to explicit missing values.
 
 ### `unnest_longer()`
@@ -283,24 +265,7 @@ df6 |> unnest_longer(y)
 ```
 
 We get zero rows in the output, so the row effectively disappears.
-Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
-
-You can also unnest named list-columns, like `df1$y`, into rows.
-Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix `_id`:
-
-```{r}
-df1 |> 
-  unnest_longer(y)
-```
-
-If you don't want these `ids`, you can suppress them with `indices_include = FALSE`.
-On the other hand, sometimes the positions of the elements is meaningful, and even if the elements are unnamed, you might still want to track their indices.
-You can do this with `indices_include = TRUE`:
-
-```{r}
-df2 |> 
-  unnest_longer(y, indices_include = TRUE)
-```
+If you want to preserve that row, adding `NA` in `y`, set `keep_empty = TRUE`.
 
 ### Inconsistent types
 
@@ -310,8 +275,8 @@ For example, take the following dataset where the list-column `y` contains two n
 ```{r}
 df4 <- tribble(
   ~x, ~y,
-  "a", list(1, "a"),
-  "b", list(TRUE, factor("a"), 5)
+  "a", list(1),
+  "b", list("a", TRUE, 5)
 )
 ```
 
@@ -326,37 +291,10 @@ df4 |>
 
 As you can see, the output contains a list-column, but every element of the list-column contains a single element.
 Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
-You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite: every element is a still a list, even though the contents of each element is a different type.
+You might wonder if this breaks the commandment that every element of a column must be the same type.
+It doesn't: every element is a list, even though the contents are of different types.
 
-What happens if you find this problem in a dataset you're trying to rectangle?
-There are two basic options.
-You could use the `transform` argument to coerce all inputs to a common type.
-However, it's not particularly useful here because there's only really one class that these five class can be converted to character.
-
-```{r}
-df4 |> 
-  unnest_longer(y, transform = as.character)
-```
-
-Another option would be to filter down to the rows that have values of a specific type:
-
-```{r}
-df4 |> 
-  unnest_longer(y) |> 
-  filter(map_lgl(y, is.numeric))
-```
-
-Then you can call `unnest_longer()` once more.
-This gives us a rectangular dataset of just the numeric values.
-
-```{r}
-df4 |> 
-  unnest_longer(y) |> 
-  filter(map_lgl(y, is.numeric)) |> 
-  unnest_longer(y)
-```
-
-You'll learn more about `map_lgl()` in @sec-iteration.
+Dealing with inconsistent types is challenging and the details depend on the precise nature of the problem and your goals, but you'll most likely need tools from @sec-iteration.
 
 ### Other functions
 
@@ -370,7 +308,14 @@ These functions are good to know about as you might encounter them when reading
 
 ### Exercises
 
-1.  From time-to-time you encounter data frames with multiple list-columns with aligned values.
+1.  What happens when you use `unnest_wider()` with unnamed list-columns like `df2`?
+    What argument is now necessary?
+
+2.  What happens when you use `unnest_longer()` with named list-columns like `df1`?
+    What additional information do you get in the output?
+    How can you suppress that extra detail?
+
+3.  From time-to-time you encounter data frames with multiple list-columns with aligned values.
     For example, in the following data frame, the values of `y` and `z` are aligned (i.e. `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`).
     What happens if you apply two `unnest_longer()` calls to this data frame?
     How can you preserve the relationship between `x` and `y`?
@@ -387,7 +332,7 @@ These functions are good to know about as you might encounter them when reading
 ## Case studies
 
 The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and/or `unnest_wider()`.
-This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we've encountered in the wild.
+To show that in action, this section works through three real rectangling challenges using datasets from the repurrrsive package.
 
 ### Very wide data
 
@@ -395,7 +340,7 @@ We'll start with `gh_repos`.
 This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; we recommend exploring a little on your own with `View(gh_repos)` before we continue.
 
 `gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
-We call the column `json` for reasons we'll get to later.
+We call this column `json` for reasons we'll get to later.
 
 ```{r}
 repos <- tibble(json = gh_repos)
@@ -431,7 +376,7 @@ repos |>
   head(10)
 ```
 
-Let's select a few that look interesting:
+Let's pull out a few that look interesting:
 
 ```{r}
 repos |> 
@@ -453,10 +398,8 @@ repos |>
   unnest_wider(owner)
 ```
 
-<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
-
 Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
-Rather than following the advice to use `names_repair` (which would also work), we'll instead use `names_sep`:
+As suggested, let's use `names_sep` to resolve the problem:
 
 ```{r}
 repos |> 
@@ -466,12 +409,12 @@ repos |>
   unnest_wider(owner, names_sep = "_")
 ```
 
-This gives another wide dataset, but you can see that `owner` appears to contain a lot of additional data about the person who "owns" the repository.
+This gives another wide dataset, but you can get the sense that `owner` contains a lot of additional data about the person who "owns" the repository.
 
 ### Relational data
 
-Nested data is sometimes used to represent data that we'd usually spread out into multiple data frames.
-For example, take `got_chars` which contains data about characters that appear in Game of Thrones.
+Nested data is sometimes used to represent data that we'd usually spread across multiple data frames.
+For example, take `got_chars`, which contains data about characters that appear in the Game of Thrones books and TV series.
 Like `gh_repos` it's a list, so we start by turning it into a list-column of a tibble:
 
 ```{r}
@@ -495,7 +438,7 @@ characters <- chars |>
 characters
 ```
 
-There are also many list-columns:
+This dataset also contains many list-columns:
 
 ```{r}
 chars |> 
@@ -514,7 +457,7 @@ chars |>
 ```
 
 You might expect to see this data in its own table because it would be easy to join to the characters data as needed.
-To do so, we'll do a little cleaning: removing the rows containing empty strings and renaming `titles` to `title` since each row now only contains a single title.
+Let's do that, which requires a little cleaning: removing the rows containing empty strings and renaming `titles` to `title` since each row now only contains a single title.
 
 ```{r}
 titles <- chars |> 
@@ -539,49 +482,6 @@ characters |>
 
 You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
 
-### A dash of text analysis
-
-Sticking with the same data, what if we wanted to find the most common words in the title?
-One simple approach starts by using `str_split()` to break each element of `title` up into words by splitting on `" "`:
-
-```{r}
-titles |> 
-  mutate(word = str_split(title, " "), .keep = "unused")
-```
-
-This creates an unnamed variable length list-column, so we can use `unnest_longer()`:
-
-```{r}
-titles |> 
-  mutate(word = str_split(title, " "), .keep = "unused") |> 
-  unnest_longer(word)
-```
-
-And then we can count that column to find the most common words:
-
-```{r}
-titles |> 
-  mutate(word = str_split(title, " "), .keep = "unused") |> 
-  unnest_longer(word) |> 
-  count(word, sort = TRUE)
-```
-
-Some of those words are not very interesting so we could create a list of common words to drop.
-In text analysis these are commonly called stop words.
-
-```{r}
-stop_words <- tibble(word = c("of", "the"))
-
-titles |> 
-  mutate(word = str_split(title, " "), .keep = "unused") |> 
-  unnest_longer(word) |> 
-  anti_join(stop_words) |> 
-  count(word, sort = TRUE)
-```
-
-Breaking up text into individual fragments is a powerful idea that underlies much of text analysis.
-If this sounds interesting, a good place to learn more is [Text Mining with R](https://www.tidytextmining.com) by Julia Silge and David Robinson.
-
 ### Deeply nested
 
 We'll finish off these case studies with a list-column that's very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
@@ -670,6 +570,7 @@ This is where `hoist()`, mentioned earlier in the chapter, can be useful.
 Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
 
 ```{r}
+#| results: false
 locations |> 
   select(city, formatted_address, geometry) |> 
   hoist(
@@ -692,7 +593,9 @@ If these case studies have whetted your appetite for more real-life rectangling,
     Can you construct a `owners` data frame that contains one row for each owner?
     (Hint: does `distinct()` work with `list-cols`?)
 
-3.  Explain the following code line-by-line.
+3.  Follow the steps used for `titles` to create similar tables for the aliases, allegiances, books, and TV series for the Game of Thrones characters.
+
+4.  Explain the following code line-by-line.
     Why is it interesting?
     Why does it work for `got_chars` but might not work in general?
 
@@ -709,7 +612,7 @@ If these case studies have whetted your appetite for more real-life rectangling,
       unnest_longer(value)
     ```
 
-4.  In `gmaps_cities`, what does `address_components` contain?
+5.  In `gmaps_cities`, what does `address_components` contain?
     Why does the length vary between rows?
     Unnest it appropriately to figure it out.
     (Hint: `types` always appears to contain two elements. Does `unnest_wider()` make it easier to work with than `unnest_longer()`?)
@@ -743,6 +646,10 @@ An **object** is like a named list, and is written with `{}`.
 The names (keys in JSON terminology) are strings, so must be surrounded by quotes.
 For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
 
+Note that JSON doesn't have any native way to represent dates or date-times, so they're often stored as strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
+Similarly, JSON's rules for representing floating point numbers are a little imprecise, so you'll also sometimes find numbers stored in strings.
+Apply `readr::parse_double()` as needed to get the correct variable type.
+
 ### jsonlite
 
 To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms.
@@ -820,12 +727,6 @@ df |>
   unnest_wider(results)
 ```
 
-### Translation challenges
-
-Since JSON doesn't have any way to represent dates or date-times, they're often stored as ISO8601 date times in strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
-Similarly, JSON's rules for representing floating point numbers in JSON are a little imprecise, so you'll also sometimes find numbers stored in strings.
-Apply `readr::parse_double()` as needed to the get correct variable type.
-
 ### Exercises
 
 1.  Rectangle the `df_col` and `df_row` below.