Update rectangling.qmd (#1172)

mcsnowface, PhD 2022-12-06 15:13:20 -07:00 committed by GitHub
parent 4635426ec3
commit ae9680ecd7
1 changed file with 24 additions and 24 deletions

@@ -9,7 +9,7 @@ status("polishing")
## Introduction
In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frames made up of rows and columns.
In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns.
This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.
To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible.
@@ -19,7 +19,7 @@ We'll finish off by talking about JSON, the most frequent source of hierarchical
### Prerequisites
In this chapter we'll use many functions from tidyr, a core member of the tidyverse.
In this chapter, we'll use many functions from tidyr, a core member of the tidyverse.
We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists.
```{r}
@@ -34,8 +34,8 @@ library(jsonlite)
## Lists
So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors.
These vectors are simple because they're homogeneous: every element is the same type.
If you want to store element of different types in the same vector, you'll need a **list**, which you create with `list()`:
These vectors are simple because they're homogeneous: every element is of the same data type.
If you want to store elements of different types in the same vector, you'll need a **list**, which you create with `list()`:
```{r}
x1 <- list(1:4, "a", TRUE)
@@ -138,8 +138,8 @@ knitr::include_graphics("screenshots/View-3.png", dpi = 220)
### List-columns
Lists can also live inside a tibble, where we call them list-columns.
List-columns are useful because they allow you to shoehorn in objects that wouldn't usually belong in a tibble.
In particular, list-columns are are used a lot in the [tidymodels](https://www.tidymodels.org) ecosystem, because they allow you to store things like models or resamples in a data frame.
List-columns are useful because they allow you to place objects in a tibble that wouldn't usually belong in there.
In particular, list-columns are used a lot in the [tidymodels](https://www.tidymodels.org) ecosystem, because they allow you to store things like model outputs or resamples in a data frame.
Here's a simple example of a list-column:
@@ -164,7 +164,7 @@ In this chapter, we'll focus on unnesting list-columns out into regular variable
The default print method just displays a rough summary of the contents.
The list column could be arbitrarily complex, so there's no good way to print it.
If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you learned above:
If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you've learned above:
```{r}
df |>
@@ -194,7 +194,7 @@ data.frame(
)
```
It's easier to use list-columns with tibbles because `tibble()` treats lists like either vectors and the print method has been designed with lists in mind.
It's easier to use list-columns with tibbles because `tibble()` treats lists like vectors and the print method has been designed with lists in mind.
:::
## Unnesting
@@ -315,7 +315,7 @@ df4 <- tribble(
)
```
`unnest_longer()` always keeps the set of columns change, while changing the number of rows.
`unnest_longer()` always keeps the set of columns unchanged, while changing the number of rows.
So what happens?
How does `unnest_longer()` produce five rows while keeping everything in `y`?
@@ -331,7 +331,7 @@ You might wonder if this breaks the commandment that every element of a column m
What happens if you find this problem in a dataset you're trying to rectangle?
There are two basic options.
You could use the `transform` argument to coerce all inputs to a common type.
It's not particularly useful here because there's only really one class that these five class can be converted to character.
However, it's not particularly useful here because there's really only one class that these five classes can be converted to: character.
```{r}
df4 |>
@@ -362,11 +362,11 @@ You'll learn more about `map_lgl()` in @sec-iteration.
tidyr has a few other useful rectangling functions that we're not going to cover in this book:
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's a great for rapid exploration, but ultimately its a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book.
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
These are good to know about when you're reading other people's code or tackling rarer rectangling challenges.
These functions are good to know about as you might encounter them when reading other people's code or tackling rarer rectangling challenges yourself.
### Exercises
@@ -525,7 +525,7 @@ titles <- chars |>
titles
```
Now, for example, we could use this table tofind all the characters that are captains and see all their titles:
Now, for example, we could use this table to find all the characters that are captains and see all their titles:
```{r}
captains <- titles |> filter(str_detect(title, "Captain"))
@@ -541,14 +541,14 @@ You could imagine creating a table like this for each of the list-columns, then
### A dash of text analysis
Sticking with the same data, what if we wanted to find the most common words in the title?
One simple approach starts by using `str_split()` to break each element of `title` up into words by spitting on `" "`:
One simple approach starts by using `str_split()` to break each element of `title` up into words by splitting on `" "`:
```{r}
titles |>
mutate(word = str_split(title, " "), .keep = "unused")
```
This creates a unnamed variable length list-column, so we can use `unnest_longer()`:
This creates an unnamed variable length list-column, so we can use `unnest_longer()`:
```{r}
titles |>
@@ -566,7 +566,7 @@ titles |>
```
Some of those words are not very interesting so we could create a list of common words to drop.
In text analysis these is commonly called stop words.
In text analysis these are commonly called stop words.
```{r}
stop_words <- tibble(word = c("of", "the"))
@@ -598,7 +598,7 @@ gmaps_cities |>
```
This gives us the `status` and the `results`.
We'll drop the status column since they're all `OK`; in a real analysis, you'd also want capture all the rows where `status != "OK"` and figure out what went wrong.
We'll drop the status column since they're all `OK`; in a real analysis, you'd also want to capture all the rows where `status != "OK"` and figure out what went wrong.
`results` is an unnamed list, with either one or two elements (we'll see why shortly) so we'll unnest it into rows:
```{r}
@@ -665,7 +665,7 @@ locations |>
Note how we unnest two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.
This is somewhere that `hoist()`, mentioned earlier in the chapter, can be useful.
This is where `hoist()`, mentioned earlier in the chapter, can be useful.
Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
```{r}
@@ -728,7 +728,7 @@ Four of them are scalars:
- The simplest type is a null (`null`) which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
- A **string** is much like a string in R, but must always use double quotes.
- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support `Inf`, `-Inf`, or `NaN`.
- A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.
JSON's strings, numbers, and booleans are pretty similar to R's character, numeric, and logical vectors.
@@ -760,8 +760,8 @@ gh_users2 <- read_json(gh_users_json())
identical(gh_users, gh_users2)
```
In this book, I'll also use `parse_json()`, since it takes a string containing JSON, which makes it good for generating simple examples.
To get started, here's three simple JSON datasets, starting with a number, then putting a few number in an array, then putting that array in an object:
In this book, we'll also use `parse_json()`, since it takes a string containing JSON, which makes it good for generating simple examples.
To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:
```{r}
str(parse_json('1'))
@@ -790,8 +790,8 @@ df |>
unnest_wider(json)
```
In rarer cases, the JSON consists of a single top-level JSON object, representing one "thing".
In this case, you'll need to kick off the rectangling process by wrapping it a list, before you put it in a tibble.
In rarer cases, the JSON file consists of a single top-level JSON object, representing one "thing".
In this case, you'll need to kick off the rectangling process by wrapping it in a list, before you put it in a tibble.
```{r}
json <- '{
@@ -851,7 +851,7 @@ Apply `readr::parse_double()` as needed to the get correct variable type.
## Summary
In this chapter, you learned what lists are, how you can generate the from JSON files, and how turn them into rectangular data frames.
In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames.
Surprisingly we only need two new functions: `unnest_longer()` to put list elements into rows and `unnest_wider()` to put list elements into columns.
It doesn't matter how deeply nested the list-column is, all you need to do is repeatedly call these two functions.
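
As a minimal sketch of the two functions named in the summary, here is how they might be combined on a small made-up tibble. The names `df`, `info`, `vals`, and `rect` are hypothetical and do not come from the chapter's datasets; `info` is assumed to be a named list-column and `vals` an unnamed one.

```{r}
library(tidyr)
library(tibble)

# Hypothetical toy data: `info` holds named elements, `vals` holds unnamed ones.
df <- tibble(
  id = 1:2,
  info = list(list(a = 1, b = "x"), list(a = 2, b = "y")),
  vals = list(1:2, 3:5)
)

rect <- df |>
  unnest_wider(info) |>  # named elements become the columns `a` and `b`
  unnest_longer(vals)    # unnamed elements become one row each

rect
```

The order of the two calls does not matter much in this sketch; the point is only that named elements go into columns and unnamed elements go into rows.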