diff --git a/rectangling.qmd b/rectangling.qmd index 5b09b64..40b3d09 100644 --- a/rectangling.qmd +++ b/rectangling.qmd @@ -10,17 +10,17 @@ status("polishing") ## Introduction In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frames made up of rows and columns. -This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API. +This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web. -To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R. -Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children in rows, and `tidyr::unnest_wider()`, which converts children into columns. -We'll then show you a few case studies, applying these simple function multiple times to solve real problems. +To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible. +Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()` and `tidyr::unnest_wider()`. +We'll then show you a few case studies, applying these simple functions again and again to solve real problems. We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web. ### Prerequisites In this chapter we'll use many functions from tidyr, a core member of the tidyverse. -We'll also use repurrrsive to provide some interesting datasets rectangling practice, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists. +We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists. 
```{r} #| label: setup @@ -33,8 +33,8 @@ library(jsonlite) ## Lists -So far we've used simple vectors like integers, numbers, characters, date-times, and factors. -These vectors are simple because they're homogeneous: every element is same type. +So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. +These vectors are simple because they're homogeneous: every element is the same type. If you want to store element of different types in the same vector, you'll need a **list**, which you create with `list()`: ```{r} @@ -86,16 +86,21 @@ x5 <- list(1, list(2, list(3, list(4, list(5))))) str(x5) ``` -As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1]. -@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting. +As lists get even larger and more complex, `str()` eventually starts to fail, and you'll need to switch to `View()`[^rectangling-1]. +@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-lists. [^rectangling-1]: This is an RStudio feature. ```{r} #| label: fig-view-collapsed #| fig.cap: > -#| The RStudio allows you to interactively explore a complex list. +#| The RStudio view lets you interactively explore a complex list. #| The viewer opens showing only the top level of the list. 
+#| fig.alt: > +#| A screenshot of RStudio showing the list-viewer. It shows the +#| two children of x4: the first child is a double vector and the +#| second child is a list. A rightward facing triangle indicates that the +#| second child itself has children but you can't see them. #| echo: false #| out-width: NULL knitr::include_graphics("screenshots/View-1.png", dpi = 220) @@ -106,6 +111,10 @@ knitr::include_graphics("screenshots/View-1.png", dpi = 220) #| fig.cap: > #| Clicking on the rightward facing triangle expands that component #| of the list so that you can also see its children. +#| fig.alt: > +#| Another screenshot of the list-viewer having expanded the second +#| child of x4. It also has two children, a double vector and another +#| list. #| echo: false #| out-width: NULL knitr::include_graphics("screenshots/View-2.png", dpi = 220) @@ -115,9 +124,12 @@ knitr::include_graphics("screenshots/View-2.png", dpi = 220) #| label: fig-view-expand-2 #| fig.cap: > #| You can repeat this operation as many times as needed to get to the -#| data you're interested in. Note the bottom-right corner: if you click +#| data you're interested in. Note the bottom-left corner: if you click #| an element of the list, RStudio will give you the subsetting code #| needed to access it, in this case `x4[[2]][[2]][[2]]`. +#| fig.alt: > +#| Another screenshot, having expanded the grandchild of x4 to see its +#| two children, again a double vector and a list. 
#| echo: false #| out-width: NULL knitr::include_graphics("screenshots/View-3.png", dpi = 220) @@ -173,11 +185,11 @@ It's possible to put a list in a column of a `data.frame`, but it's a lot fiddli data.frame(x = list(1:3, 3:5)) ``` -You can force `data.frame()` to treat a list as a list of rows by wrapping it in list `I()`, but the result doesn't print particularly usefully: +You can force `data.frame()` to treat a list as a list of rows by wrapping it in `I()`, but the result doesn't print particularly well: ```{r} data.frame( - x = I(list(1:3, 3:5)), + x = I(list(1:2, 3:5)), y = c("1, 2", "3, 4, 5") ) ``` @@ -188,14 +200,12 @@ It's easier to use list-columns with tibbles because `tibble()` treats lists lik ## Unnesting Now that you've learned the basics of lists and list-columns, let's explore how you can turn them back into regular rows and columns. -We'll start with very simple sample data so you can get the basic idea, and then switch to more realistic examples in the next section. +Here we'll use very simple sample data so you can get the basic idea; in the next section we'll switch to real data. List-columns tend to come in two basic forms: named and unnamed. When the children are **named**, they tend to have the same names in every row. -When the children are **unnamed**, the number of elements tends to vary from row-to-row. -The following code creates an example of each. -In `df1`, every element of list-column `y` has two elements named `a` and `b`. -In `df2`, the elements of list-column `y` are unnamed and vary in length. +For example, in `df1`, every element of list-column `y` has two elements named `a` and `b`. +Named list-columns naturally unnest into columns: each named element becomes a new named column. ```{r} df1 <- tribble( @@ -204,6 +214,13 @@ df1 <- tribble( 2, list(a = 21, b = 22), 3, list(a = 31, b = 32), ) +``` + +When the children are **unnamed**, the number of elements tends to vary from row-to-row. 
+For example, in `df2`, the elements of list-column `y` are unnamed and vary in length from one to three. +Unnamed list-columns naturally unnest into rows: you'll get one row for each child. + +```{r} df2 <- tribble( ~x, ~y, @@ -213,9 +230,7 @@ df2 <- tribble( ) ``` -Named list-columns naturally unnest into columns: each named element becomes a new named column. -Unnamed list-columns naturally unnested in to rows: you'll get one row for each child. -tidyr provides two functions for these two case: `unnest_wider()` and `unnest_longer()`. +tidyr provides two functions for these two cases: `unnest_wider()` and `unnest_longer()`. The following sections explain how they work. ### `unnest_wider()` @@ -227,7 +242,7 @@ df1 |> unnest_wider(y) ``` -By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names. +By default, the names of the new columns come exclusively from the names of the list elements, but you can use the `names_sep` argument to request that they combine the column name and the element name. This is useful for disambiguating repeated names. ```{r} @@ -255,7 +270,7 @@ df2 |> ``` Note how `x` is duplicated for each element inside of `y`: we get one row of output for each element inside the list-column. -But what happens if the list-column is empty, as in the following example? +But what happens if one of the elements is empty, as in the following example? ```{r} df6 <- tribble( @@ -270,15 +285,15 @@ df6 |> unnest_longer(y) We get zero rows in the output, so the row effectively disappears. Once is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`. -You can also unnest named list-columns, like `df1$y` into the rows. 
-Because the elements are named, and those names might be useful data, puts them in a new column with the suffix `_id`: +You can also unnest named list-columns, like `df1$y`, into rows. +Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix `_id`: ```{r} df1 |> unnest_longer(y) ``` -If you don't want these `ids`, you can suppress this with `indices_include = FALSE`. +If you don't want these `ids`, you can suppress them with `indices_include = FALSE`. On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns. You can do this with `indices_include = TRUE`: @@ -311,7 +326,7 @@ df4 |> As you can see, the output contains a list-column, but every element of the list-column contains a single element. Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column. -You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is a still a list, and each component of that list contains something different. +You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite: every element is still a list, even though the contents of each element are a different type. What happens if you find this problem in a dataset you're trying to rectangle? There are two basic options. @@ -328,8 +343,7 @@ Another option would be to filter down to the rows that have values of a specifi ```{r} df4 |> unnest_longer(y) |> - rowwise() |> - filter(is.numeric(y)) + filter(map_lgl(y, is.numeric)) ``` Then you can call `unnest_longer()` once more: @@ -337,20 +351,21 @@ Then you can call `unnest_longer()` once more: ```{r} df4 |> unnest_longer(y) |> - rowwise() |> - filter(is.numeric(y)) |> + filter(map_lgl(y, is.numeric)) |> unnest_longer(y) ``` +You'll learn more about `map_lgl()` in @sec-iteration. 
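The `filter(map_lgl(y, is.numeric))` pattern from this hunk can be tried in isolation. The following is only a minimal sketch: the tibble `mixed` and its values are made up for illustration and are not one of the chapter's datasets. It assumes tidyr, dplyr, purrr, and tibble are installed.

```{r}
library(tidyr)
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical data for illustration (not the chapter's df4):
# each row holds an unnamed list whose children have different types.
mixed <- tribble(
  ~x, ~y,
  "a", list(1, "two"),
  "b", list(TRUE, 4.5)
)

# After unnest_longer(), y is still a list-column because the children
# have no common type; each element now holds a single value.
long <- mixed |>
  unnest_longer(y)

# map_lgl() calls is.numeric() on each element of y and returns one
# TRUE/FALSE per row, which is the shape filter() expects.
numeric_only <- long |>
  filter(map_lgl(y, is.numeric)) |>
  unnest_longer(y)

numeric_only
```

Because `map_lgl()` works element by element on the list-column, it does the job that `rowwise()` did in the removed lines without changing the grouping of the data frame.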
+ ### Other functions tidyr has a few other useful rectangling functions that we're not going to cover in this book: - `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's a great for rapid exploration, but ultimately its a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand. -- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book. +- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book. - `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about. -These are good to know about when you're other people's code and for tackling rarer rectangling challenges. +These are good to know about when you're reading other people's code or tackling rarer rectangling challenges. ### Exercises @@ -370,13 +385,12 @@ These are good to know about when you're other people's code and for tackling ra ## Case studies -So far you've learned about the simplest case of list-columns, where rectangling only requires a single call to `unnest_longer()` or `unnest_wider()`. -The main difference between real data and these simple examples is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and `unnest_wider()`. -This section will work through four real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild. 
+The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and/or `unnest_wider()`. +This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we've encountered in the wild. ### Very wide data -We'll start by exploring `gh_repos`. +We'll start with `gh_repos`. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue. `gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble. @@ -389,7 +403,7 @@ repos This tibble contains 6 rows, one row for each child of `gh_repos`. Each row contains a unnamed list with either 26 or 30 rows. -Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row: +Since these are unnamed, we'll start with `unnest_longer()` to put each child in its own row: ```{r} repos |> @@ -437,6 +451,8 @@ repos |> unnest_wider(owner) ``` + + Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame. 
Rather than following the advice to use `names_repair` (which would also work), we'll instead use `names_sep`: @@ -461,14 +477,14 @@ chars <- tibble(json = got_chars) chars ``` -The `json` column contains named values, so we'll start by widening it: +The `json` column contains named elements, so we'll start by widening it: ```{r} chars |> unnest_wider(json) ``` -And selecting a few columns just to make it easier to read: +And selecting a few columns to make it easier to read: ```{r} characters <- chars |> @@ -508,16 +524,15 @@ titles <- chars |> titles ``` -Now, for example, we could use this table to all the characters that are captains and see all their titles: +Now, for example, we could use this table to find all the characters that are captains and see all their titles: ```{r} captains <- titles |> filter(str_detect(title, "Captain")) captains characters |> - semi_join(captains, by = "id") |> select(id, name) |> - left_join(titles, by = "id", multiple = "all") + inner_join(titles, by = "id", multiple = "all") ``` You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it. @@ -540,7 +555,7 @@ titles |> unnest_longer(word) ``` -And then we can count that column to find the most common: +And then we can count that column to find the most common words: ```{r} titles |> @@ -680,6 +695,7 @@ If these case studies have whetted your appetite for more real-life rectangling, Why does it work for `got_chars` but might not work in general? ```{r} + #| results: false tibble(json = got_chars) |> unnest_wider(json) |> select(id, where(is.list)) %>% @@ -699,7 +715,7 @@ If these case studies have whetted your appetite for more real-life rectangling, ## JSON -All of the case studies in the previous section were sourced from wild-caught JSON files. +All of the case studies in the previous section were sourced from wild-caught JSON. 
JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data. It's important to understand it because while JSON and R's data types are pretty similar, there isn't a perfect 1-to-1 mapping, so it's good to understand a bit about JSON if things go wrong. @@ -709,27 +725,28 @@ JSON is a simple format designed to be easily read and written by machines, not It has six key data types. Four of them are scalars: -- The simplest type is a null, which is written `null`, which plays the same role as both `NULL` and `NA` in R. It represents the absence of data. -- A **string** is much like a string in R, but must use double quotes, not single quotes. -- A **number** is similar to R's numbers: they can be integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN. -- A **boolean** is similar to R's `TRUE` and `FALSE`, but use lower case `true` and `false`. +- The simplest type is a null (`null`) which plays the same role as both `NULL` and `NA` in R. It represents the absence of data. +- A **string** is much like a string in R, but must always use double quotes. +- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN. +- A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`. JSON's strings, numbers, and booleans are pretty similar to R's character, numeric, and logical vectors. The main difference is that JSON's scalars can only represent a single value. -To represent multiple values you need to use one of the two remaining types, arrays and objects. +To represent multiple values you need to use one of the two remaining types: arrays and objects. Both arrays and objects are similar to lists in R; the difference is whether or not they're named. An **array** is like an unnamed list, and is written with `[]`. 
For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean. -An **object** is like a named list, and it's written with `{}`. +An **object** is like a named list, and is written with `{}`. +The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2. ### jsonlite -To convert JSON into R data structures, we recommend that you use the jsonlite package, by Jeroen Oooms. +To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. We'll use only two jsonlite functions: `read_json()` and `parse_json()`. In real life, you'll use `read_json()` to read a JSON file from disk. -For example, the repurrsive package also provides the source for `gh_user` as a JSON file: +For example, the repurrrsive package also provides the source for `gh_user` as a JSON file and you can read it with `read_json()`: ```{r} # A path to a json file inside the package: @@ -767,6 +784,7 @@ json <- '[ ]' df <- tibble(json = parse_json(json)) df + df |> unnest_wider(json) ``` @@ -785,6 +803,7 @@ json <- '{ ' df <- tibble(json = list(parse_json(json))) df + df |> unnest_wider(json) |> unnest_longer(results) |> @@ -828,3 +847,13 @@ Apply `readr::parse_double()` as needed to the get correct variable type. df_col <- tibble(json = list(json_col)) df_row <- tibble(json = json_row) ``` + +## Summary + +In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. +Surprisingly we only need two new functions: `unnest_longer()` to put list elements into rows and `unnest_wider()` to put list elements into columns. +It doesn't matter how deeply nested the list-column is, all you need to do is repeatedly call these two functions. + +JSON is the most common data format returned by web APIs. 
+What happens if the website doesn't have an API, but you can see data you want on the website? +That's the topic of the next chapter: web scraping, extracting data from HTML webpages.
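As a compact recap of the workflow the summary describes, the following is a minimal sketch of going from a JSON string to a rectangular tibble. The JSON text and the names `people` and `people_long` are made up for illustration and do not come from the chapter. It assumes tidyr, tibble, and jsonlite are installed.

```{r}
library(tidyr)
library(tibble)
library(jsonlite)

# A made-up JSON document, used only for illustration.
json <- '[
  {"name": "Ada", "langs": ["R", "SQL"]},
  {"name": "Bo", "langs": ["Python"]}
]'

# parse_json() returns an unnamed list with one named list per JSON object,
# so the tibble gets one row per object.
people <- tibble(json = parse_json(json))

people_long <- people |>
  unnest_wider(json) |>   # named children become columns: name, langs
  unnest_longer(langs)    # unnamed children become one row per language

people_long
```

The order of the two calls follows the structure of the data: the outer objects are named, so they are widened first, and the inner arrays are unnamed and vary in length, so they are lengthened second.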