Rectangling polishing

This commit is contained in:
Hadley Wickham 2022-08-11 10:29:28 -05:00
parent b50c84d771
commit b15eecf8b3
1 changed files with 84 additions and 36 deletions


@ -1,4 +1,4 @@
# Data rectangling {#sec-rectangle-data} # Data rectangling {#sec-rectangling}
```{r} ```{r}
#| results: "asis" #| results: "asis"
@ -699,63 +699,111 @@ If these case studies have whetted your appetite for more real-life rectangling,
## JSON ## JSON
All of the case studies in the previous section came from data stored in JSON format. All of the case studies in the previous section were sourced from wild-caught JSON files.
JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data. JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data.
It's important to understand it because while JSON and R are pretty similar, there isn't a perfect 1-to-1 mapping between JSON and R data types. JSON and R's data types are pretty similar, but there isn't a perfect 1-to-1 mapping, so it's good to understand a bit about JSON in case things go wrong.
In this section, you'll learn a little more about JSON and how to read it into R; once you've done that you can use the rectangling tools described above to get it into a data frame for further analysis.
### Data types ### Data types
JSON is a simple format designed to be easily read and written by machines, not humans. JSON is a simple format designed to be easily read and written by machines, not humans.
It has six key data types. It has six key data types.
Four of them are scalars, which are similar to atomic vectors in R: there's no way to break them down further. Four of them are scalars:
- The simplest type is `null`, which plays the same role as both `NULL` and `NA` in R. It represents the absence of data. - The simplest type is a null, written `null`, which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
- **Strings** are written much like in R, but can only use double quotes, not single quotes. - A **string** is much like a string in R, but must use double quotes, not single quotes.
- **Numbers** are similar to R's numbers: they can be integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3). JSON doesn't support Inf, -Inf, or NaN. - A **number** is similar to R's numbers: it can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
- **Booleans** are similar to R's logical vectors, but use `true` and `false` instead of `TRUE` and `FALSE`. - A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.
The biggest difference between JSON's scalars and atomic vectors is that scalars only represent a single item. JSON's strings, numbers, and booleans are pretty similar to R's character, numeric, and logical vectors.
To create a vector of multiple items you need to use one of the two remaining types, **arrays** and **objects**. The main difference is that JSON's scalars can only represent a single value.
An array is like an unnamed list in R, and is written with `[]`. To represent multiple values you need to use one of the two remaining types, arrays and objects.
Both arrays and objects are similar to lists in R; the difference is whether or not they're named.
An **array** is like an unnamed list, and is written with `[]`.
For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean. For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
Objects are like a named list in R, and are written with `{}`. An **object** is like a named list, and is written with `{}`.
For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2. For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
You might already be starting to imagine some of the challenges converting JSON to R data structures.
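As a rough illustration of this mapping, the sketch below parses two small JSON strings with `jsonlite::parse_json()` (covered in the jsonlite section that follows) and inspects the resulting R objects. The JSON strings reuse the examples above; the variable names `arr` and `obj` are made up for illustration.

```r
library(jsonlite)

# A JSON array becomes an unnamed R list, one element per value.
# Note that JSON's null shows up as NULL inside the list.
arr <- parse_json('[null, 1, "string", false]')
str(arr)

# A JSON object becomes a named R list.
obj <- parse_json('{"x": 1, "y": 2}')
str(obj)
```

Because every element ends up in its own list slot, even an array of plain numbers comes back as a list rather than an atomic vector.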
### jsonlite ### jsonlite
Most of the time you won't deal with JSON directly, instead you'll use the jsonlite package, by Jeroen Ooms, to load it into R as a nested list. To convert JSON into R data structures, we recommend that you use the jsonlite package, by Jeroen Ooms.
We'll focus on two functions from jsonlite. We'll use only two jsonlite functions: `read_json()` and `parse_json()`.
Most of the time you'll use `read_json()` to read a json file from disk, but sometimes you'll also need `parse_json()` which takes json stored in a string in R. In real life, you'll use `read_json()` to read a JSON file from disk.
For example, the repurrsive package also provides the source for `gh_users` as a JSON file:
```{r} ```{r}
# A path to a json file inside the package:
gh_users_json()
# Read it with read_json()
gh_users2 <- read_json(gh_users_json())
# Check it's the same as the data we were using previously
identical(gh_users, gh_users2)
```
In this book, we'll also use `parse_json()`, since it takes a string containing JSON, which makes it good for generating simple examples.
To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:
```{r}
str(parse_json('1'))
str(parse_json('[1, 2, 3]')) str(parse_json('[1, 2, 3]'))
str(parse_json('{"x": [1, 2, 3]}')) str(parse_json('{"x": [1, 2, 3]}'))
``` ```
Note that the rectangling approach described above is designed around the most common case where the API returns multiple "things", e.g. multiple pages, or multiple records, or multiple results. jsonlite has another important function called `fromJSON()`.
In this case, you just do `tibble(json)` and each element becomes a row. We don't use it here because it performs automatic simplification (`simplifyVector = TRUE`).
If the JSON returns a single "thing", then you'll need to do `tibble(json = list(json))` so you start with a data frame containing a single row. This often works well, particularly in simple cases, but we think you're better off doing the rectangling yourself so you know exactly what's happening and can more easily handle the most complicated nested structures.
Note that jsonlite has another important function called `fromJSON()`. ### Starting the rectangling process
We don't use it here because it uses `simplifyVector = TRUE` which attempts to automatically unnest the JSON in a data frame.
This often works well, particularly in simple cases. In most cases, JSON files contain a single top-level array, because they're designed to provide data about multiple "things", e.g. multiple pages, or multiple records, or multiple results.
But we think you're better off doing the rectangling yourself so you know exactly what's happening and can more easily handle the most complicated nested structures. In this case, you'll start your rectangling with `tibble(json)` so that each element becomes a row:
Doing it yourself also means you'll use the standard tidyverse rules for recycling and vector coercion: there's nothing wrong with jsonlite's rules, but they're different and we don't want to get into the details here.
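To make the difference concrete, here is a minimal sketch comparing `fromJSON()` with `parse_json()` on the same input. The two-record JSON string and the variable names `simplified` and `nested` are illustrative assumptions, not part of the chapter's data.

```r
library(jsonlite)

json <- '[
  {"name": "John", "age": 34},
  {"name": "Susan", "age": 27}
]'

# fromJSON() simplifies by default: an array of records becomes a data frame.
simplified <- fromJSON(json)
str(simplified)

# parse_json() keeps the nested list structure, leaving the rectangling to you.
nested <- parse_json(json)
str(nested)
```

In a simple case like this the automatic result is convenient; the nested list is the form you'd hand to `tibble()` and the `unnest_*()` functions when you want to control each step.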
```{r}
json <- '[
{"name": "John", "age": 34},
{"name": "Susan", "age": 27}
]'
df <- tibble(json = parse_json(json))
df
df |>
unnest_wider(json)
```
In rarer cases, the JSON consists of a single top-level JSON object, representing one "thing".
In this case, you'll need to kick off the rectangling process by wrapping it in a list before you put it in a tibble.
```{r}
json <- '{
"status": "OK",
"results": [
{"name": "John", "age": 34},
{"name": "Susan", "age": 27}
]
}
'
df <- tibble(json = list(parse_json(json)))
df
df |>
unnest_wider(json) |>
unnest_longer(results) |>
unnest_wider(results)
```
Alternatively, you can reach inside the parsed JSON and start with the bit that you actually care about:
```{r}
df <- tibble(results = parse_json(json)$results)
df |>
unnest_wider(results)
```
### Translation challenges ### Translation challenges
There isn't a perfect match between json's data types and R's data types. Since JSON doesn't have any way to represent dates or date-times, they're often stored as ISO8601 date times in strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
So when reading a json file into R, we have to make some assumptions: Similarly, JSON's rules for representing floating point numbers are a little imprecise, so you'll also sometimes find numbers stored in strings.
Apply `readr::parse_double()` as needed to get the correct variable type.
```{r}
str(parse_json('[1, 2, 3]'))
str(parse_json('[true, false, true]'))
```
JSON doesn't have any way to represent dates or date-times, so they're normally stored as ISO8601 date times in strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
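A small sketch of that conversion, assuming a made-up record in which both a date-time and a number arrive as JSON strings (the field names `created` and `score` are hypothetical):

```r
library(jsonlite)
library(readr)

# Hypothetical record: a date-time and a number, both stored as JSON strings.
record <- parse_json('{"created": "2020-01-02T03:04:05Z", "score": "1.5"}')

# Both fields arrive as character strings, so convert them explicitly.
created <- parse_datetime(record$created)  # ISO8601 string -> POSIXct (UTC)
score <- parse_double(record$score)        # string -> double

str(created)
str(score)
```

In a rectangled data frame you would typically apply the same parsers to whole columns, for example inside `dplyr::mutate()`.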
### Exercises ### Exercises
@ -778,5 +826,5 @@ JSON doesn't have any way to represent dates or date-times, so they're normally
') ')
df_col <- tibble(json = list(json_col)) df_col <- tibble(json = list(json_col))
df_row <- tibble(json = list(json_row)) df_row <- tibble(json = json_row)
``` ```