Polishing rectangling

Hadley Wickham 2022-08-04 07:41:40 -05:00
parent df55cd92fa
commit 226e0061ad
1 changed files with 65 additions and 60 deletions


@ -1,4 +1,4 @@
# Data rectangling {#sec-rectangling}
# Data rectangling {#sec-rectangle-data}
```{r}
#| results: "asis"
@ -704,48 +704,56 @@ If these case studies have whetted your appetite for more real-life rectangling,
## JSON
All of the case studies in the previous section came originally as JSON, one of the most common sources of hierarchical data.
In this section, you'll learn more about JSON and some common problems you might have.
JSON, short for JavaScript Object Notation, is a data format that grew out of the JavaScript programming language and has become an extremely common way of representing data.
All of the case studies in the previous section came from data stored in JSON format.
JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data.
In this section, you'll learn a little more about JSON and how to read it into R; once you've done that you can use the rectangling tools described above to get it into a data frame for further analysis.
``` json
{
"name1": "value1",
"name2": "value2"
}
```
JSON is a simple format designed to be easily read and written by machines (not humans).
JSON has six key data types.
Four of them are scalars, which are similar to atomic vectors in R: there's no way to break them down further.
Two of them are recursive, like R's lists, and can store all other data types.
We'll start with the four scalar types:
Which in R you might represent as:
- The simplest type is `null`, which is equivalent to both `NULL` and `NA` in R. It represents the absence of data.
- Strings are written much like in R, but can only use double quotes, not single quotes.
- Numbers are similar to R's numbers: they can be integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3). JSON doesn't support `Inf`, `-Inf`, or `NaN`.
- Booleans are similar to R's logical vectors, but use `true` and `false` instead of `TRUE` and `FALSE`.
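As a rough illustration of the four scalar types listed above, the sketch below parses a small JSON array with `jsonlite::parse_json()` and checks what each element becomes in R. It assumes the jsonlite package is installed; the array contents and the name `scalars` are made up for this example.

```r
library(jsonlite)

# A JSON array holding one value of each scalar type (illustrative data).
scalars <- parse_json('[null, "a string", 123.45, true]')

# parse_json() does not simplify by default, so each element stays a
# separate list entry that we can inspect one at a time.
is.null(scalars[[1]])       # the JSON null
is.character(scalars[[2]])  # the JSON string
is.numeric(scalars[[3]])    # the JSON number
is.logical(scalars[[4]])    # the JSON boolean
```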
```{r}
list(
name1 = "value1",
name2 = "value2"
)
```
JSON represents more complex data by nesting into arrays and objects.
An array is like an unnamed list in R, and is written with `[]`.
For example, `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
Objects are like a named list in R and are written with `{}`.
For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
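One way to see the array/object distinction from R is to parse one of each and look at the names of the result. This is only a sketch, assuming the jsonlite package is installed; `arr` and `obj` are names chosen for illustration.

```r
library(jsonlite)

# An array becomes an unnamed list; an object becomes a named list.
arr <- parse_json('[null, 1, "string", false]')
obj <- parse_json('{"x": 1, "y": 2}')

names(arr)  # no names, because arrays are purely positional
names(obj)  # the object's keys
obj$x       # elements of an object can be accessed by name
```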
There are six types of things that JSON can represent:
``` json
{
"strings": "are surrounded by double quotes",
"numbers": 123456,
"boolean": [false, true],
"arrays": [1, 2, 3, 4, 5],
"objects": {
"name1": "value1",
"name2": "value2"
},
"null": null
}
```
You'll notice that these types don't include many of the types you've learned about earlier in the book, like factors and date-times.
This is important: typically these data types will be encoded as strings, and you'll need to coerce them to the correct data type.
### jsonlite
Most of the time you won't deal with JSON directly; instead you'll use the jsonlite package, by Jeroen Ooms, to load it into R as a nested list.
We'll focus on two functions from jsonlite.
Most of the time you'll use `read_json()` to read a JSON file from disk, but sometimes you'll also need `parse_json()`, which takes JSON stored in a string in R.
### Data frames
Note that these functions have an important difference to `fromJSON()` --- they set the default value of `simplifyVector = FALSE`.
`fromJSON()` uses `simplifyVector = TRUE`, which attempts to automatically unnest the JSON into a data frame.
This can work well for simple cases[^rectangling-2], but we think you're better off doing the simplification yourself so you know exactly what's happening and can easily handle arbitrarily complicated systems.
[^rectangling-2]: Doing it yourself also means you'll use the standard tidyverse rules for recycling and vector coercion.
There's nothing wrong with jsonlite's rules, but they're different and we don't want to get into the details here.
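To make the difference in defaults concrete, here is a small sketch that parses the same string with both functions. It assumes the jsonlite package is installed; the variable names are illustrative only.

```r
library(jsonlite)

json <- '[1, 2, 3]'

# fromJSON() simplifies by default, so the array collapses to an atomic vector.
simplified <- fromJSON(json)

# parse_json() does not simplify by default, so the array stays a list.
unsimplified <- parse_json(json)

str(simplified)
str(unsimplified)
```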
```{r}
parse_json('[1, 2, 3]')
parse_json('{"x": [1, 2, 3]}')
```
Note that the rectangling approach described above is designed around the most common case where the API returns multiple "things", e.g. multiple pages, or multiple records, or multiple results.
In this case, you just do `tibble(json)` and each element becomes a row.
If the JSON returns a single "thing", then you'll need to do `tibble(json = list(json))` so you start with a data frame containing a single row.
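The two starting points might look like the following sketch. It assumes the jsonlite and tibble packages are installed; the JSON strings and the names `many` and `one` are invented for illustration.

```r
library(jsonlite)
library(tibble)

# Several "things": a top-level array with one element per record.
many <- parse_json('[{"name": "a"}, {"name": "b"}, {"name": "c"}]')

# A single "thing": a top-level object describing one record.
one <- parse_json('{"name": "a", "tags": ["x", "y"]}')

# Each array element becomes a row of the list-column.
df_many <- tibble(json = many)

# Wrapping in list() keeps the whole object together in a single row.
df_one <- tibble(json = list(one))

nrow(df_many)
nrow(df_one)
```

Without the `list()` wrapper, each field of the single object would be treated as its own row, which is rarely what you want.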
### Data types
There isn't a perfect match between JSON's data types and R's data types.
So when reading a JSON file into R, we have to make some assumptions:
- Inside an array, `null` is translated to `NA`, so `[true, null, false]` is translated to `c(TRUE, NA, FALSE)` but `{"x": null}` is translated to `list(x = NULL)`.
- JSON doesn't have any way to represent dates or date-times, so they're normally stored as ISO8601 date times in strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
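Both points can be sketched in a few lines. The example assumes the jsonlite and readr packages are installed; it uses `fromJSON()` for the `null` cases because that is where simplification to `NA` happens, and the `record` object with its `created` and `updated` fields is invented for illustration.

```r
library(jsonlite)

# With simplification, a null inside an array becomes a missing value ...
in_array <- fromJSON('[true, null, false]')

# ... while a null stored as an object's value stays NULL.
in_object <- fromJSON('{"x": null}')

# Dates and date-times arrive as ISO8601 strings and need explicit parsing.
record <- parse_json(
  '{"created": "2022-08-04", "updated": "2022-08-04T07:41:40Z"}'
)
created <- readr::parse_date(record$created)
updated <- readr::parse_datetime(record$updated)

class(created)
class(updated)
```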
JSON doesn't have any 2-dimensional data structures, so how would you represent a data frame?
@ -757,7 +765,7 @@ df <- tribble(
)
```
There are two ways: you can either make a struct of arrays, or an array of structs.
There are two ways: you can either make an object of arrays, or an array of objects:
``` json
{
@ -773,28 +781,25 @@ There are two ways: you can either make an struct of arrays, or an array of stru
]
```
```{r}
df_col <- jsonlite::fromJSON('
{
"x": ["a", "x"],
"y": [10, 3]
}
')
tibble(json = list(df_col)) |>
unnest_wider(json) |>
unnest_longer(everything())
```
### Exercises
```{r}
df_row <- jsonlite::fromJSON(simplifyVector = FALSE, '
[
{"x": "a", "y": 10},
{"x": "x", "y": 3}
]
')
tibble(json = list(df_row)) |>
unnest_longer(json) |>
unnest_wider(json)
```
1. Rectangle the `df_col` and `df_row` below.
They represent the two ways of encoding a data frame in JSON.
Note that we have to wrap it in a `list()` because we have a single "thing" to unnest.
```{r}
json_col <- parse_json('
{
"x": ["a", "x"],
"y": [10, 3]
}
')
json_row <- parse_json('
[
{"x": "a", "y": 10},
{"x": "x", "y": 3}
]
')
df_col <- tibble(json = list(json_col))
df_row <- tibble(json = list(json_row))
```