Improve data tidying

Fixes #1322
Hadley Wickham 2023-03-07 16:10:32 -06:00
parent 810b9f6a3c
commit 424665c929
1 changed file with 6 additions and 4 deletions

@@ -176,9 +176,11 @@ billboard
In this dataset, each observation is a song.
The first three columns (`artist`, `track` and `date.entered`) are variables that describe the song.
-Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week.
+Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week[^data-tidy-1].
Here, the column names are one variable (the `week`) and the cell values are another (the `rank`).
+[^data-tidy-1]: The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 72 weeks after it appears.
To tidy this data, we'll use `pivot_longer()`:
```{r, R.options=list(pillar.print_min = 10)}
@@ -202,9 +204,9 @@ Now let's turn our attention to the resulting, longer data frame.
What happens if a song is in the top 100 for less than 76 weeks?
Take 2 Pac's "Baby Don't Cry", for example.
The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
-These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
+These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-2], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
-[^data-tidy-1]: We'll come back to this idea in @sec-missing-values.
+[^data-tidy-2]: We'll come back to this idea in @sec-missing-values.
```{r}
billboard |>
@@ -216,7 +218,7 @@ billboard |>
)
```
-The number of rows is now much lower, indicating that the rows with `NA`s were dropped.
+The number of rows is now much lower, indicating that many rows with `NA`s were dropped.
You might also wonder what happens if a song is in the top 100 for more than 76 weeks?
We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.
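For readers of this diff who want to see the row-count effect described in the changed text, here is a minimal sketch. It assumes tidyr and tibble are installed and uses a small made-up data frame (`toy`, with hypothetical songs and ranks) rather than the real `billboard` data, so the numbers are illustrative only.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data standing in for billboard:
# two songs, three week columns, some weeks missing.
toy <- tribble(
  ~track,   ~wk1, ~wk2, ~wk3,
  "Song A",   87,   82,   NA,
  "Song B",   91,   NA,   NA
)

# Default behaviour: every song gets one row per week column,
# so the NAs forced by the wide layout are carried along.
long_all <- toy |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

# values_drop_na = TRUE drops the rows whose rank is NA.
long_dropped <- toy |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

nrow(long_all)      # 6 rows: 2 songs x 3 week columns
nrow(long_dropped)  # 3 rows: only the non-missing ranks remain
```

In this toy case every `NA` row is removed, because the only missing values are in the pivoted `rank` column.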