From 424665c929f8bf8fbb9e2baf126b2175a8dca0e2 Mon Sep 17 00:00:00 2001
From: Hadley Wickham
Date: Tue, 7 Mar 2023 16:10:32 -0600
Subject: [PATCH] Improve data tidying

Fixes #1322
---
 data-tidy.qmd | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/data-tidy.qmd b/data-tidy.qmd
index 70d5015..d2c31b2 100644
--- a/data-tidy.qmd
+++ b/data-tidy.qmd
@@ -176,9 +176,11 @@ billboard
 
 In this dataset, each observation is a song.
 The first three columns (`artist`, `track` and `date.entered`) are variables that describe the song.
-Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week.
+Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week[^data-tidy-1].
 Here, the column names are one variable (the `week`) and the cell values are another (the `rank`).
 
+[^data-tidy-1]: The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 72 weeks after it appears.
+
 To tidy this data, we'll use `pivot_longer()`:
 
 ```{r, R.options=list(pillar.print_min = 10)}
@@ -202,9 +204,9 @@ Now let's turn our attention to the resulting, longer data frame.
 What happens if a song is in the top 100 for less than 76 weeks?
 Take 2 Pac's "Baby Don't Cry", for example.
 The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
-These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
+These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-2], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
 
-[^data-tidy-1]: We'll come back to this idea in @sec-missing-values.
+[^data-tidy-2]: We'll come back to this idea in @sec-missing-values.
 
 ```{r}
 billboard |>
@@ -216,7 +218,7 @@ billboard |>
   )
 ```
 
-The number of rows is now much lower, indicating that the rows with `NA`s were dropped.
+The number of rows is now much lower, indicating that many rows with `NA`s were dropped.
 
 You might also wonder what happens if a song is in the top 100 for more than 76 weeks?
 We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.