Improve data tidying

Fixes #1322
2023-03-07 16:10:32 -06:00 · 2023-03-07 16:10:32 -06:00 · 424665c929
parent 810b9f6a3c
commit 424665c929
1 changed file with 6 additions and 4 deletions
--- a/data-tidy.qmd
+++ b/data-tidy.qmd
@@ -176,9 +176,11 @@ billboard

 In this dataset, each observation is a song.
 The first three columns (`artist`, `track` and `date.entered`) are variables that describe the song.
-Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week.
+Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week[^data-tidy-1].
 Here, the column names are one variable (the `week`) and the cell values are another (the `rank`).

+[^data-tidy-1]: The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 72 weeks after it appears.
+
 To tidy this data, we'll use `pivot_longer()`:

 ```{r, R.options=list(pillar.print_min = 10)}
@@ -202,9 +204,9 @@ Now let's turn our attention to the resulting, longer data frame.
 What happens if a song is in the top 100 for less than 76 weeks?
 Take 2 Pac's "Baby Don't Cry", for example.
 The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
-These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
+These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-2], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:

-[^data-tidy-1]: We'll come back to this idea in @sec-missing-values.
+[^data-tidy-2]: We'll come back to this idea in @sec-missing-values.

 ```{r}
 billboard |> 
@@ -216,7 +218,7 @@ billboard |>
  )
 ```

-The number of rows is now much lower, indicating that the rows with `NA`s were dropped.
+The number of rows is now much lower, indicating that many rows with `NA`s were dropped.

 You might also wonder what happens if a song is in the top 100 for more than 76 weeks?
 We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.