parent
810b9f6a3c
commit
424665c929
|
@@ -176,9 +176,11 @@ billboard
|
||||||
|
|
||||||
In this dataset, each observation is a song.
|
In this dataset, each observation is a song.
|
||||||
The first three columns (`artist`, `track` and `date.entered`) are variables that describe the song.
|
The first three columns (`artist`, `track` and `date.entered`) are variables that describe the song.
|
||||||
Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week.
|
Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week[^data-tidy-1].
|
||||||
Here, the column names are one variable (the `week`) and the cell values are another (the `rank`).
|
Here, the column names are one variable (the `week`) and the cell values are another (the `rank`).
|
||||||
|
|
||||||
|
[^data-tidy-1]: The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 72 weeks after it appears.
|
||||||
|
|
||||||
To tidy this data, we'll use `pivot_longer()`:
|
To tidy this data, we'll use `pivot_longer()`:
|
||||||
|
|
||||||
```{r, R.options=list(pillar.print_min = 10)}
|
```{r, R.options=list(pillar.print_min = 10)}
|
||||||
|
@@ -202,9 +204,9 @@ Now let's turn our attention to the resulting, longer data frame.
|
||||||
What happens if a song is in the top 100 for less than 76 weeks?
|
What happens if a song is in the top 100 for less than 76 weeks?
|
||||||
Take 2 Pac's "Baby Don't Cry", for example.
|
Take 2 Pac's "Baby Don't Cry", for example.
|
||||||
The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
|
The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
|
||||||
These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
|
These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-2], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
|
||||||
|
|
||||||
[^data-tidy-1]: We'll come back to this idea in @sec-missing-values.
|
[^data-tidy-2]: We'll come back to this idea in @sec-missing-values.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
billboard |>
|
billboard |>
|
||||||
|
@@ -216,7 +218,7 @@ billboard |>
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
The number of rows is now much lower, indicating that the rows with `NA`s were dropped.
|
The number of rows is now much lower, indicating that many rows with `NA`s were dropped.
|
||||||
|
|
||||||
You might also wonder what happens if a song is in the top 100 for more than 76 weeks?
|
You might also wonder what happens if a song is in the top 100 for more than 76 weeks?
|
||||||
We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.
|
We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.
|
||||||
|
|
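For reference, the hunks above elide the body of the `pivot_longer()` call. The following is a minimal sketch of what such a call might look like, not the exact code from the commit: the `cols`, `names_to` and `values_to` arguments are assumptions inferred from the prose (the `wk` columns, `week` and `rank`), and `longer` and `tidy` are hypothetical names used only for illustration. It assumes the tidyr package, which ships the `billboard` dataset.

```r
# Assumes tidyr is installed; it provides `billboard` and `pivot_longer()`.
library(tidyr)

# Pivot the wk1-wk76 columns into week/rank pairs, keeping the NAs that the
# wide layout forces for weeks in which a song was not ranked.
# (Argument values are assumptions based on the prose, not the commit.)
longer <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

# Same pivot, but drop the rows whose rank is NA.
tidy <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

# Compare row counts before and after dropping the structural NAs.
nrow(longer)
nrow(tidy)
```

Comparing the two row counts is one way to check the reworded claim that many rows with `NA`s were dropped.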