Replace with realistic data for "how does `pivot_*()` work" (#1357)

Mine Cetinkaya-Rundel 2023-03-10 17:31:07 -05:00 committed by GitHub
parent 2f1b978bea
commit acc6f5a79b
7 changed files with 45 additions and 42 deletions

@ -265,25 +265,26 @@ billboard_longer |>
Now that you've seen how we can use pivoting to reshape our data, let's take a little time to gain some intuition about what pivoting does to the data.
Let's start with a very simple dataset to make it easier to see what's happening.
We'll create it with `tribble()`, a handy function for creating small tibbles by hand:
Suppose we have three patients with `id`s A, B, and C, and we take two blood pressure measurements on each patient.
We'll create the data with `tribble()`, a handy function for constructing small tibbles by hand:
```{r}
df <- tribble(
~var, ~col1, ~col2,
"A", 1, 2,
"B", 3, 4,
"C", 5, 6
~id, ~bp1, ~bp2,
"A", 100, 120,
"B", 140, 115,
"C", 120, 125
)
```
We want out new dataset to have three variables: `var` (already exists), `name` (the column names), and `value` (the cell values).
We want our new dataset to have three variables: `id` (already exists), `measurement` (the column names), and `value` (the cell values).
So we can tidy `df` with:
```{r}
df |>
pivot_longer(
cols = col1:col2,
names_to = "name",
cols = bp1:bp2,
names_to = "measurement",
values_to = "value"
)
```
@ -300,9 +301,9 @@ As shown in @fig-pivot-variables, the values in column that was already a variab
#| each column that is pivoted.
#| fig-alt: >
#| A diagram showing how `pivot_longer()` transforms a simple
#| dataset, using color to highlight how the values in the `var` column
#| dataset, using color to highlight how the values in the `id` column
#| ("A", "B", "C") are each repeated twice in the output because there are
#| two columns being pivotted ("col1" and "col2").
#| two columns being pivoted ("bp1" and "bp2").
knitr::include_graphics("diagrams/tidy-data/variables.png", dpi = 270)
```
@ -318,8 +319,8 @@ They need to be repeated once for each row in the original dataset.
#| values need to be repeated once for each row of the original dataset.
#| fig-alt: >
#| A diagram showing how `pivot_longer()` transforms a simple
#| data set, using color to highlight how column names ("col1" and
#| "col2") become the values in a new `var` column. They are repeated
#| data set, using color to highlight how column names ("bp1" and
#| "bp2") become the values in a new `measurement` column. They are repeated
#| three times because there were three rows in the input.
knitr::include_graphics("diagrams/tidy-data/column-names.png", dpi = 270)
@ -337,10 +338,10 @@ They are unwound row by row.
#| row-by-row.
#| fig-alt: >
#| A diagram showing how `pivot_longer()` transforms data,
#| using color to highlight how the cell values (the numbers 1 to 6)
#| using color to highlight how the cell values (blood pressure measurements)
#| become the values in a new `value` column. They are unwound row-by-row,
#| so the original rows (1,2), then (3,4), then (5,6), become a column
#| running from 1 to 6.
#| so the original rows (100,120), then (140,115), then (120,125), become
#| a column running from 100 to 125.
knitr::include_graphics("diagrams/tidy-data/cell-values.png", dpi = 270)
```
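The three steps illustrated above (repeating the existing variable, repeating the column names, and unwinding the cell values row by row) can be sketched by hand. This is only an illustration of the idea, not how `pivot_longer()` is implemented; the names `pivot_cols`, `by_hand`, and `with_tidyr` are made up for this example.

```{r}
library(tidyr)
library(tibble)

df <- tribble(
  ~id, ~bp1, ~bp2,
  "A", 100, 120,
  "B", 140, 115,
  "C", 120, 125
)

# Hypothetical helper: the columns we want to pivot
pivot_cols <- c("bp1", "bp2")

by_hand <- tibble(
  # Existing variable: each id is repeated once per pivoted column
  id = rep(df$id, each = length(pivot_cols)),
  # Column names: the full set is repeated once per input row
  measurement = rep(pivot_cols, times = nrow(df)),
  # Cell values: transpose so that as.vector() unwinds them row by row
  value = as.vector(t(as.matrix(df[pivot_cols])))
)

with_tidyr <- df |>
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )

# Compare the hand-built result with the tidyr result
all.equal(as.data.frame(by_hand), as.data.frame(with_tidyr))
```

In practice you would always call `pivot_longer()` directly; the hand-built version is only useful for seeing where each output column comes from.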
@ -493,35 +494,36 @@ This gives us the output that we're looking for.
### How does `pivot_wider()` work?
To understand how `pivot_wider()` works, let's again start with a very simple dataset:
To understand how `pivot_wider()` works, let's again start with a very simple dataset.
This time we have two patients with `id`s A and B, with three blood pressure measurements on patient A and two on patient B:
```{r}
df <- tribble(
~id, ~name, ~value,
"A", "x", 1,
"B", "y", 2,
"B", "x", 3,
"A", "y", 4,
"A", "z", 5,
~id, ~measurement, ~value,
"A", "bp1", 100,
"B", "bp1", 140,
"B", "bp2", 115,
"A", "bp2", 120,
"A", "bp3", 105
)
```
We'll take the values from the `value` column and the names from the `name` column:
We'll take the values from the `value` column and the names from the `measurement` column:
```{r}
df |>
pivot_wider(
names_from = name,
names_from = measurement,
values_from = value
)
```
To begin the process, `pivot_wider()` first needs to figure out what will go in the rows and columns.
Finding the new column names is easy: it's just the unique values of `name`.
The new column names will be the unique values of `measurement`.
```{r}
df |>
distinct(name) |>
distinct(measurement) |>
pull()
```
@ -531,7 +533,7 @@ Here there is only one column, but in general there can be any number.
```{r}
df |>
select(-name, -value) |>
select(-measurement, -value) |>
distinct()
```
@ -539,45 +541,46 @@ df |>
```{r}
df |>
select(-name, -value) |>
select(-measurement, -value) |>
distinct() |>
mutate(bp1 = NA, bp2 = NA, bp3 = NA)
```
It then fills in all the missing values using the data in the input.
In this case, not every cell in the output has corresponding value in the input as there's no entry for id "B" and name "z", so that cell remains missing.
In this case, not every cell in the output has a corresponding value in the input as there's no third blood pressure measurement for patient B, so that cell remains missing.
We'll come back to this idea that `pivot_wider()` can "make" missing values in @sec-missing-values.
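To make the process concrete, here is a rough sketch that carries out the same steps by hand for the two-patient data: find the new column names, find the rows, start every new cell as missing, then fill in whatever the input provides. It assumes `id` is the only identifying column, and the names `new_cols` and `by_hand` are invented for this illustration; `pivot_wider()` itself is implemented differently.

```{r}
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "B", "bp1", 140,
  "B", "bp2", 115,
  "A", "bp2", 120,
  "A", "bp3", 105
)

# Step 1: the new column names are the unique values of `measurement`
new_cols <- df |>
  distinct(measurement) |>
  pull()

# Step 2: the rows are the unique values of the remaining columns
by_hand <- df |>
  select(-measurement, -value) |>
  distinct()

# Step 3: start every new column as missing
for (col in new_cols) {
  by_hand[[col]] <- NA_real_
}

# Step 4: fill in each cell that has a matching row in the input
for (i in seq_len(nrow(df))) {
  row <- match(df$id[i], by_hand$id)
  by_hand[[df$measurement[i]]][row] <- df$value[i]
}

by_hand
```

The `bp3` cell for patient B is never touched by the filling loop, which is why it stays `NA`.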
You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output.
The example below has two rows that correspond to id "A" and name "x":
The example below has two rows that correspond to `id` "A" and `measurement` "bp1":
```{r}
df <- tribble(
~id, ~name, ~value,
"A", "x", 1,
"A", "x", 2,
"A", "y", 3,
"B", "x", 4,
"B", "y", 5,
~id, ~measurement, ~value,
"A", "bp1", 100,
"A", "bp1", 102,
"A", "bp2", 120,
"B", "bp1", 140,
"B", "bp2", 115
)
```
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in @sec-rectangling:
```{r}
df |> pivot_wider(
names_from = name,
values_from = value
)
df |>
pivot_wider(
names_from = measurement,
values_from = value
)
```
Since you don't know how to work with this sort of data yet, you'll want to follow the hint in the warning to figure out where the problem is:
```{r}
df |>
group_by(id, name) |>
group_by(id, measurement) |>
summarize(n = n(), .groups = "drop") |>
filter(n > 1L)
filter(n > 1)
```
It's then up to you to figure out what's gone wrong with your data and either repair the underlying damage or use your grouping and summarizing skills to ensure that each combination of row and column values only has a single row.
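As one possible repair, the sketch below collapses the duplicated rows before pivoting by averaging them. Averaging is purely an assumption made for illustration: whether the mean, the first reading, or something else is appropriate depends on why the duplicates exist in your data. The name `wide` is also just for this example.

```{r}
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "A", "bp1", 102,
  "A", "bp2", 120,
  "B", "bp1", 140,
  "B", "bp2", 115
)

wide <- df |>
  # Collapse to one row per id/measurement combination (assumed: the mean)
  group_by(id, measurement) |>
  summarize(value = mean(value), .groups = "drop") |>
  # Now each output cell corresponds to exactly one input row
  pivot_wider(
    names_from = measurement,
    values_from = value
  )

wide
```

`pivot_wider()` also has a `values_fn` argument that applies a summary function to each cell during the pivot, which may be more convenient once you have decided how duplicates should be combined.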

Binary file not shown.

Five further binary image files not shown. Sizes before and after: 56 KiB to 64 KiB; 60 KiB to 65 KiB; 58 KiB to 56 KiB; 42 KiB to 39 KiB; 56 KiB to 62 KiB.