Polishing tidy data

Hadley Wickham 2022-02-23 17:17:19 -06:00
parent 48acb4b0e6
commit 17b95c131f
1 changed files with 126 additions and 55 deletions

@ -2,13 +2,13 @@
## Introduction
> "Happy families are all alike; every unhappy family is unhappy in its own way." ---- Leo Tolstoy
> "Happy families are all alike; every unhappy family is unhappy in its own way." --- Leo Tolstoy
> "Tidy datasets are all alike, but every messy dataset is messy in its own way." ---- Hadley Wickham
> "Tidy datasets are all alike, but every messy dataset is messy in its own way." --- Hadley Wickham
In this chapter, you will learn a consistent way to organise your data in R, an organisation called **tidy data**.
Getting your data into this format requires some upfront work, but that work pays off in the long term.
Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
In this chapter, you will learn a consistent way to organize your data in R using a system called **tidy data**.
Getting your data into this format requires some work up front, but that work pays off in the long term.
Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.
This chapter will give you a practical introduction to tidy data and the accompanying tools in the **tidyr** package.
If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.
@ -28,7 +28,7 @@ From this chapter on, we'll suppress the loading message from `library(tidyverse
You can represent the same underlying data in multiple ways.
The example below shows the same data organised in four different ways.
Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organises the values in a different way.
Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organizes the values in a different way.
```{r}
table1
@ -42,28 +42,28 @@ table4b # population
These are all representations of the same underlying data, but they are not equally easy to use.
One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
1. Each variable is a column; each column is a variable.
2. Each observation is a row; each row is an observation.
3. Each value is a cell; each cell is a single value.
These three rules are interrelated because typically by fixing one of them you'll fix the other two.
Figure \@ref(fig:tidy-structure) shows the rules visually.
In the example above, only `table1` is tidy.
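To make the three rules concrete, here is a minimal sketch on a small made-up tibble. The object name `tidy_example` and all of its values are invented for illustration and are not taken from `table1`. Because each variable is a column and each observation (one country in one year) is a row, a derived variable is a single vectorised expression.

```{r}
library(tibble)
library(dplyr)

# Hypothetical data: one row per country-year, one column per variable.
tidy_example <- tibble(
  country    = c("A", "A", "B", "B"),
  year       = c(1999L, 2000L, 1999L, 2000L),
  cases      = c(10, 20, 30, 60),
  population = c(1000, 1000, 2000, 3000)
)

# Each variable is a column, so a new variable is one mutate() call.
tidy_example <- tidy_example |>
  mutate(rate = cases / population * 10000)

tidy_example
```

If the years were instead spread across column names, the same calculation would need a separate expression for every year column.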
```{r tidy-structure, echo = FALSE, out.width = "100%", fig.cap = "Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells.", fig.alt = "Three panels, each representing a tidy data frame. The first panel shows that each variable has its own column. The second panel shows that each observation has its own row. The third panel shows that each value has its own cell."}
```{r tidy-structure, echo = FALSE, out.width = "100%"}
#| fig.cap: >
#| Following three rules makes a dataset tidy: variables are columns,
#| observations are rows, and values are cells.
#| fig.alt: >
#| Three panels, each representing a tidy data frame. The first panel
#| shows that each variable is a column. The second panel shows that each
#| observation is a row. The third panel shows that each value is
#| a cell.
knitr::include_graphics("images/tidy-1.png")
```
These three rules are interrelated because it's impossible to only satisfy two of the three.
That interrelationship leads to an even simpler set of practical instructions:
1. Put each dataset in a tibble.
2. Put each variable in a column.
In this example, only `table1` is tidy.
It's the only representation where each column is a variable.
Why ensure that your data is tidy?
There are two main advantages:
@ -77,10 +77,25 @@ There are two main advantages:
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
Here are a couple of small examples showing how you might work with `table1`.
```{r fig.width = 5, fig.alt = "This figure shows the numbers of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale."}
```{r fig.width = 5}
#| fig.alt: >
#| This figure shows the numbers of cases in 1999 and 2000 for
#| Afghanistan, Brazil, and China, with year on the x-axis and number
#| of cases on the y-axis. Each point on the plot represents the number
#| of cases in a given country in a given year. The points for each
#| country are differentiated from others by color and shape and connected
#| with a line, resulting in three, non-parallel, non-intersecting lines.
#| The numbers of cases in China are highest for both 1999 and 2000, with
#| values above 200,000 for both years. The number of cases in Brazil is
#| approximately 40,000 in 1999 and approximately 75,000 in 2000. The
#| numbers of cases in Afghanistan are lowest for both 1999 and 2000, with
#| values that appear to be very close to 0 on this scale.
# Compute rate per 10,000
table1 |>
mutate(rate = cases / population * 10000)
mutate(
rate = cases / population * 10000
)
# Compute cases per year
table1 |>
@ -120,8 +135,8 @@ There are two main reasons:
1. Most people aren't familiar with the principles of tidy data, and it's hard to derive them yourself unless you spend a *lot* of time working with data.
2. Data is often organised to facilitate some use other than analysis.
For example, data is often organised to make entry as easy as possible.
2. Data is often organised to facilitate some goal other than analysis.
For example, data is often organised to make collection as easy as possible.
This means for most real analyses, you'll need to do some tidying.
The first step is always to figure out what the variables and observations are.
@ -132,8 +147,9 @@ The second step is to resolve one of two common problems:
2. One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems; it'll only suffer from both if you're really unlucky!
To fix these problems, you'll need the two most important functions in tidyr: `pivot_longer()` and `pivot_wider()`.
As you might guess from their names, these functions are complements: `pivot_longer()` makes wide tables narrower and longer; `pivot_wider()` makes long tables shorter and wider.
Typically a dataset will only suffer from one of these problems; it'll only suffer from both if you're really unlucky!
### Longer
@ -146,7 +162,18 @@ table4a
And you want to create the following visualisation where each line represents a `country`, `year` is on the x-axis, `cases` are on the y-axis, and you automatically get the legend that indicates which line represents which country.
```{r tidy-pivot-longer-plot-lines, fig.width = 5, echo = FALSE, fig.cap = "Number of cases over the years for each country.", fig.alt = "This figure shows the numbers of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale."}
```{r tidy-pivot-longer-plot-lines, fig.width = 5, echo = FALSE}
#| fig.cap: >
#| Number of cases over the years for each country.
#| fig.alt: >
#| This figure shows the numbers of cases in 1999 and 2000 for
#| Afghanistan, Brazil, and China, with year on the x-axis and number of
#| cases on the y-axis. Each point on the plot represents the number of
#| cases in a given country in a given year. The points for each country
#| are differentiated from others by color and shape and connected with a
#| line, resulting in three, non-parallel, non-intersecting lines. The
#| numbers of cases in China are highest for both 1999 and 2000, with
#| values above 200,000 for both years. The number of cases in Brazil is
#| approximately 40,000 in 1999 and approximately 75,000 in 2000. The
#| numbers of cases in Afghanistan are lowest for both 1999 and 2000, with
#| values that appear to be very close to 0 on this scale.
table4a |>
pivot_longer(
cols = c(`1999`, `2000`),
@ -160,16 +187,11 @@ table4a |>
scale_x_continuous(breaks = c(1999, 2000))
```
It's most straight-forward to do this starting with a data frame where `country`, `year`, and `cases` are the columns and each row represents a record from a country for a particular year.
It's most straightforward to do this starting with a data frame where `country`, `year`, and `cases` are the columns and each row represents a record from a country for a particular year.
Something like the following:
```{r echo = FALSE}
table4a |>
pivot_longer(
cols = c(`1999`, `2000`),
names_to = "year",
values_to = "cases"
) |>
mutate(year = parse_integer(year))
```{r}
table1 |> select(country, year, cases)
```
However in `table4a` the column names `1999` and `2000` represent values of the `year` variable, the values in the `1999` and `2000` columns represent values of the `cases` variable, and each row represents two observations, not one.
@ -195,9 +217,11 @@ table4a |>
)
```
The columns to pivot are specified with `dplyr::select()` style notation in the `cols` argument.
The `cols` argument specifies the columns to pivot using `dplyr::select()` style notation.
Here there are only two columns, so we list them individually.
Note that `1999` and `2000` are non-syntactic names (because they don't start with a letter) so we have to surround them in backticks.
Unfortunately, there's a challenge!
`1999` and `2000` are unusual column names.
Because they don't start with a letter, they're called **non-syntactic** names, and we have to surround them in backticks.
To refresh your memory of the other ways to select columns, see Section \@ref(select).
`year` and `cases` do not exist in `table4a` so we put their names in quotes in `names_to` and `values_to` arguments, respectively.
@ -206,7 +230,15 @@ In the final result, the pivoted columns are dropped, and we get new `year` and
Otherwise, the relationships between the original variables are preserved.
Visually, this is shown in Figure \@ref(fig:tidy-pivot-longer).
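The same pattern can be tried on a tiny standalone example. This is only a sketch: the tibble `wide` and its values are made up for illustration and stand in for `table4a`. It shows the backticks needed for the non-syntactic column names and the quoted names for the new columns.

```{r}
library(tibble)
library(tidyr)

# Hypothetical wide data: the column names `1999` and `2000` are values of year.
wide <- tibble(
  country = c("A", "B"),
  `1999`  = c(10, 30),
  `2000`  = c(20, 60)
)

long <- wide |>
  pivot_longer(
    cols = c(`1999`, `2000`), # existing columns: backticks, no quotes
    names_to = "year",        # new column that receives the old column names
    values_to = "cases"       # new column that receives the old cell values
  )

long
```

Two rows with two year columns become four rows. Note that `year` is a character column at this point, because it was built from column names.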
```{r tidy-pivot-longer, echo = FALSE, out.width = "100%", fig.cap = "Pivoting `table4a` into a \"longer\", tidy form.", fig.alt = "Two panels, one with a longer and the other with a wider data frame. Arrows represent how values in the 1999 and 2000 columns of the wider data frame are pivoted to a column named cases in the longer data frame and how column names from the wider data frame (1999 and 2000) are pivoted into column names in the longer data frame."}
```{r tidy-pivot-longer, echo = FALSE, out.width = "100%"}
#| fig.cap: >
#| Pivoting `table4a` into a "longer", tidy form.
#| fig.alt: >
#| Two panels, one with a longer and the other with a wider data frame.
#| Arrows represent how values in the 1999 and 2000 columns of the wider
#| data frame are pivoted to a column named cases in the longer data frame
#| and how column names from the wider data frame (1999 and 2000) are
#| pivoted into column names in the longer data frame.
knitr::include_graphics("images/tidy-9.png")
```
@ -227,9 +259,9 @@ table4a |>
mutate(year = parse_integer(year))
```
Once we have our data in this longer format, we can create the visualisation that motivated this tidying exercise as follows.
Once we have our data in this longer format, we can create the visualisation that motivated this tidying exercise with the following code.
```{r ref.label = "tidy-pivot-longer-plot-lines", fig.alt = "Number of cases over the years for each country.", fig.alt = "This figure shows the numbers of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale."}
```{r ref.label = "tidy-pivot-longer-plot-lines", fig.show='hide'}
```
`pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns.
@ -302,7 +334,16 @@ table2 |>
pivot_wider(names_from = type, values_from = count)
```
```{r tidy-pivot-wider, echo = FALSE, out.width = "100%", fig.cap = "Pivoting `table2` into a \"wider\", tidy form.", fig.alt = "Two panels, one with a longer and the other with a wider data frame. Arrows represent how values in the count column of the longer data frame are pivoted to two columns named cases and population in the wider data frame as well as how values in the type column of the longer data (cases and population) frame are pivoted into column names in the wider data frame."}
```{r tidy-pivot-wider, echo = FALSE, out.width = "100%"}
#| fig.cap: >
#| Pivoting `table2` into a "wider", tidy form.
#| fig.alt: >
#| Two panels, one with a longer and the other with a wider data frame.
#| Arrows represent how values in the count column of the longer data
#| frame are pivoted to two columns named cases and population in the
#| wider data frame as well as how values in the type column of the longer
#| data (cases and population) frame are pivoted into column names in
#| the wider data frame.
knitr::include_graphics("images/tidy-8.png")
```
@ -313,7 +354,19 @@ Once we have our data in this wider format, we can create the data frame that mo
Earlier we visualised case counts over the years, and this representation can be useful for visualising case rates, for example.
```{r, fig.alt = "This figure shows the case rate in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the case rate in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The case rates in Brazil are highest for both 1999 and 2000; approximately 0.0002 in 1999 and approximately 0.00045 in 2000. The case rates in China are slightly below 0.0002 in both 1999 and 2000. The case rates in Afghanistan are lowest for both 1999 and 2000; pretty close to 0 in 1999 and approximately 0.0001 in 2000."}
```{r}
#| fig.alt: >
#| This figure shows the case rate in 1999 and 2000 for Afghanistan,
#| Brazil, and China, with year on the x-axis and case rate on the
#| y-axis. Each point on the plot represents the case rate in a given
#| country in a given year. The points for each country are differentiated
#| from others by color and shape and connected with a line, resulting in
#| three, non-parallel, non-intersecting lines. The case rates in Brazil
#| are highest for both 1999 and 2000; approximately 0.0002 in 1999 and
#| approximately 0.00045 in 2000. The case rates in China are slightly
#| below 0.0002 in both 1999 and 2000. The case rates in Afghanistan are
#| lowest for both 1999 and 2000; pretty close to 0 in 1999 and
#| approximately 0.0001 in 2000.
table2 |>
pivot_wider(names_from = type, values_from = count) |>
mutate(rate = cases / population) |>
@ -327,13 +380,16 @@ Now let's go one step further and widen the data to record `cases`, `population`
```{r tidy-pivot-even-wider-case-ratio, echo = FALSE}
table2 |>
pivot_wider(names_from = type, values_from = count) |>
pivot_wider(
names_from = type,
values_from = count
) |>
mutate(rate = cases / population) |>
pivot_wider(
names_from = year,
values_from = c(cases, population, rate)
) |>
relocate(country, contains("1999"))
values_from = c(cases, population, rate),
names_vary = "slowest"
)
```
This representation is rarely useful for data analysis but it might be useful as the basis of a table for communication of results in a data analysis report.
@ -341,7 +397,16 @@ This representation is rarely useful for data analysis but it might be useful as
To achieve this we need to add year information in column headings for `cases`, `population`, and `rate` as well as distribute the values that are currently under these three columns into six columns (two columns for each year we have data for).
This is represented in Figure \@ref(fig:tidy-pivot-even-wider).
```{r tidy-pivot-even-wider, echo = FALSE, out.width = "100%", fig.cap = "Pivoting `table2` into an even \"wider\" form. Arrows for `cases` and `rate` values are omitted for clarity.", fig.alt = "Two panels, one with a wider and the other with an even wider data frame. Arrows represent how population values for 1999 and 2000 that are stored in a single column in the wide data frame are spread across two columns in the data frame that is even wider. These new columns are called population_1999 and population_2000."}
```{r tidy-pivot-even-wider, echo = FALSE, out.width = "100%"}
#| fig.cap: >
#| Pivoting `table2` into an even "wider" form. Arrows for `cases` and
#| `rate` values are omitted for clarity.
#| fig.alt: >
#| Two panels, one with a wider and the other with an even wider data
#| frame. Arrows represent how population values for 1999 and 2000 that
#| are stored in a single column in the wide data frame are spread across
#| two columns in the data frame that is even wider. These new columns
#| are called population_1999 and population_2000.
knitr::include_graphics("images/tidy-19.png")
```
@ -350,23 +415,23 @@ The first three lines of the following code chunk is what we've already done in
```{r}
table2 |>
pivot_wider(names_from = type, values_from = count) |>
pivot_wider(
names_from = type,
values_from = count
) |>
mutate(rate = cases / population) |>
pivot_wider(
names_from = year,
values_from = c(cases, population, rate)
values_from = c(cases, population, rate),
names_vary = "slowest"
)
```
The last step for achieving our goal is to relocate columns in the resulting data frame so columns for 1999 data come before those for 2000.
We can use the `relocate()` function to move the 1999 columns ahead of the 2000 columns.
Note the use of `names_vary = "slowest"` to keep the columns for each year together, so that all the 1999 columns come before the 2000 columns.
```{r ref.label = "tidy-pivot-even-wider-case-ratio"}
```
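To see what `names_vary` changes, the sketch below compares the default (`"fastest"`) with `"slowest"` on a small made-up tibble. The data and object names are invented for illustration, and the example assumes tidyr 1.2.0 or later, where the `names_vary` argument is available.

```{r}
library(tibble)
library(tidyr)

# Hypothetical long data: one row per country-year.
long <- tibble(
  country    = c("A", "A", "B", "B"),
  year       = c(1999L, 2000L, 1999L, 2000L),
  cases      = c(10, 20, 30, 60),
  population = c(1000, 1000, 2000, 3000)
)

# Default: the year varies fastest, so columns are grouped by variable.
fastest <- long |>
  pivot_wider(
    names_from = year,
    values_from = c(cases, population)
  )

# "slowest": the year varies slowest, so columns are grouped by year.
slowest <- long |>
  pivot_wider(
    names_from = year,
    values_from = c(cases, population),
    names_vary = "slowest"
  )

names(fastest)
names(slowest)
```

With the default, the columns are ordered `cases_1999`, `cases_2000`, `population_1999`, `population_2000`; with `"slowest"`, they are ordered `cases_1999`, `population_1999`, `cases_2000`, `population_2000`, which is what makes a separate `relocate()` step unnecessary.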
As you might have guessed from their names, `pivot_wider()` and `pivot_longer()` are complements.
`pivot_longer()` makes wide tables narrower and longer; `pivot_wider()` makes long tables shorter and wider.
### Exercises
1. Why are `pivot_longer()` and `pivot_wider()` not perfectly symmetrical?\
@ -586,7 +651,13 @@ who_tidy
This tidy data frame allows us to explore the data with more ease than the original `who` dataset.
For example, we can easily filter for a particular type of TB for a given country and sum over the number of cases to see how case numbers for this type of TB have evolved over the years.
```{r fig.alt = "A scatterplot of number of smear positive pulmonary TB cases in the US over the years, with year on the x-axis ranging from 1995 to 2013 and yearly total number of cases on the y-axis ranging from 3000 to 8000. The points on the scatterplot are overlaid with a smooth curve, which shows a strong, negative association between the two variables."}
```{r}
#| fig.alt: >
#| A scatterplot of number of smear positive pulmonary TB cases in the
#| US over the years, with year on the x-axis ranging from 1995 to 2013
#| and yearly total number of cases on the y-axis ranging from 3000 to
#| 8000. The points on the scatterplot are overlaid with a smooth curve,
#| which shows a strong, negative association between the two variables.
who_tidy |>
filter(diagnosis == "sp", country == "United States of America") |>
group_by(year) |>