# Data tidying {#data-tidy}

## Introduction

> "Happy families are all alike; every unhappy family is unhappy in its own way." --- Leo Tolstoy

> "Tidy datasets are all alike, but every messy dataset is messy in its own way." --- Hadley Wickham

In this chapter, you will learn a consistent way to organize your data in R using a system called **tidy data**.
Getting your data into this format requires some work up front, but that work pays off in the long term.
Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.

This chapter will give you a practical introduction to tidy data and the accompanying tools in the **tidyr** package.
If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.

### Prerequisites

In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets.
tidyr is a member of the core tidyverse.

```{r setup, message = FALSE}
library(tidyverse)
```

From this chapter on, we'll suppress the loading message from `library(tidyverse)`.

## Tidy data

You can represent the same underlying data in multiple ways.
The example below shows the same data organised in four different ways.
Each dataset shows the same values of four variables: *country*, *year*, *population*, and *cases*, but each dataset organizes the values in a different way.

```{r}
table1
table2
table3

# Spread across two tibbles
table4a # cases
table4b # population
```

These are all representations of the same underlying data, but they are not equally easy to use.
One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
There are three interrelated rules which make a dataset tidy:

1. Each variable is a column; each column is a variable.
2. Each observation is a row; each row is an observation.
3. Each value is a cell; each cell is a single value.

These three rules are interrelated because typically, by fixing one of them, you'll fix the other two.
Figure \@ref(fig:tidy-structure) shows the rules visually.
In the example above, only `table1` is tidy.

```{r tidy-structure, echo = FALSE, out.width = "100%"}
#| fig.cap: >
#|   Following three rules makes a dataset tidy: variables are columns,
#|   observations are rows, and values are cells.
#| fig.alt: >
#|   Three panels, each representing a tidy data frame. The first panel
#|   shows that each variable is a column. The second panel shows that each
#|   observation is a row. The third panel shows that each value is
#|   a cell.
knitr::include_graphics("images/tidy-1.png")
```

Why ensure that your data is tidy?
There are two main advantages:

1. There's a general advantage to picking one consistent way of storing data.
   If you have a consistent data structure, it's easier to learn the tools that work with it because they have an underlying uniformity.

2. There's a specific advantage to placing variables in columns because it allows R's vectorised nature to shine.
   As you learned in Sections \@ref(mutate-funs) and \@ref(summarise-funs), most built-in R functions work with vectors of values.
   That makes transforming tidy data feel particularly natural.

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
Here are a couple of small examples showing how you might work with `table1`.

```{r fig.width = 5}
#| fig.alt: >
#|   This figure shows the numbers of cases in 1999 and 2000 for
#|   Afghanistan, Brazil, and China, with year on the x-axis and number
#|   of cases on the y-axis. Each point on the plot represents the number
#|   of cases in a given country in a given year. The points for each
#|   country are differentiated from others by color and shape and connected
#|   with a line, resulting in three, non-parallel, non-intersecting lines.
#|   The numbers of cases in China are highest for both 1999 and 2000, with
#|   values above 200,000 for both years. The number of cases in Brazil is
#|   approximately 40,000 in 1999 and approximately 75,000 in 2000. The
#|   numbers of cases in Afghanistan are lowest for both 1999 and 2000, with
#|   values that appear to be very close to 0 on this scale.

# Compute rate per 10,000
table1 |>
  mutate(
    rate = cases / population * 10000
  )

# Compute cases per year
table1 |>
  count(year, wt = cases)

# Visualise changes over time
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000))
```

### Exercises

1. Using prose, describe how the variables and observations are organised in each of the sample tables.

2. Compute the `rate` for `table2`, and `table4a` + `table4b`.
   You will need to perform four operations:

   a. Extract the number of TB cases per country per year.
   b. Extract the matching population per country per year.
   c. Divide cases by population, and multiply by 10000.
   d. Store back in the appropriate place.

   Which representation is easiest to work with?
   Which is hardest?
   Why?

3. Recreate the plot showing change in cases over time using `table2` instead of `table1`.
   What do you need to do first?

## Pivoting

The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy.
Unfortunately, however, most data that you will encounter will be untidy.
There are two main reasons:

1. Most people aren't familiar with the principles of tidy data, and it's hard to derive them yourself unless you spend a *lot* of time working with data.

2. Data is often organised to facilitate some goal other than analysis.
   For example, data is often organised to make collection as easy as possible.

This means for most real analyses, you'll need to do some tidying.
The first step is always to figure out what the variables and observations are.
Sometimes this is easy; other times you'll need to consult with the people who originally generated the data.
The next step is to **pivot** your data to make sure that the variables are in the columns and the observations are in the rows.

tidyr provides two functions for pivoting data: `pivot_longer()`, which makes datasets **longer** by increasing the number of rows and decreasing the number of columns, and `pivot_wider()`, which makes datasets **wider** by increasing the number of columns and decreasing the number of rows.
`pivot_longer()` is most useful for getting data into a tidy form.
`pivot_wider()` is less commonly needed to make data tidy, but it can be useful for making non-tidy data (we'll come back to this in Section \@ref(non-tidy-data)).

The following sections work through the use of `pivot_longer()` and `pivot_wider()` to tackle a wide range of realistic datasets.
These examples are drawn from `vignette("pivot", package = "tidyr")`, which includes more variations and more challenging problems.

### String data in column names {#pew}

The `relig_income` dataset stores counts based on a survey which (among other things) asked people about their religion and annual income:

```{r}
relig_income
```

This dataset contains three variables:

- `religion`, stored in the rows,
- `income`, spread across the column names, and
- `count`, stored in the cells.

To tidy it we use `pivot_longer()`:

```{r}
relig_income %>%
  pivot_longer(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  )
```

- `cols` describes which columns need to be reshaped.
  In this case, it's every column apart from `religion`.
  It uses the same syntax as `select()`, so other selections work too (see the sketch below).

- `names_to` gives the name of the variable that will be created from the data stored in the column names, i.e. `income`.

- `values_to` gives the name of the variable that will be created from the data stored in the cell values, i.e. `count`.

Neither the `names_to` nor the `values_to` column exists in `relig_income`, so we provide them as strings surrounded by quotes.

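Because `cols` uses tidyselect syntax, other selections would work just as well. For example, since every column apart from `religion` holds a numeric count, a selection helper like `where()` gives the same result. A sketch:

```{r}
# Equivalent to cols = !religion, because religion is the
# only non-numeric column in relig_income
relig_income %>%
  pivot_longer(
    cols = where(is.numeric),
    names_to = "income",
    values_to = "count"
  )
```
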
### Numeric data in column names {#billboard}

The `billboard` dataset records the billboard rank of songs in the year 2000.
It has a form similar to the `relig_income` data, but there are a lot of missing values because there are 76 columns, making it possible to track a song for up to 76 weeks.
Songs that stay in the chart for less time than that get padded out with missing values.

```{r}
billboard
```

This time there are five variables:

- `artist`, `track`, and `date.entered` are already columns,
- `week` is spread across the columns, and
- `rank` is stored in the cells.

There are a few ways we could specify which `cols` need to be pivoted.
One option would be to copy the previous usage and do `!c(artist, track, date.entered)`.
But the variables in this case have a common prefix, so it's a nice opportunity to use `starts_with()`:

```{r}
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
```

There's one new argument here: `values_drop_na`.
It tells `pivot_longer()` to drop the rows that correspond to missing values, because in this case we know they're not meaningful.

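To get a feel for what's being dropped, here's a quick sketch that keeps the missing values and counts them. The exact numbers aren't important; the point is that most song-week combinations are empty:

```{r}
# Same pivot, but keeping the NAs so we can count them
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  ) %>%
  summarise(
    total = n(),
    missing = sum(is.na(rank))
  )
```
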
If you look closely at the output you'll notice that `week` is a character vector, but it'd make future computation a bit easier if it were a number.
We can do this in two steps: first we use the `names_prefix` argument to strip off the `wk` prefix, then we use `mutate()` + `as.integer()` to convert the string into a number:

```{r}
billboard_tidy <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  ) |>
  mutate(week = as.integer(week))
billboard_tidy
```

Now we're in a good position to look at the typical course of a song's rank by drawing a plot.

```{r}
#| fig.alt: >
#|   A line plot with week on the x-axis and rank on the y-axis, where
#|   each line represents a song. Most songs appear to start at a high rank,
#|   rapidly accelerate to a low rank, and then decay again. There are
#|   surprisingly few tracks in the region when week is >20 and rank is
#|   >50.
billboard_tidy |>
  ggplot(aes(week, rank, group = track)) +
  geom_line(alpha = 1/3) +
  scale_y_reverse()
```

### Many variables in column names

A more challenging situation occurs when you have multiple variables crammed into the column names.
For example, take this minor variation on the `who` dataset:

```{r}
who2 <- who |>
  rename_with(~ str_remove(.x, "new_?")) |>
  rename_with(~ str_replace(.x, "([mf])", "\\1_")) |>
  select(!starts_with("iso"))
who2
```

I've used regular expressions to make the problem a little simpler; you'll learn how they work in Chapter \@ref(regular-expressions).

There are six variables in this dataset:

- `country` and `year` are already in columns.
- The columns from `sp_m_014` to `rel_f_65` encode three variables in their names:

  - `sp`/`rel`/`ep` describe the method used for the `diagnosis`.
  - `m`/`f` gives the `gender`.
  - `014`/`1524`/`2535`/`3544`/`4554`/`65` is the `age` range.

- The case `count` is in the cells.

This requires a slightly more complicated call to `pivot_longer()`, where `names_to` gets a vector of column names and `names_sep` describes how to split the variable name up into pieces:

```{r}
who2 %>%
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
```

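Here the pieces are separated by `_`, so `names_sep` is all we need. If the names followed a more complicated convention, you could instead supply a regular expression to `names_pattern`, where each group becomes one of the `names_to` variables. A sketch that's equivalent to the call above:

```{r}
# Each capture group in names_pattern feeds one names_to variable
who2 %>%
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "(.*)_(.*)_(.*)",
    values_to = "count"
  )
```
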
### Multiple observations per row

So far we have been working with data frames that have one observation per row, but many important pivoting problems involve multiple observations per row.
You can usually recognize this case because the name of the column that you want to appear in the output is part of the column name in the input.
In this section, you'll learn how to pivot this sort of data.

The following example is adapted from the [data.table vignette](https://CRAN.R-project.org/package=data.table/vignettes/datatable-reshape.html):

```{r}
family <- tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~name_child1, ~name_child2,
  1,       "1998-11-26", "2000-01-29", "Susan",      "Jose",
  2,       "1996-06-22", NA,           "Mark",       NA,
  3,       "2002-07-11", "2004-04-05", "Sam",        "Seth",
  4,       "2004-10-10", "2009-08-27", "Craig",      "Khai",
  5,       "2000-12-05", "2005-02-28", "Parker",     "Gracie"
)
family <- family %>%
  mutate(across(starts_with("dob"), parse_date))
family
```

There are four variables here:

- `family` is already a column.
- `child` is part of the column name.
- `dob` and `name` are stored as cell values.

This problem is hard because the column names contain both the name of a variable (`dob`, `name`) and the value of a variable (`child1`, `child2`).
So again we need to supply a vector to `names_to`, but now we use the special `".value"`[^data-tidy-1] name to indicate that the first component should become a column name.

[^data-tidy-1]: Calling this `.value` instead of `.variable` seems confusing so I think we'll change it: <https://github.com/tidyverse/tidyr/issues/1326>

```{r}
family %>%
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )
```

Note the use of `values_drop_na = TRUE`, since again the input shape forces the creation of explicit missing values for observations that don't exist (families with only one child).

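One payoff of this shape is that each child is now an observation, so ordinary dplyr verbs apply directly. For example, here's a sketch that finds the oldest child across all families; `family_tidy` is just a name I've picked for the saved result:

```{r}
family_tidy <- family %>%
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )

# The row with the earliest date of birth
family_tidy %>%
  slice_min(dob)
```
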
### Tidy census

So far we've focused on `pivot_longer()`, which helps solve the common class of problems where variable values have ended up in the column names.
Next we'll pivot (HA HA) to `pivot_wider()`, which helps when one observation is spread across multiple rows.
For example, the `us_rent_income` dataset contains information about median income and rent for each state in the US for 2017 (from the American Community Survey, retrieved with the [tidycensus](https://walker-data.com/tidycensus/) package).

```{r}
us_rent_income
```

Here it starts to get a bit philosophical as to what the variables are, but I'd say:

- `GEOID` and `NAME`, which are already columns.
- The `estimate` and margin of error (`moe`) for each of `rent` and `income`, i.e. `income_estimate`, `income_moe`, `rent_estimate`, `rent_moe`.

We can get most of the way there with a simple call to `pivot_wider()`:

```{r}
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )
```

However, there are two problems:

- We want (e.g.) `income_estimate`, not `estimate_income`.
- We want `_estimate` then `_moe` for each variable, not all the estimates then all the margins of error.

We can fix both problems by providing a custom glue specification for creating the variable names, and asking for the variable names to vary slowest rather than the default of fastest:

```{r}
us_rent_income %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe),
    names_glue = "{variable}_{.value}",
    names_vary = "slowest"
  )
```

We'll see a couple more places where `pivot_wider()` is useful in the next section, where we work through examples that require both `pivot_longer()` and `pivot_wider()`.

## Case studies

Some problems can't be solved by pivoting in a single direction.
The two examples in this section show how you might combine both `pivot_longer()` and `pivot_wider()` to solve more complex problems.

### World Bank

`world_bank_pop` contains data from the World Bank about population per country from 2000 to 2017.

```{r}
world_bank_pop
```

My goal is to produce a tidy dataset where each variable is in a column, but I don't know exactly what variables exist so I'm not sure what I'll need to do.
However, there's one obvious problem to start with: year is spread across multiple columns.
I'll fix this with `pivot_longer()`:

```{r}
pop2 <- world_bank_pop %>%
  pivot_longer(
    cols = `2000`:`2017`,
    names_to = "year",
    values_to = "value"
  )
pop2
```

Next we need to consider the `indicator` variable:

```{r}
pop2 %>%
  count(indicator)
```

There are only four possible values, so I did a little digging and discovered that:

- `SP.POP.GROW` is population growth,
- `SP.POP.TOTL` is total population,
- `SP.URB.GROW` is population growth in urban areas, and
- `SP.URB.TOTL` is total population in urban areas.

To me, this feels like it could be broken down into three variables:

- `GROW`: population growth
- `TOTL`: total population
- `area`: whether the statistics apply to the complete country or just urban areas.

So I'll first separate `indicator` into these pieces:

```{r}
pop3 <- pop2 %>%
  separate(indicator, c(NA, "area", "variable"))
pop3
```

(You'll learn more about this function in Chapter \@ref(strings).)

Now we can complete the tidying by pivoting `variable` and `value` to make `TOTL` and `GROW` columns:

```{r}
pop3 %>%
  pivot_wider(
    names_from = variable,
    values_from = value
  )
```

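If you wanted the areas side by side as well, you could give `names_from` both `area` and `variable`; the pieces are combined with `_` by default. Here's a sketch that uses this to compute the share of each country's population living in urban areas (`urban_share` is a name I've made up for illustration):

```{r}
# names_from with two columns produces POP_GROW, POP_TOTL,
# URB_GROW, and URB_TOTL columns
pop3 %>%
  pivot_wider(
    names_from = c(area, variable),
    values_from = value
  ) %>%
  mutate(urban_share = URB_TOTL / POP_TOTL)
```
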
### Multi-choice

The final example shows a dataset inspired by [Maxime Wack](https://github.com/tidyverse/tidyr/issues/384), which requires us to deal with a common, but annoying, way of recording multiple choice data.
Often you will get such data as follows:

```{r}
multi <- tribble(
  ~id, ~choice1, ~choice2, ~choice3,
  1,   "A",      "B",      "C",
  2,   "C",      "B",      NA,
  3,   "D",      NA,       NA,
  4,   "B",      "D",      NA
)
```

Here the actual order isn't important, and you'd prefer to have the individual responses in the columns.
You can achieve the desired transformation in two steps.
First, you make the data longer, eliminating the explicit `NA`s with `values_drop_na`, and adding a column to indicate that this response was chosen:

```{r}
multi2 <- multi %>%
  pivot_longer(
    cols = !id,
    values_drop_na = TRUE
  ) %>%
  mutate(selected = TRUE)
multi2
```

Then you make the data wider, filling in the missing observations with `FALSE`:

```{r}
multi2 %>%
  pivot_wider(
    id_cols = id,
    names_from = value,
    values_from = selected,
    values_fill = FALSE
  )
```

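Once each option is a logical column, summaries are easy. For example, a sketch counting how many respondents selected each option, assuming the wide result is saved as `multi3` (a name made up here):

```{r}
multi3 <- multi2 %>%
  pivot_wider(
    id_cols = id,
    names_from = value,
    values_from = selected,
    values_fill = FALSE
  )

# Summing logicals counts the TRUEs, i.e. selections per option
multi3 %>%
  summarise(across(!id, sum))
```
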
## Non-tidy data

Before we continue on to other topics, it's worth talking briefly about non-tidy data.
Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data.
That's an oversimplification: there are lots of useful and well-founded data structures that are not tidy data.
There are two main reasons to use other data structures:

- Alternative representations may have substantial performance or space advantages.

- Specialised fields have evolved their own conventions for storing data that may be quite different to the conventions of tidy data.

Either of these reasons means you'll need something other than a tibble (or data frame).
If your data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice.
But there are good reasons to use other structures; tidy data is not the only way.

For example, take the tidy `fish_encounters` dataset, which describes when fish swimming down a river are detected by automatic monitoring stations:

```{r}
fish_encounters
```

Many tools used to analyse this data need it in a non-tidy form where each station is a column.
`pivot_wider()` makes it easy to get our tidy dataset into this form:

```{r}
fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen
  )
```

This dataset only records when a fish was detected by a station; it doesn't record when it wasn't detected (this is common with this type of data).
That means the output data is filled with `NA`s.
However, in this case we know that the absence of a record means that the fish was not `seen`, so we can ask `pivot_wider()` to fill these missing values in with zeros:

```{r}
fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  )
```

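This wide form is also a short step from the matrix representation that many modelling and clustering functions expect. A sketch, assuming you want fish as row names and stations as columns:

```{r}
# column_to_rownames() (from tibble) returns a data.frame with
# the fish IDs as row names; as.matrix() then drops the data
# frame structure entirely
fish_encounters %>%
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  ) %>%
  column_to_rownames("fish") %>%
  as.matrix()
```
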
If you'd like to learn more about non-tidy data, I'd highly recommend this thoughtful blog post by Jeff Leek: <https://simplystatistics.org/posts/2016-02-17-non-tidy-data>.