Final tidy polishing

hadley 2016-07-27 08:23:28 -05:00
parent 6fdcf51930
commit f9e51a7096
2 changed files with 78 additions and 78 deletions

152
tidy.Rmd

@ -8,7 +8,7 @@
> "Tidy datasets are all alike, but every messy dataset is messy in its
> own way." -- Hadley Wickham
In this chapter, you will learn a consistent way to organise your data in R, a organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long-term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
In this chapter, you will learn a consistent way to organise your data in R, an organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
This chapter will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package. If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.
@ -16,7 +16,7 @@ This chapter will give you a practical introduction to tidy data and the accompa
In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. We'll also need to use a little dplyr, as is common when tidying data.
```{r setup}
```{r setup, message = FALSE}
library(tidyr)
library(dplyr)
```
@ -41,7 +41,9 @@ These are all representations of the same underlying data, but they are not equa
1. Each observation has its own row.
1. Each value has its own cell.
```{r, echo = FALSE, out.width = "100%"}
Figure \@ref(fig:tidy-structure) shows this visually.
```{r tidy-structure, echo = FALSE, out.width = "100%", fig.cap = "Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells."}
knitr::include_graphics("images/tidy-1.png")
```
@ -50,19 +52,21 @@ These three rules are interrelated because it's impossible to only satisfy two o
1. Put each dataset in a tibble.
1. Put each variable in a column.
There are two advantages to tidy data:
In this example, only `table1` is tidy. It's the only representation where each column is a variable.
1. There's a general advantage to just picking one consistent way of storing
Why ensure that your data is tidy? There are two main advantages:
1. There's a general advantage to picking one consistent way of storing
data. If you have a consistent data structure, it's easier to learn the
tools that work with it because they have an underlying uniformity.
1. There's a specific advantage to placing variables in columns because
it allows R's vectorised nature to shine. As you learned in [useful
creation functions] and [useful summary functions], most built-in R
functions work with a vector of values. That makes transforming tidy
data feel particularly natural.
it allows R's vectorised nature to shine. As you learned in
[mutate](#mutate-funs) and [summary functions](#summarise-funs), most
built-in R functions work with vectors of values. That makes transforming
tidy data feel particularly natural.
In this example, it's `table1` that has the tidy representation, because each of the four columns represents a variable. This form is the easiest to work with in dplyr or ggplot2. It's also well suited for modelling, as you'll learn later. In fact, the way that R's modelling functions work was an inspiration for the tidy data format. Here are a couple of small examples of how you might work with this data. Think about how you'd achieve the same result with the other representations.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with `table1`. Think about how you'd achieve the same result with the other representations.
```{r}
# Compute rate
@ -89,11 +93,11 @@ ggplot(table1, aes(year, cases)) +
You will need to perform four operations:
1. Extract the number of TB cases per country per year.
2. Extract the matching population per country per year.
3. Divide cases by population, and multiply by 10000.
5. Store back in the appropriate place.
1. Extract the matching population per country per year.
1. Divide cases by population, and multiply by 10000.
1. Store back in the appropriate place.
Which is easiest? Which is hardest?
Which representation is easiest to work with? Which is hardest? Why?
1. Recreate the plot showing change in cases over time using `table2`
instead of `table1`. What do you need to do first?
@ -102,28 +106,34 @@ ggplot(table1, aes(year, cases)) +
The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, while the principles are obvious in hindsight, it took Hadley over 5 years of struggling with many datasets to figure out these very simple principles. Most datasets that you will encounter in real life will not be tidy, either because the creator was not aware of the principles of tidy data, or because the data is stored in order to make data entry, not data analysis, easy.
The first step to tidying any dataset is to study it and figure out what the variables are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data.
The first step to tidying any dataset is to study it and figure out what the variables are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data. Once you've identified the variables, you can start wrangling the data so each variable forms a column.
One of the most messy-data common problems is that you'll some variables will not be in the columns: one variable might be spread across multiple columns, or you might find that the variables for one observation are scattered across multiple rows. To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`.
There are two common problems that it's best to solve first:
1. One variable might be spread across multiple columns.
1. One observation might be spread across multiple rows.
To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`.
### Gathering
A common problem is a dataset where some of the column names are not names of a variable, but _values_ of a variable. Take `table4a`, for example, the column names `1991` and `2000` represent values of the `year` variable.
A common problem is a dataset where some of the column names are not names of a variable, but _values_ of a variable. Take `table4a`: the column names `1999` and `2000` represent values of the `year` variable. Each row represents two observations, not one.
```{r}
table4a
```
To tidy a dataset like these, we need to __gather__ those column into a new pair of columns. To describe that operation we need three parameters:
To tidy a dataset like this, we need to __gather__ those columns into a new pair of variables. To describe that operation we need three parameters:
* The set of columns that represent values, not variables. In this example,
those are the columns `1999` and `2000`.
* The name of variable that the column names represent, the `key`. In this
example, that's the `year`.
* The name of the variable whose values form the column names. I call that
the `key`, and here it is `year`.
* The name of the variable that the cell values represent, the `value`.
Here, that's the number of `cases`.
* The name of the variable whose values are spread over the cells. I call
that `value`, and here it's the number of `cases`.
Together those parameters generate the call to `gather()`:
@ -131,11 +141,11 @@ Together those parameters generate the call to `gather()`:
table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
```
The columns to gather are specified with `dplyr::select()` style notation. Here there are only two columns, so we list them by name. 1999 and 2000 are non-syntactic names so we have to surround in backticks. To refresh your memory of the other ways you can select columns, see [select](#select).
The columns to gather are specified with `dplyr::select()` style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see [select](#select).
In the final result, the gathered columns are dropped, and we get new `key` and `value` variables. Otherwise, the relationships between the original variables are preserved.
In the final result, the gathered columns are dropped, and we get new `key` and `value` columns. Otherwise, the relationships between the original variables are preserved. Visually, this is shown in Figure \@ref(fig:tidy-gather).
```{r, echo = FALSE, out.width = "100%"}
```{r tidy-gather, echo = FALSE, out.width = "100%", fig.cap = "Gathering `table4` into a tidy form."}
knitr::include_graphics("images/tidy-9.png")
```
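As a further illustration of the three parameters, here is a minimal sketch on a made-up dataset. The tibble `scores` and its columns are hypothetical and are not part of the chapter's data. Only `gather()` and `tribble()` are real functions, from tidyr and tibble respectively.

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: one row per student, one column per test year.
scores <- tribble(
  ~student, ~`2015`, ~`2016`,
  "Ana",         81,      88,
  "Ben",         74,      79
)

# Gather the two year columns into a key column (`year`) and a value
# column (`score`). Each student now contributes one row per year.
tidy_scores <- scores %>%
  gather(`2015`, `2016`, key = "year", value = "score")

tidy_scores
```

The column names become the values of `year`, stored as character strings, and the cell values move into `score`.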
@ -148,50 +158,37 @@ table4b %>% gather(`1999`, `2000`, key = "year", value = "population")
To combine the tidied versions of `table4a` and `table4b` into a single tibble, we need to use `dplyr::left_join()`, which you'll learn about in [relational data].
```{r}
tidy4a <- table4a %>% gather("year", "cases", `1999`:`2000`)
tidy4b <- table4b %>% gather("year", "population", `1999`:`2000`)
tidy4a <- table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>% gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
```
### Spreading
Spreading is the opposite of gathering. You use it when the variables for one observation are scattered across multiple rows. For example, take `table2`. An observation is a country in a year, but each observation is spread across two rows.
Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows. For example, take `table2`: an observation is a country in a year, but each observation is spread across two rows.
```{r}
table2
```
To tidy this up, we perform a similar operation to `gather()`. We need to identify which column:
To tidy this up, we first analyse the representation in a similar way to `gather()`. This time, however, we only need two parameters:
* Which column gives the name of the variable, the `key`. Here, it's `key`.
* Which column gives the value of the variable, the `value`. Here's `value`.
* The column that contains variable names, the `key` column. Here, it's
`type`.
Once we've figured that out, we can use `spread()`:
* The column that contains values from multiple variables, the `value`
column. Here it's `count`.
Once we've figured that out, we can use `spread()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-spread).
```{r}
spread(table2, key = key, value = value)
spread(table2, key = type, value = count)
```
Visually:
```{r, echo = FALSE, out.width = "100%"}
```{r tidy-spread, echo = FALSE, out.width = "100%", fig.cap = "Spreading `table2` makes it tidy"}
knitr::include_graphics("images/tidy-8.png")
```
Real-life datasets aren't usually labelled so helpfully. Here's another simple example:
```{r}
weather <- frame_data(
~day, ~measurement, ~record,
"Jan 1", "temp", 31,
"Jan 1", "precip", 0,
"Jan 2", "temp", 35,
"Jan 2", "precip", 5
)
weather %>%
spread(key = measurement, value = record)
```
As you might have guessed from the common `key` and `value` arguments, `spread()` and `gather()` are complements. `gather()` makes wide tables narrower and longer; `spread()` makes long tables shorter and wider.
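To see the complementary behaviour concretely, here is a small sketch that spreads a long table and then gathers it back. The `sales` tibble is hypothetical and exists only for this illustration.

```{r}
library(tidyr)
library(tibble)

# Hypothetical long-form data: each store is split across two rows.
sales <- tribble(
  ~store, ~metric,   ~amount,
  "A",    "revenue",     100,
  "A",    "cost",         60,
  "B",    "revenue",      80,
  "B",    "cost",         50
)

# spread() makes the table shorter and wider: one row per store.
wide <- sales %>%
  spread(key = metric, value = amount)

# gather() reverses the operation: the table becomes longer and narrower.
long <- wide %>%
  gather(cost, revenue, key = "metric", value = "amount")
```

The round trip recovers the same values, although the row order of `long` may differ from the original `sales`.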
### Exercises
@ -248,26 +245,30 @@ As you might have guessed from the common `key` and `value` arguments, `spread()
## Separating and uniting
You've learned how to tidy `table2` and `table4`, but not `table3`. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. You'll also learn about inverse of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
So far you've learned how to tidy `table2` and `table4`, but not `table3`. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. You'll also learn about the inverse of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
### Separate
`separate()` pulls apart one column into multiple variables, by separating wherever a separator character appears.
![](images/tidy-17.png)
We need to use `separate()` to tidy `table3`, which combines values of *cases* and *population* in the same column. `separate()` take a data frame, the name of the column to separate, and the names of the columns to seperate into:
`separate()` pulls apart one column into multiple variables, by separating wherever a separator character appears. Take `table3`:
```{r}
table3
```
The `rate` column contains both the `cases` and `population` variables, and we need to split it into two variables. `separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
```{r}
table3 %>%
separate(rate, into = c("cases", "population"))
```
```{r tidy-separate, echo = FALSE, out.width = "100%", fig.cap = "Separating `table3` makes it tidy"}
knitr::include_graphics("images/tidy-17.png")
```
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as:
```{r eval=FALSE}
```{r eval = FALSE}
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
```
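The same idea can be sketched on a made-up tibble. `results` and its columns are hypothetical. `sep` and `convert` are real arguments of `separate()`, and `convert = TRUE` is used here on the assumption that you want the split pieces converted from character to a more suitable type.

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: `ratio` packs two numbers into one string.
results <- tribble(
  ~team,  ~ratio,
  "red",  "12/40",
  "blue", "7/35"
)

# Split at the forward slash. Without `convert = TRUE`, both new columns
# would stay character vectors.
split_results <- results %>%
  separate(ratio, into = c("wins", "games"), sep = "/", convert = TRUE)

split_results
```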
@ -290,7 +291,7 @@ table3 %>%
### Unite
`unite()` does the opposite of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently that `separate()`, but it's still a useful tool to have in your back pocket.
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
![](images/tidy-18.png)
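A minimal sketch of `unite()`, using a hypothetical tibble `parts` invented for this illustration. By default `unite()` places an underscore between the pieces, so `sep = ""` is passed here to join them without a separator.

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: a four-digit year stored as two character columns.
parts <- tribble(
  ~id, ~century, ~year,
  1,   "19",     "99",
  2,   "20",     "00"
)

# Paste `century` and `year` together into one column, `full_year`.
# The input columns are dropped from the result.
joined <- parts %>%
  unite(full_year, century, year, sep = "")

joined
```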
@ -414,20 +415,20 @@ There's a wealth of epidemiological information in this dataset, but it's challe
who
```
This is a very typical example of data you are likely to encounter in real life. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy, and we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll typically need to string together multiple verbs.
This is a very typical example of data you are likely to encounter in real life. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy, and we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll usually need to string together multiple verbs into a pipeline.
The best place to start is almost always to gather together the columns that are not variables. Let's have a look at what we've got:
* It looks like `country`, `iso2`, and `iso3` are redundant ways of specifying
the same variable, the `country`.
* It looks like `country`, `iso2`, and `iso3` are three variables that
redundantly specify the country.
* `year` is clearly also a variable.
* We don't know what all the other columns are yet, but given the structure
in the variables (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`) these
are likely to be values, not variable names.
in the variable names (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`)
these are likely to be values, not variables.
So we need to gather together all the columns from `new_sp_m3544` to `newrel_f65`. We don't yet know what these things mean, so for now we'll use the generic names `key`. We know the cells repesent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.
So we need to gather together all the columns from `new_sp_m014` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells represent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.
```{r}
who1 <- who %>% gather(new_sp_m014:newrel_f65, key = "key", value = "cases",
@ -456,11 +457,11 @@ You might be able to parse this out by yourself with a little thought and some e
* `sp` stands for cases of pulmonary TB that could be diagnosed by
a pulmonary smear (smear positive)
3. The sixth letter describes the sex of TB patients. The dataset groups
3. The sixth letter gives the sex of TB patients. The dataset groups
cases by males (`m`) and females (`f`).
4. The remaining numbers describe the age group of TB patients. The dataset
groups cases into seven age groups:
4. The remaining numbers give the age group. The dataset groups cases into
seven age groups:
* `014` = 0 -- 14 years old
* `1524` = 15 -- 24 years old
@ -491,21 +492,23 @@ who3 %>% count(new)
who4 <- who3 %>% select(-new, -iso2, -iso3)
```
Next we'll split `sexage` up into `sex` and `age` by splitting after the first character:
Next we'll separate `sexage` into `sex` and `age` by splitting after the first character:
```{r}
who5 <- who4 %>% separate(sexage, c("sex", "age"), sep = 1)
who5
```
The `who` dataset is now tidy as each variable is a column. It is far from clean (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R. Typically you wouldn't assign each step to a new variable. Instead you'd join everything together in one big pipeline:
The `who` dataset is now tidy! It is far from clean (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R.
I've shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn't how you'd work interactively. Instead, you'd gradually build up a complex pipe:
```{r}
who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
```
@ -513,9 +516,8 @@ who %>%
1. In this case study I set `na.rm = TRUE` just to make it easier to
check that we had the correct values. Is this reasonable? Think about
how missing values are represented in this dataset. What's the difference
between an `NA` and zero? Do you think we should use `fill = 0` in
the final `spread()` step?
how missing values are represented in this dataset. Are there implicit
missing values? What's the difference between an `NA` and zero?
1. What happens if you neglect the `mutate()` step?
@ -528,9 +530,7 @@ who %>%
## Non-tidy data
Before we continue on to other topics, it's worth talking a little bit about non-tidy data. Early in the chapter, I used the perjorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well founded data structures that are not tidy data.
There are two mains reasons to use other data structures:
Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well-founded data structures that are not tidy data. There are two main reasons to use other data structures:
* Alternative representations may have substantial performance or space
advantages.

View File

@ -355,7 +355,7 @@ transmute(flights,
)
```
### Useful creation functions
### Useful creation functions {#mutate-funs}
There are many functions for creating new variables that you can use with `mutate()`. The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output. There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
@ -655,7 +655,7 @@ batters %>% arrange(desc(ba))
You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
### Useful summary functions
### Useful summary functions {#summarise-funs}
Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions: