Proofing tidy & relational data

This commit is contained in:
hadley 2016-08-12 08:58:32 -05:00
parent 6da4da4a54
commit e1a49849d4
2 changed files with 78 additions and 53 deletions


@@ -4,19 +4,19 @@
It's rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you're interested in. Collectively, multiple tables of data are called __relational data__ because it is the relations, not just the individual datasets, that are important.
Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair. Sometimes both elements of a pair can be the same table! This is needed if, for example, you have a table of people, and each person has a reference to their parents.
To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:
* __Mutating joins__, which add new variables to one data frame from matching
observations in another.
* __Filtering joins__, which filter observations from one data frame based on
whether or not they match an observation in the other table.
* __Set operations__, which treat observations as if they were set elements.
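A minimal sketch of what each family looks like in dplyr, using two small made-up tibbles that share a `key` column:

```{r}
library(dplyr)

x <- tibble::tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble::tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Mutating join: copies val_y into x wherever the keys match
left_join(x, y, by = "key")

# Filtering join: keeps the rows of x that have a match in y
semi_join(x, y, by = "key")

# Set operation: rows that appear in both inputs (inputs must share variables)
intersect(x["key"], y["key"])
```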
The most common place to find relational data is in a _relational_ database management system (or RDBMS), a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren't commonly needed for data analysis.
### Prerequisites
@@ -27,8 +27,6 @@ library(nycflights13)
library(dplyr)
```
We'll also use the `str_c()` function from stringr, but rather than loading the complete package just to access this one function, we'll use it as `stringr::str_c()`.
## nycflights13 {#nycflights13-relational}
We will use the nycflights13 package to learn about relational data. nycflights13 contains four data frames that are related to the `flights` table that you used in [data transformation]:
@@ -86,7 +84,7 @@ For nycflights13:
would you need to combine?
1. I forgot to draw the relationship between `weather` and `airports`.
What is the relationship and how should it appear in the diagram?
1. `weather` only contains information for the origin (NYC) airports. If
it contained weather records for all airports in the USA, what additional
@@ -147,12 +145,15 @@ A primary key and the corresponding foreign key in another table form a __relati
### Exercises
1. Add a surrogate key to `flights`.
1. Identify the keys in the following datasets:
1. `Lahman::Batting`
1. `babynames::babynames`
1. `nasaweather::atmos`
1. `fueleconomy::vehicles`
1. `ggplot2::diamonds`
(You might need to install some packages and read some documentation.)
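One way to approach exercises like these is to count every candidate key and look for duplicates; zero rows back means the combination uniquely identifies each row. A sketch, assuming the relevant packages are installed:

```{r}
library(dplyr)

# (playerID, yearID, stint) as a candidate key for Lahman::Batting
Lahman::Batting %>%
  count(playerID, yearID, stint) %>%
  filter(n > 1)

# If no combination of variables works, add a surrogate key with row_number()
flights_keyed <- nycflights13::flights %>%
  mutate(flight_id = row_number())
```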
@@ -175,20 +176,22 @@ flights2 <- flights %>%
flights2
```
(Remember, when you're in RStudio, you can also use `View()` to avoid this problem).
Imagine you want to add the full airline name to the `flights2` data. You can combine the `airlines` and `flights2` data frames with `left_join()`:
```{r}
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
```
The result of joining airlines to flights is an additional variable: `name`. This is why I call this type of join a mutating join. In this case, you could have got to the same place using `mutate()` and R's base subsetting:
```{r}
flights2 %>%
select(-origin, -dest) %>%
mutate(name = airlines$name[match(carrier, airlines$carrier)])
```
But this is hard to generalise when you need to match multiple variables, and takes close reading to figure out the overall intent.
@@ -240,7 +243,7 @@ x %>%
inner_join(y, by = "key")
```
The most important property of an inner join is that unmatched rows are not included in the result. This means that inner joins are usually not appropriate for use in analysis because it's too easy to lose observations.
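You can see rows disappearing by comparing an inner join with a left join on small made-up tibbles:

```{r}
library(dplyr)

x <- tibble::tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble::tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# key 3 silently vanishes from the result
x %>% inner_join(y, by = "key")

# key 3 survives, with an NA for val_y
x %>% left_join(y, by = "key")
```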
### Outer joins {#outer-join}
@@ -364,9 +367,15 @@ So far, the pairs of tables have always been joined by a single variable, and th
coord_quickmap()
```
(Don't worry if you don't understand what `semi_join()` does --- you'll
learn about it next.)
You might want to use the `size` or `colour` of the points to display
the average delay for each airport.
1. Add the location of the origin _and_ destination (i.e. the `lat` and `lon`)
to `flights`.
1. Is there a relationship between the age of a plane and its delays?
1. What weather conditions make it more likely to see a delay?
@@ -477,6 +486,12 @@ flights %>%
tail numbers that don't have a matching record in `planes` have in common?
(Hint: one variable explains ~90% of the problems.)
1. Filter flights to only show flights with planes that have flown at least 100
flights.
1. Combine `fueleconomy::vehicles` and `fueleconomy::common` to find only the
records for the most common models.
1. Find the 48 hours (over the course of the whole year) that have the worst
delays. Cross-reference it with the `weather` data. Can you see any
patterns?
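A sketch of the filtering-join pattern that several of these exercises call for: first compute the set of interest, then use `semi_join()` to keep only the matching rows.

```{r}
library(dplyr)
library(nycflights13)

# Planes with at least 100 flights (dropping missing tail numbers first)
busy_planes <- flights %>%
  filter(!is.na(tailnum)) %>%
  count(tailnum) %>%
  filter(n >= 100)

# Keep only the flights made by those planes
flights %>%
  semi_join(busy_planes, by = "tailnum")
```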
@@ -496,11 +511,11 @@ The data you've been working with in this chapter has been cleaned up so that yo
unique in your current data but the relationship might not be true in
general.
For example, the altitude and longitude uniquely identify each airport,
but they are not good identifiers!
```{r}
airports %>% count(alt, lon) %>% filter(n > 1)
```
1. Check that none of the variables in the primary key are missing. If
@@ -519,9 +534,7 @@ Be aware that simply checking the number of rows before and after the join is no
## Set operations {#set-operations}
The final two-table verbs are the set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:
* `intersect(x, y)`: return only observations in both `x` and `y`.
* `union(x, y)`: return unique observations in `x` and `y`.
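The chunk below relies on `df1` and `df2`, which are defined earlier in the chapter; for reference, they are small two-column tibbles along these lines:

```{r}
df1 <- tibble::frame_data(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tibble::frame_data(
  ~x, ~y,
   1,  1,
   1,  2
)
```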
@@ -538,8 +551,11 @@ The four possibilities are:
```{r}
intersect(df1, df2)
# Note that we get 3 rows, not 4
union(df1, df2)
setdiff(df1, df2)
setdiff(df2, df1)
```


@@ -14,7 +14,7 @@ This chapter will give you a practical introduction to tidy data and the accompa
### Prerequisites
In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. We'll also need to use a pinch of dplyr, as is common when tidying data.
```{r setup, message = FALSE}
library(tidyr)
@@ -35,19 +35,21 @@ table4a # cases
table4b # population
```
These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier to work with inside the tidyverse.
There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
1. Each observation must have its own row.
1. Each value must have its own cell.
Figure \@ref(fig:tidy-structure) shows the rules visually.
```{r tidy-structure, echo = FALSE, out.width = "100%", fig.cap = "Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells."}
knitr::include_graphics("images/tidy-1.png")
```
These three rules are interrelated because it's impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:
1. Put each dataset in a tibble.
1. Put each variable in a column.
@@ -66,10 +68,10 @@ Why ensure that your data is tidy? There are two main advantages:
built-in R functions work with vectors of values. That makes transforming
tidy data feel particularly natural.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with `table1`.
```{r, out.width = "50%"}
# Compute rate per 10,000
table1 %>%
mutate(rate = cases / population * 10000)
@@ -104,21 +106,26 @@ ggplot(table1, aes(year, cases)) +
## Spreading and gathering
The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, however, most data that you will encounter will be untidy. There are two main reasons:
1. Most people aren't familiar with the principles of tidy data, and it's hard
to derive them yourself unless you spend a _lot_ of time working with data.
1. Data is often organised to facilitate some use other than analysis. For
example, data is often organised to make entry as easy as possible.
This means for most real analyses, you'll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data.
The second step is to resolve one of two common problems:
1. One variable might be spread across multiple columns.
1. One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems; it'll only suffer from both if you're really unlucky! To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`.
### Gathering
A common problem is a dataset where some of the column names are not names of variables, but _values_ of a variable. Take `table4a`: the column names `1999` and `2000` represent values of the `year` variable, and each row represents two observations, not one.
```{r}
table4a
@@ -144,13 +151,11 @@ table4a %>%
The columns to gather are specified with `dplyr::select()` style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see [select](#select).
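Concretely, the gathering call looks like this, with backticks around the non-syntactic names:

```{r}
table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
```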
```{r tidy-gather, echo = FALSE, out.width = "100%", fig.cap = "Gathering `table4` into a tidy form."}
knitr::include_graphics("images/tidy-9.png")
```
In the final result, the gathered columns are dropped, and we get new `key` and `value` columns. Otherwise, the relationships between the original variables are preserved. Visually, this is shown in Figure \@ref(fig:tidy-gather). We can use `gather()` to tidy `table4b` in a similar fashion. The only difference is the variable stored in the cell values:
```{r}
table4b %>%
@@ -223,7 +228,8 @@ As you might have guessed from the common `key` and `value` arguments, `spread()
gather(1999, 2000, key = "year", value = "cases")
```
1. Why does spreading this tibble fail? How could you add a new column to fix
the problem?
```{r}
people <- frame_data(
@@ -250,24 +256,24 @@ As you might have guessed from the common `key` and `value` arguments, `spread()
## Separating and uniting
So far you've learned how to tidy `table2` and `table4`, but not `table3`. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. You'll also learn about the complement of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
### Separate
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take `table3`:
```{r}
table3
```
The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables. `separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.
```{r}
table3 %>%
separate(rate, into = c("cases", "population"))
```
```{r tidy-separate, echo = FALSE, out.width = "75%", fig.cap = "Separating `table3` makes it tidy"}
knitr::include_graphics("images/tidy-17.png")
```
@@ -287,7 +293,9 @@ table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
```
You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings. When using integers to separate strings, the length of `sep` should be one less than the number of names in `into`.
You can use this arrangement to separate the last two digits of each year. This makes the data less tidy, but it's useful in other cases, as you'll see in a little bit.
```{r}
table3 %>%
@@ -298,12 +306,13 @@ table3 %>%
`unite()` is the inverse of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
```{r tidy-unite, echo = FALSE, out.width = "75%", fig.cap = "Uniting `table5` makes it tidy"}
knitr::include_graphics("images/tidy-18.png")
```
We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved as `tidyr::table5`. `unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
```{r}
table5 %>%
unite(new, century, year)
```
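By default `unite()` places an underscore between the values; here we really want no separator at all, which the `sep` argument handles:

```{r}
table5 %>%
  unite(new, century, year, sep = "")
```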
@@ -325,7 +334,7 @@ table5 %>%
separate(x, c("one", "two", "three"))
tibble::tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
1. Both `unite()` and `separate()` have a `remove` argument. What does it
@@ -412,7 +421,7 @@ treatment %>%
## Case Study
To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem. The `tidyr::who` dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available at <http://www.who.int/tb/country/data/download/en/>.
There's a wealth of epidemiological information in this dataset, but it's challenging to work with the data in the form that it's provided:
@@ -420,7 +429,7 @@ There's a wealth of epidemiological information in this dataset, but it's challe
who
```
This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy, and we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll usually need to string together multiple verbs into a pipeline.
The best place to start is almost always to gather together the columns that are not variables. Let's have a look at what we've got:
@@ -441,7 +450,7 @@ who1 <- who %>%
who1
```
We can get some hint of the structure of the values in the new `key` column by counting them:
```{r}
who1 %>%
@@ -477,7 +486,7 @@ You might be able to parse this out by yourself with a little thought and some e
* `5564` = 55 -- 64 years old
* `65` = 65 or older
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of `new_rel_` we have `newrel` (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the characters "newrel" with "new_rel". This makes all variable names consistent.
```{r}
who2 <- who1 %>%
@@ -493,7 +502,7 @@ who3 <- who2 %>%
who3
```
Then we might as well drop the `new` column because it's constant in this dataset. While we're dropping columns, let's also drop `iso2` and `iso3` since they're redundant.
```{r}
who3 %>%
@@ -510,11 +519,11 @@ who5 <- who4 %>%
who5
```
The `who` dataset is now tidy!
I've shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn't how you'd work interactively. Instead, you'd gradually build up a complex pipe:
```{r, results = "hide"}
who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%