Tidy chapter updates

hadley 2016-07-25 11:28:05 -05:00
parent f1cc2088f9
commit a2ff3ec52f
4 changed files with 397 additions and 265 deletions

@ -35,6 +35,7 @@ Imports:
tidyr
Remotes:
hadley/modelr,
hadley/tidyr,
hadley/readr,
hadley/stringr,
rstudio/bookdown,

@ -32,6 +32,17 @@ tibble(x = 1:5, y = 1, z = x ^ 2 + y)
`tibble()` automatically recycles inputs of length 1, and you can refer to variables that you just created. Compared to `data.frame()`, `tibble()` does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
It's possible for a tibble to have column names that are not valid R variable names, known as __non-syntactic__ names. For example, they might not start with a letter, or they might contain unusual characters like a space. To refer to these variables, you need to surround them with backticks, `` ` ``:
```{r}
tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
tb
```
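As a minimal sketch of how those backticks are used when pulling values back out, the block below rebuilds the same tibble and extracts two columns. The variable names `smile` and `number` are only illustrative.

```r
library(tibble)

tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)

# `$` needs backticks around a non-syntactic name
smile <- tb$`:)`

# `[[` takes the name as an ordinary string, so no backticks are needed
number <- tb[["2000"]]
```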
Another way to create a tibble is with `frame_data()`, which is customised for data entry in R code. Column headings are defined by formulas (`~`), and entries are separated by commas:
```{r}
@ -48,6 +59,21 @@ frame_data(
1. What does `enframe()` do? When might you use it?
1. Practice referring to non-syntactic names by:
1. Plotting a scatterplot of `1` vs `2`.
1. Creating a new column called `3` which is `2` divided by `1`.
1. Renaming the columns to `one`, `two` and `three`.
```{r}
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
```
## Tibbles vs. data frames
There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.
@ -84,6 +110,12 @@ nycflights13::flights %>%
You can see a complete list of options by looking at the package help: `package?tibble`.
Remember, you can also get a nicer view of the data set using RStudio's built-in data viewer. This is often useful at the end of a long chain of manipulations.
```{r, eval = FALSE}
nycflights13::flights %>% View()
```
### Subsetting
Tibbles are stricter about subsetting. If you try to access a variable that does not exist, you'll get a warning. Unlike data frames, tibbles do not use partial matching on column names:
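The difference in partial matching can be sketched with a made-up one-column table. The column name `abc` is hypothetical, and the warning is suppressed here only so the returned value is easy to inspect.

```r
library(tibble)

df <- data.frame(abc = 1)
tb <- tibble(abc = 1)

# A data frame partially matches `a` to the column `abc`
df_a <- df$a

# A tibble does not: it returns NULL and warns about the unknown column
tb_a <- suppressWarnings(tb$a)
```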

623
tidy.Rmd

@ -2,200 +2,116 @@
## Introduction
> "Happy families are all alike; every unhappy family is unhappy in its
> own way." -- Leo Tolstoy
> "Tidy datasets are all alike, but every messy dataset is messy in its
> own way." -- Hadley Wickham
Data science, at its heart, is a computer programming exercise. Data scientists use computers to store, transform, visualize, and model their data. Each computer program will expect your data to be organized in a predetermined way, which may vary from program to program. To be an effective data scientist, you will need to be able to reorganize your data to match the format required by your program.
In this chapter, you will learn a consistent way to organise your data in R, an organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
In this chapter, you will learn the best way to organize your data for R, a task that we call data tidying. Tidying your data will save you hours of time and make your data much easier to visualize, transform, and model with R.
Note that this chapter explains how to change the format, or layout, of tabular data. You will learn how to use different file formats with R in the next chapter, Import Data.
This chapter will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package. If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.
### Prerequisites
```{r}
In this chapter we'll focus on tidyr, a package that provides a bundle of tools to help tidy messy datasets. We'll also need to use a little dplyr, as is common when tidying data.
```{r setup}
library(tidyr)
library(dplyr)
```
## Tidy data
You can organize tabular data in many ways. For example, the datasets below show the same data organized in four different ways. Each dataset shows the same values of four variables, *country*, *year*, *population*, and *cases*, but each dataset organizes the values into a different layout. You can access the datasets in tidyr.
You can represent the same underlying data in multiple ways. For example, the datasets below show the same data organized in four different ways. Each dataset shows the same values of four variables, *country*, *year*, *population*, and *cases*, but each dataset organizes the values in a different way.
```{r}
# dataset one
table1
# dataset two
table2
# dataset three
table3
# Spread across two tibbles
table4a # cases
table4b # population
```
The last dataset is a collection of two tables.
These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier to work with inside the tidyverse. There are three interrelated rules which make a dataset tidy:
```{r}
# dataset four
table4 # cases
table5 # population
```
You might think that these datasets are interchangeable since they display the same information, but one dataset will be much easier to work with in R than the others.
Why should that be?
R follows a set of conventions that makes one layout of tabular data much easier to work with than others. Your data will be easier to work with in R if it follows three rules:
1. Each variable in the dataset is placed in its own column
2. Each observation is placed in its own row
3. Each value is placed in its own cell\*
Data that satisfies these rules is known as *tidy data*. Notice that `table1` is tidy data.
1. Each variable has its own column.
1. Each observation has its own row.
1. Each value has its own cell.
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-1.png")
```
*In `table1`, each variable is placed in its own column, each observation in its own row, and each value in its own cell.*
These three rules are interrelated because it's impossible to satisfy only two of the three. That interrelationship leads to an even simpler set of practical instructions:
Tidy data builds on a premise of data science that datasets contain *both values and relationships*. Tidy data displays the relationships in a dataset as consistently as it displays the values in a dataset.
1. Put each dataset in a tibble.
1. Put each variable in a column.
At this point, you might think that tidy data is so obvious that it is trivial. Surely, most datasets come in a tidy format, right? Wrong. In practice, raw data is rarely tidy and is much harder to work with as a result. *Section 2.4* provides a realistic example of data collected in the wild.
The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, while the principles are obvious in hindsight, it took Hadley over 5 years of struggling with many datasets to figure out these very simple principles. Most datasets that you will encounter in real life will not be tidy, either because the creator was not aware of the principles of tidy data, or because the data is stored to optimise
Tidy data works well with R because it takes advantage of R's traits as a vectorized programming language. Data structures in R are organized around vectors, and R's functions are optimized to work with vectors. Tidy data takes advantage of both of these traits.
Tidy data arranges values so that the relationships between variables in a dataset will parallel the relationship between vectors in R's storage objects. R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list. In tidy data, each variable in the dataset is assigned to its own column, i.e., its own vector in the data frame.
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-2.png")
```
*A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector.*
As a result, you can extract all the values of a variable in a tidy dataset by extracting the column vector that contains the variable. You can do this easily with R's list syntax, i.e.
Once you have your data in tidy form, it's easy to manipulate it with dplyr or visualise it with ggplot2:
```{r}
table1$cases
# Compute rate
table1 %>%
  mutate(rate = cases / population * 10000)
# Compute cases per year
table1 %>%
  count(year, wt = cases)
# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country))
```
R will return the values as an atomic vector, one of the most versatile data structures in R. Many functions in R are written to take atomic vectors as input, as are R's mathematical operators. This adds up to an easy user experience; you can extract and manipulate the values of a variable in tidy data with concise, simple code, e.g.,
There are two advantages to tidy data:
```{r}
mean(table1$cases)
table1$cases / table1$population * 10000
```
1. There's a general advantage to just picking one consistent way of storing
data. If you have a consistent data structure, you can design tools that
work with that data without having to translate it into different
structures.
1. There's a specific advantage to placing variables in columns because
it allows R's vectorised nature to shine. As you learned in [useful
creation functions] and [useful summary functions], most built-in R
functions work with a vector of values. That makes transforming tidy
data feel particularly natural.
Tidy data also takes advantage of R's vectorized operations. In R, you often supply one or more vectors of values as input to a function or mathematical operator. Often, the function or operator will use the vectors to create a new vector of values as output, e.g.
As you'll learn later, tidy data is also very well suited for modelling, and in fact, the way that R's modelling functions work was an inspiration for the tidy data format.
```{r eval = FALSE}
table1$population # a vector
table1$cases # a vector
### Exercises
# people per case
table1$population / table1$cases # a vector of output
```
1. Using prose, describe how the variables and observations are organised in
each of the sample tables.
```{r echo = FALSE}
table1$population / table1$cases
```
1. Compute the `rate` for `table2`, and `table4a` + `table4b`.
You will need to perform four operations:
To create the output, R applies the function in element-wise fashion: R first applies the function (or operation) to the first elements of each vector involved. Then R applies the function (or operation) to the second elements of each vector involved, and so on until R reaches the end of the vectors. If one vector is shorter than the others, R will recycle its values as needed (according to a set of recycling rules).
1. Extract the number of TB cases per country per year.
2. Extract the matching population per country per year.
3. Divide cases by population, and multiply by 10000.
4. Store back in the appropriate place.
Which is easiest? Which is hardest?
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-3.png")
```
1. Recreate the plot showing change in cases over time using `table2`
instead of `table1`. What do you need to do first?
If your data is tidy, element-wise execution will ensure that observations are preserved across functions and operations. Each value will only be paired with other values that appear in the same row of the data frame. In a tidy data frame, these values will be values of the same observation.
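Element-wise execution and recycling can be sketched in base R with two short vectors. The numbers below are made up for illustration and are not taken from the sample tables.

```r
# Hypothetical values: three observations of two variables
population <- c(1000, 2000, 4000)
cases <- c(10, 30, 20)

# Division pairs the first elements, then the second, then the third.
# The single value 10000 is recycled to match the length of the other vector.
rate <- cases / population * 10000

# The result has one value per observation, in the original order
n_rates <- length(rate)
```

Because each position in `rate` comes from the same position in `cases` and `population`, the values stay aligned with their observations as long as each row of the table holds one observation.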
## Spreading and gathering
Do these small advantages matter in the long run? Yes. Consider what it would be like to do a simple calculation with each of the datasets from the start of this section.
Now that you understand the basic principles of tidy data, it's time to learn the tools that allow you to transform untidy datasets into tidy datasets.
Assume that in these datasets, `cases` refers to the number of people diagnosed with TB per country per year. To calculate the *rate* of TB cases per country per year (i.e., the number of people per 10,000 diagnosed with TB), you will need to do four operations with the data. You will need to:
The first step to tidying any dataset is to study it and figure out what the variables are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data.
1. Extract the number of TB cases per country per year
2. Extract the population per country per year (in the same order as
above)
3. Divide cases by population
4. Multiply by 10000
One of the most common messy-data problems is that some variables are not stored in columns. One variable might be spread across multiple columns, or you might find that a set of variables is spread over the rows. To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`. But before we can describe how they work, you need to understand the idea of the key-value pair.
If you use basic R syntax, your calculations will look like the code below. If you'd like to brush up on basic R syntax, see Appendix A: Getting Started.
#### Dataset one
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-4.png")
```
Since `table1` is organized in a tidy fashion, you can calculate the rate like this,
```{r eval = FALSE}
# Dataset one
table1$cases / table1$population * 10000
```
#### Dataset two
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-5.png")
```
Dataset two intermingles the values of *population* and *cases* in the same column, *value*. As a result, you will need to untangle the values whenever you want to work with each variable separately.
You'll need to perform an extra step to calculate the rate.
```{r eval = FALSE}
# Dataset two
case_rows <- c(1, 3, 5, 7, 9, 11)
pop_rows <- c(2, 4, 6, 8, 10, 12)
table2$value[case_rows] / table2$value[pop_rows] * 10000
```
#### Dataset three
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-6.png")
```
Dataset three combines the values of cases and population into the same cells. It may seem that this would help you calculate the rate, but that is not so. You will need to separate the population values from the cases values if you wish to do math with them. This can be done, but not with "basic" R syntax.
```{r eval = FALSE}
# Dataset three
# No basic solution
```
#### Dataset four
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-7.png")
```
Dataset four stores the values of each variable in a different format: as a column, a set of column names, or a field of cells. As a result, you will need to work with each variable differently. This makes code written for dataset four hard to generalize. The code that extracts the values of *year*, `names(table4)[-1]`, cannot be generalized to extract the values of population, ``c(table5$`1999`, table5$`2000`)``. Compare this to dataset one. With `table1`, you can use the same code to extract the values of year, `table1$year`, that you use to extract the values of population. To do so, you only need to change the name of the variable that you will access: `table1$population`.
The organization of dataset four is inefficient in a second way as well. Dataset four separates the values of some variables across two separate tables. This is inconvenient because you will need to extract information from two different places whenever you want to work with the data.
After you collect your input, you can calculate the rate.
```{r eval = FALSE}
# Dataset four
cases <- c(table4$`1999`, table4$`2000`)
population <- c(table5$`1999`, table5$`2000`)
cases / population * 10000
```
Dataset one, the tidy dataset, is much easier to work with than datasets two, three, or four. To work with datasets two, three, and four, you need to take extra steps, which makes your code harder to write, harder to understand, and harder to debug.
Keep in mind that this is a trivial calculation with a trivial dataset. The energy you must expend to manage a poor layout will increase with the size of your data. Extra steps will accumulate over the course of an analysis and allow errors to creep into your work. You can avoid these difficulties by converting your data into a tidy format at the start of your analysis.
The next sections will show you how to transform untidy datasets into tidy datasets.
Tidy data was popularized by Hadley Wickham, and it serves as the basis for many R packages and functions. You can learn more about tidy data by reading *Tidy Data*, a paper written by Hadley Wickham and published in the Journal of Statistical Software. *Tidy Data* is available online at [www.jstatsoft.org/v59/i10/paper](http://www.jstatsoft.org/v59/i10/paper).
## `spread()` and `gather()`
The `tidyr` package by Hadley Wickham is designed to help you tidy your data. It contains four functions that alter the layout of tabular datasets, while preserving the values and relationships contained in the datasets.
The two most important functions in `tidyr` are `gather()` and `spread()`. Each relies on the idea of a key-value pair.
### key-value pairs
### Key-value
A key-value pair is a simple way to record information. A pair contains two parts: a *key* that explains what the information describes, and a *value* that contains the actual information. So, for example, this would be a key-value pair:
@ -236,215 +152,398 @@ In `table2`, the `key` column contains only keys (and not just because the colum
You can use the `spread()` function to tidy this layout.
### `spread()`
### Spreading
`spread()` turns a pair of key:value columns into a set of tidy columns. To use `spread()`, pass it the name of a data frame, then the name of the key column in the data frame, and then the name of the value column. Pass the column names as they are; do not use quotes.
To tidy `table2`, you would pass `spread()` the `key` column and then the `value` column.
`spread()` turns a pair of key:value columns into a set of tidy columns. To use `spread()`, pass it a data frame and the pair of key-value columns. This is particularly easy for `table2` because the columns are already called key and value!
```{r}
spread(table2, key, value)
spread(table2, key = key, value = value)
```
`spread()` returns a copy of your dataset that has had the key and value columns removed. In their place, `spread()` adds a new column for each unique key in the key column. These unique keys will form the column names of the new columns. `spread()` distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
You can see that `spread()` maintains each of the relationships expressed in the original dataset. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the original observations.
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-8.png")
```
*`spread()` distributes a pair of key:value columns into a field of cells. The unique keys in the key column become the column names of the field of cells.*
You can see that `spread()` maintains each of the relationships expressed in the original dataset. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the original observations. As a bonus, now the layout of these relationships is tidy.
`spread()` takes three optional arguments in addition to `data`, `key`, and `value`:
- **`fill`** - If the tidy structure creates combinations of variables that do not exist in the original dataset, `spread()` will place an `NA` in the resulting cells. `NA` is R's missing value symbol. You can change this behaviour by passing `fill` an alternative value to use.
- **`convert`** - If a value column contains multiple types of data, its elements will be saved as a single type, usually character strings. As a result, the new columns created by `spread()` will also contain character strings. If you set `convert = TRUE`, `spread()` will run `type.convert()` on each new column, which will convert strings to doubles (numerics), integers, logicals, complexes, or factors if appropriate.
- **`drop`** - The `drop` argument controls how `spread()` handles factors in the key column. If you set `drop = FALSE`, spread will keep factor levels that do not appear in the key column, filling in the missing combinations with the value of `fill`.
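To make the `fill` argument concrete, here is a small sketch with made-up data. The `sales` tibble and its column names are hypothetical; store `"b"` has no row for item `"y"`, so spreading creates a combination that did not exist in the input.

```r
library(tidyr)
library(tibble)

# Hypothetical data: store "b" has no row for item "y"
sales <- tibble(
  store = c("a", "a", "b"),
  item  = c("x", "y", "x"),
  n     = c(1, 2, 3)
)

# By default the missing combination becomes NA
with_na <- spread(sales, key = item, value = n)

# `fill` supplies an alternative value for the missing combination
with_zero <- spread(sales, key = item, value = n, fill = 0)
```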
### `gather()`
`gather()` does the reverse of `spread()`. `gather()` collects a set of column names and places them into a single "key" column. It also collects the field of cells associated with those columns and places them into a single value column. You can use `gather()` to tidy `table4`.
In general, you'll use `spread()` when you have a column that contains variable names, the `key` column, and a column that contains the values of that variable, the `value` column. Here's another simple example:
```{r}
table4 # cases
weather <- frame_data(
  ~ day, ~measurement, ~record,
  "Jan 1", "temp", 31,
  "Jan 1", "precip", 0,
  "Jan 2", "temp", 35,
  "Jan 2", "precip", 5
)
weather %>%
  spread(key = measurement, value = record)
```
To use `gather()`, pass it the name of a data frame to reshape. Then pass `gather()` a character string to use for the name of the "key" column that it will make, as well as a character string to use as the name of the value column that it will make. Finally, specify which columns `gather()` should collapse into the key-value pair (here with integer notation).
The result of `spread()` is a data frame without the `key` and `value` columns that you specified. Instead, it will have one new variable for each unique value in the `key` column.
### Gathering
`gather()` is the opposite of `spread()`. `gather()` collects a set of column names and places them into a single "key" column. It also collects the values associated with those columns and places them into a single value column. Let's use `gather()` to tidy `table4a`.
```{r}
gather(table4, "year", "cases", 2:3)
table4a
```
`gather()` returns a copy of the data frame with the specified columns removed. To this data frame, `gather()` has added two new columns: a "key" column that contains the former column names of the removed columns, and a value column that contains the former values of the removed columns. `gather()` repeats each of the former column names (as well as each of the original columns) to maintain each combination of values that appeared in the original dataset. `gather()` uses the first string that you supplied as the name of the new "key" column, and it uses the second string as the name of the new value column. In our example, these were the strings "year" and "cases."
`gather()` takes a data frame, the names of the new key and value variables to create, and a set of columns to gather:
We've placed "key" in quotation marks because you will usually use `gather()` to create tidy data. In this case, the "key" column will contain values, not keys. The values will only be keys in the sense that they were formerly in the column names, a place where keys belong.
```{r}
table4a %>% gather(key = "year", value = "cases", `1999`:`2000`)
```
Here, the column names (`key`) represent the years, and the cell values (`value`) represent the number of cases. We specify the columns to gather with `dplyr::select()` style notation: use all columns from "1999" to "2000". (Note that these are non-syntactic names so we have to surround them in backticks.) To refresh your memory of the other ways you can select columns, see [select](#select).
`gather()` returns a copy of the data frame with the specified columns removed, and two new columns: a "key" column that contains the former column names of the removed columns, and a value column that contains the former values of the removed columns. `gather()` repeats each of the former column names (as well as each of the original columns) to maintain each combination of values that appeared in the original dataset.
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-9.png")
```
Just like `spread()`, gather maintains each of the relationships in the original dataset. This time `table4` only contained three variables, *country*, *year* and *cases*. Each of these appears in the output of `gather()` in a tidy fashion.
Just like `spread()`, `gather()` maintains each of the relationships in the original dataset. This time `table4a` only contained three variables, *country*, *year* and *cases*. Each of these appears in the output of `gather()` in a tidy fashion. `gather()` also maintains each of the observations in the original dataset, organizing them in a tidy fashion.
`gather()` also maintains each of the observations in the original dataset, organizing them in a tidy fashion.
We can use `gather()` to tidy `table5` in a similar fashion.
We can use `gather()` to tidy `table4b` in a similar fashion. The only difference is the variable stored in the cell values:
```{r}
table5 # population
gather(table5, "year", "population", 2:3)
table4b %>% gather(key = "year", value = "population", `1999`:`2000`)
```
In this code, I identified the columns to collapse with a series of integers. `2:3` describes the second and third columns of the data frame. You can identify the same columns with each of the commands below.
```{r eval = FALSE}
gather(table5, "year", "population", c(2, 3))
gather(table5, "year", "population", -1)
```
You can also identify columns by name with the notation introduced by the `select` function in `dplyr`, see Section 3.1.
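The different ways of naming the columns to gather can be compared side by side. This sketch assumes the `table4a` dataset that ships with tidyr, which has the columns `country`, `1999` and `2000`; the result names are only illustrative.

```r
library(tidyr)

# By name, as a range of non-syntactic column names
by_range <- gather(table4a, key = "year", value = "cases", `1999`:`2000`)

# By position: the second and third columns
by_position <- gather(table4a, key = "year", value = "cases", 2:3)

# By exclusion: everything except `country`
by_exclude <- gather(table4a, key = "year", value = "cases", -country)
```

All three calls select the same two columns, so they should produce the same result. Selecting by name tends to be the most robust choice if the column order might change.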
You can easily combine the new versions of `table4` and `table5` into a single data frame because the new versions are both tidy. To combine the datasets, use the `dplyr::left_join()` function which you'll learn about in [relational data].
It's easy to combine the tidied versions of `table4a` and `table4b` into a single data frame because both are now tidy. We'll use `dplyr::left_join()`, which you'll learn about in [relational data].
```{r}
tidy4 <- gather(table4, "year", "cases", 2:3)
tidy5 <- gather(table5, "year", "population", 2:3)
dplyr::left_join(tidy4, tidy5)
tidy4a <- table4a %>% gather("year", "cases", `1999`:`2000`)
tidy4b <- table4b %>% gather("year", "population", `1999`:`2000`)
left_join(tidy4a, tidy4b)
```
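When no join columns are given, `left_join()` joins on every column name the two tables share and prints a message saying which ones it used. As a sketch, the join can also be written with the key columns spelled out. This assumes the `table4a` and `table4b` datasets from tidyr, and the name `joined` is only illustrative.

```r
library(tidyr)
library(dplyr)

tidy4a <- gather(table4a, key = "year", value = "cases", `1999`:`2000`)
tidy4b <- gather(table4b, key = "year", value = "population", `1999`:`2000`)

# Name the shared key columns explicitly instead of relying on the default
joined <- left_join(tidy4a, tidy4b, by = c("country", "year"))
```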
## `separate()` and `unite()`
### Exercises
You may have noticed that we skipped `table3` in the last section. `table3` is untidy too, but it cannot be tidied with `gather()` or `spread()`. To tidy `table3`, you will need two new functions, `separate()` and `unite()`.
1. Why are `gather()` and `spread()` not perfectly symmetrical?
Carefully consider the following example:
```{r, eval = FALSE}
stocks <- data_frame(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(   1,    2,    1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks %>%
  spread(year, return) %>%
  gather("year", "return", `2015`:`2016`)
```
`separate()` and `unite()` help you split and combine cells to place a single, complete value in each cell.
1. Both `spread()` and `gather()` have a `convert` argument. What does it
do?
### `separate()`
1. Why does spreading this tibble fail?
`separate()` turns a single character column into multiple columns by splitting the values of the column wherever a separator character appears.
```{r}
people <- frame_data(
  ~name,             ~key,     ~value,
  "Phillip Woods",   "age",    45,
  "Phillip Woods",   "height", 186,
  "Phillip Woods",   "age",    50,
  "Jessica Cordero", "age",    37,
  "Jessica Cordero", "height", 156
)
```
1. Tidy the simple tibble below. Do you need to spread or gather it?
What are the variables?
```{r}
preg <- frame_data(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)
```
## Separating and uniting
You may have noticed that we skipped `table3` in the last section. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. In this section, we'll also discuss the inverse of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
### Separate
`separate()` pulls apart one column into multiple variables, by separating wherever a separator character appears.
![](images/tidy-17.png)
So, for example, we can use `separate()` to tidy `table3`, which combines values of *cases* and *population* in the same column.
We need to use `separate()` to tidy `table3`, which combines values of *cases* and *population* in the same column. `separate()` takes a data frame, the name of the column to separate, and the names of the columns to separate into:
```{r}
table3
table3 %>%
  separate(rate, into = c("cases", "population"))
```
To use `separate()` pass separate the name of a data frame to reshape and the name of a column to separate. Also give `separate()` an `into` argument, which should be a vector of character strings to use as new column names. `separate()` will return a copy of the data frame with the untidy column removed. The previous values of the column will be split across several columns, one for each name in `into`.
```{r}
separate(table3, rate, into = c("cases", "population"))
```
By default, `separate()` will split values wherever a non-alphanumeric character appears. Non-alphanumeric characters are characters that are neither a number nor a letter. For example, in the code above, `separate()` split the values of `rate` at the forward slash characters.
If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as:
```{r eval=FALSE}
separate(table3, rate, into = c("cases", "population"), sep = "/")
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
```
Look carefully at the column types: you'll notice that `cases` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, that's not very useful because those really are numbers. We can ask `separate()` to try to convert to better types using `convert = TRUE`:
```{r}
table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)
```
You can also pass an integer or vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings. When using integers to separate strings, the length of `sep` should be one less than the number of names in `into`. You can use this arrangement to separate the last two digits of each year.
```{r}
separate(table3, year, into = c("century", "year"), sep = 2)
table3 %>%
  separate(year, into = c("century", "year"), sep = 2)
```
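To illustrate the rule that `sep` should be one element shorter than `into`, here is a small sketch with a hypothetical `dates` tibble (not one of the tidyr example tables). Two split positions produce three new columns:

```r
library(tidyr)

# Hypothetical data: dates stored as eight-character strings.
dates <- tibble::tibble(date = c("20160725", "19991231"))

# Split after the 4th and 6th characters: two positions, three new columns.
parts <- separate(dates, date, into = c("year", "month", "day"), sep = c(4, 6))
```

The new columns are still character vectors; add `convert = TRUE` if you want tidyr to try converting them to numbers.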
You can further customize `separate()` with the `remove`, `convert`, and `extra` arguments:
### Unite
- **`remove`** - Set `remove = FALSE` to retain the column of values that were separated in the final data frame.
- **`convert`** - By default, `separate()` will return new columns as character columns. Set `convert = TRUE` to convert new columns to double (numeric), integer, logical, complex, and factor columns with `type.convert()`.
- **`extra`** - `extra` controls what happens when the number of new values in a cell does not match the number of new columns in `into`. If `extra = "error"` (the default), `separate()` will return an error. If `extra = "drop"`, `separate()` will drop new values and supply `NA`s as necessary to fill the new columns. If `extra = "merge"`, `separate()` will split at most `length(into)` times.
### `unite()`
`unite()` does the opposite of `separate()`: it combines multiple columns into a single column.
`unite()` does the opposite of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
![](images/tidy-18.png)
We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved as `tidyr::table6`.
We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved as `tidyr::table5`. `unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
```{r}
table6
table5
table5 %>%
unite(new, century, year)
```
Give `unite()` the name of the data frame to reshape, the name of the new column to create (as a character string), and the names of the columns to unite. `unite()` will place an underscore (\_) between values from separate columns. If you would like to use a different separator, or no separator at all, pass the separator as a character string to `sep`.
In this case we also need to use the `sep` argument. By default, `unite()` places an underscore (`_`) between the values from different columns. Here we don't want any separator, so we use `""`:
```{r}
unite(table6, new, century, year, sep = "")
table5 %>%
unite(new, century, year, sep = "")
```
`unite()` returns a copy of the data frame that includes the new column, but not the columns used to build the new column. If you would like to retain these columns, add the argument `remove = FALSE`.
`unite()` returns a copy of the data frame that includes the new column, but not the columns used to build the new column.
You can also use integers or the syntax of the `dplyr::select()` function to specify columns to unite in a more concise way.
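As a sketch of the `select()`-style column specification, the following uses a hypothetical `pieces` tibble rather than the tidyr example tables. It unites a range of adjacent columns and then reverses the operation with `separate()`:

```r
library(tidyr)

# Hypothetical data with the year split into two character columns.
pieces <- tibble::tibble(
  country = c("Afghanistan", "Brazil"),
  century = c("19", "20"),
  year    = c("99", "00")
)

# `century:year` selects every column from `century` to `year`,
# so this is equivalent to unite(pieces, new, century, year, sep = "").
joined <- unite(pieces, new, century:year, sep = "")

# separate() undoes the unite by splitting after the second character.
round_trip <- separate(joined, new, into = c("century", "year"), sep = 2)
```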
### Exercises
1. What do the `extra` and `fill` arguments do in `separate()`?
Experiment with the various options for the following two toy datasets.
```{r, eval = FALSE}
tibble::tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
tibble::tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
```
1. Both `unite()` and `separate()` have a `remove` argument. What does it
do? When would you set it to `FALSE`?
1. Compare and contrast `separate()` and `extract()`. Why are there
three variations of separation, but only one unite?
## Missing values
Changing the representation of a dataset brings up an important fact about missing values. There are two types of missing values:
* __Explicit__ missing values are flagged with `NA`.
* __Implicit__ missing values are simply not present in the data.
Let's illustrate this idea with a very simple data set:
```{r}
stocks <- data_frame(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
```
There are two missing values in this dataset:
* The return for the fourth quarter of 2015 is explicitly missing, because
the cell where its value should be instead contains `NA`.
* The return for the first quarter of 2016 is implicitly missing, because it
simply does not appear in the dataset.
One way to think about the difference is this Zen-like koan: An implicit missing value is the presence of an absence; an explicit missing value is the absence of a presence.
The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:
```{r}
stocks %>%
spread(year, return)
```
Because these explicit missing values may not be important in other representations of the data, you can set `na.rm = TRUE` in `gather()` to turn explicit missing values implicit:
```{r}
stocks %>%
spread(year, return) %>%
gather(year, return, `2015`:`2016`, na.rm = TRUE)
```
Another important tool for making missing values explicit in tidy data is `complete()`:
```{r}
stocks %>%
complete(year, qtr)
```
`complete()` takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
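A minimal sketch of the same idea on a smaller, made-up dataset (`sales` is hypothetical) makes the row counts easy to check by hand. Two stores and two months give four combinations, only three of which are present:

```r
library(tidyr)

# Hypothetical data: store "b" has no row for month 2.
sales <- tibble::tibble(
  store = c("a", "a", "b"),
  month = c(1, 2, 1),
  units = c(5, 3, 4)
)

# Every store/month combination now has a row; the one that was
# implicitly missing gets an explicit NA in `units`.
full <- complete(sales, store, month)
```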
There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
```{r}
treatment <- frame_data(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
```
You can fill in these missing values with `fill()`. It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
```{r}
treatment %>%
fill(person)
```
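To make "last observation carried forward" concrete, here is a rough base R sketch of the idea for a single vector. The helper `locf()` is hypothetical and written only for illustration; it is not how tidyr implements `fill()`:

```r
# Hypothetical helper: carry the last non-missing value forward.
locf <- function(x) {
  # Index of the most recent non-missing element at each position
  # (0 where no value has been seen yet).
  idx <- cummax(seq_along(x) * !is.na(x))
  # Leading missing values have nothing to carry forward, so keep them NA.
  idx[idx == 0] <- NA
  x[idx]
}

person <- c("Derrick Whitmore", NA, NA, "Katherine Burke")
filled <- locf(person)
```

Note that missing values at the start of the vector stay missing, because there is no earlier observation to copy.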
### Exercises
1. Compare and contrast the `fill` arguments to `spread()` and `complete()`.
1. What does the direction argument to `fill()` do?
## Case Study
The `who` dataset in tidyr contains cases of tuberculosis (TB) reported between 1995 and 2013 sorted by country, age, and gender. The data comes in the *2014 World Health Organization Global Tuberculosis Report*, available for download at [www.who.int/tb/country/data/download/en/](http://www.who.int/tb/country/data/download/en/). The data provides a wealth of epidemiological information, but it would be difficult to work with the data as it is.
The `who` dataset in tidyr contains cases of tuberculosis (TB) reported between 1995 and 2013 sorted by country, age, and gender. The data comes in the *2014 World Health Organization Global Tuberculosis Report*, available for download at <www.who.int/tb/country/data/download/en/>. The data provides a wealth of epidemiological information, but it's challenging to work with the data in the form that it's provided:
```{r}
who
```
`who` provides a realistic example of tabular data in the wild. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy.
This is a very typical example of data you are likely to encounter in real life. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy. The most unique feature of `who` is its coding system. Columns five through sixty encode four separate pieces of information in their column names:
------------------------------------------------------------------------
1. The first three letters of each column denote whether the column
contains new or old cases of TB. In this dataset, each column contains
new cases.
*TIP*
1. The next two letters describe the type of case being counted. We will
treat each of these as a separate variable.
* `rel` stands for cases of relapse
* `ep` stands for cases of extrapulmonary TB
* `sn` stands for cases of pulmonary TB that could not be diagnosed by
a pulmonary smear (smear negative)
* `sp` stands for cases of pulmonary TB that could be diagnosed by
a pulmonary smear (smear positive)
The `View()` function opens a data viewer in the RStudio IDE. Here you can examine the dataset, search for values, and filter the display based on logical conditions. Notice that the `View()` function begins with a capital V.
3. The sixth letter describes the sex of TB patients. The dataset groups
cases by males (`m`) and females (`f`).
------------------------------------------------------------------------
4. The remaining numbers describe the age group of TB patients. The dataset
groups cases into seven age groups:
* `014` = 0 -- 14 years old
* `1524` = 15 -- 24 years old
* `2534` = 25 -- 34 years old
* `3544` = 35 -- 44 years old
* `4554` = 45 -- 54 years old
* `5564` = 55 -- 64 years old
* `65` = 65 or older
The most unique feature of `who` is its coding system. Columns five through sixty encode four separate pieces of information in their column names:
The `who` dataset is untidy in multiple ways, so we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll typically need to string together multiple verbs.
1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
2. The next two letters describe the type of case being counted. We will treat each of these as a separate variable.
- `rel` stands for cases of relapse
- `ep` stands for cases of extrapulmonary TB
- `sn` stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
- `sp` stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
3. The sixth letter describes the sex of TB patients. The dataset groups cases by males (`m`) and females (`f`).
4. The remaining numbers describe the age group of TB patients. The dataset groups cases into seven age groups:
- `014` stands for patients that are 0 to 14 years old
- `1524` stands for patients that are 15 to 24 years old
- `2534` stands for patients that are 25 to 34 years old
- `3544` stands for patients that are 35 to 44 years old
- `4554` stands for patients that are 45 to 54 years old
- `5564` stands for patients that are 55 to 64 years old
- `65` stands for patients that are 65 years old or older
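The coding scheme above can be sketched in base R by decoding a few column names by hand. This is only an illustration of how the pieces fit together, using three well-formed codes chosen for the example; the name `decoded` is hypothetical:

```r
# Three example column names that follow the new_<type>_<sex><age> pattern.
codes <- c("new_sp_m014", "new_ep_f65", "new_sn_m2534")

# Split each code at the underscores: "new", the case type, and sex + age.
parts  <- strsplit(codes, "_", fixed = TRUE)
sexage <- vapply(parts, function(p) p[3], character(1))

decoded <- data.frame(
  type = vapply(parts, function(p) p[2], character(1)),
  sex  = substr(sexage, 1, 1),   # first character: m or f
  age  = substring(sexage, 2),   # remaining characters: the age group
  stringsAsFactors = FALSE
)
```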
Notice that the `who` dataset is untidy in multiple ways. First, the data appears to contain values in its column names, coded values such as male, relapse, and 0 - 14 years of age. We can move the names into their own column with `gather()`. This will make it easy to separate the values combined in each column name.
Let's start by gathering the columns that are not variables. This is almost always the best place to start when tidying a new dataset. Here we'll use `na.rm` just so we can focus on the values that are present, not the many missing values.
```{r}
who <- gather(who, code, value, 5:60)
who
who1 <- who %>% gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE)
who1
```
We need to make a minor fix to the format of the column names: unfortunately the names are inconsistent because instead of `new_rel_` we have `newrel` (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the string "newrel" with "new_rel". This makes all variable names consistent.
```{r}
who2 <- who1 %>% mutate(code = stringr::str_replace(code, "newrel", "new_rel"))
who2
```
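If you want to see why this fix matters without loading stringr, the same replacement can be sketched with base R's `sub()`, which, like `str_replace()`, replaces the first match in each string. The two codes below are examples picked for illustration:

```r
codes <- c("new_sp_m014", "newrel_f65")

# Replace the literal text "newrel" with "new_rel".
fixed <- sub("newrel", "new_rel", codes, fixed = TRUE)

# Before the fix, the second code has only two underscore-separated pieces;
# afterwards every code has three.
pieces_before <- lengths(strsplit(codes, "_", fixed = TRUE))
pieces_after  <- lengths(strsplit(fixed, "_", fixed = TRUE))
```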
We can separate the values in each code with two passes of `separate()`. The first pass will split the codes at each underscore.
```{r}
who <- separate(who, code, c("new", "var", "sexage"))
who
who3 <- who2 %>% separate(code, c("new", "type", "sexage"), sep = "_")
who3
```
Then we might as well drop the `new` column because it's constant in this dataset:
```{r}
who3 %>% count(new)
who4 <- who3 %>% select(-new)
```
The second pass will split `sexage` after the first character to create two columns, a sex column and an age column.
```{r}
who <- separate(who, sexage, c("sex", "age"), sep = 1)
who
who5 <- who4 %>% separate(sexage, c("sex", "age"), sep = 1)
who5
```
The `rel`, `ep`, `sn`, and `sp` keys are all contained in the same column. We can now move the keys into their own column names with `spread()`.
```{r}
who <- spread(who, var, value)
who
who6 <- who5 %>% spread(type, value)
who6
```
The `who` dataset is now tidy. It is far from sparkling (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R.
The `who` dataset is now tidy. It is far from clean (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R.
Typically you wouldn't assign each step to a new variable. Instead you'd join everything together in one big pipeline:
```{r}
who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new) %>%
separate(sexage, c("sex", "age"), sep = 1) %>%
spread(var, value)
```
### Exercises
1. In this case study I set `na.rm = TRUE` just to make it easier to
check that we had the correct values. Is this reasonable? Think about
how missing values are represented in this dataset. What's the difference
between an `NA` and zero? Do you think we should use `fill = 0` in
the final `spread()` step?
1. What happens if you neglect the `mutate()` step? How might you use the
`fill` argument to `gather()`?
1. Compute the total number of cases of TB across all four diagnosis methods.
You can perform the computation either before or after the final
`spread()`. What are the advantages and disadvantages of each location?
## Non-tidy data
Before you go on further, it's worth talking a little bit about non-tidy data. Early in the chapter, I used the pejorative term "messy" to refer to non-tidy data. But that is an oversimplification: there are lots of useful and well founded data structures that are not tidy data.
There are two main reasons to use other data structures:
* Alternative, non-tidy representations may have substantial performance
or memory advantages.
* Specialised fields have evolved their own conventions for storing data
that may be quite different to the conventions of tidy data.
Generally, however, these reasons will require the use of something other than a tibble or a data frame. If your data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice. But there are good reasons to use other structures; tidy data is not the only way.
If you'd like to learn more about non-tidy data, I'd highly recommend this thoughtful blog post by Jeff Leek: <http://simplystatistics.org/2016/02/17/non-tidy-data/>

View File

@ -245,7 +245,7 @@ arrange(df, desc(x))
1. Which flights travelled the longest? Which travelled the shortest?
## Select columns with `select()`
## Select columns with `select()` {#select}
It's not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you're actually interested in. `select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
@ -355,7 +355,7 @@ transmute(flights,
)
```
### Useful functions
### Useful creation functions
There are many functions for creating new variables that you can use with `mutate()`. The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output. There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
@ -655,7 +655,7 @@ batters %>% arrange(desc(ba))
You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
### Other summary functions
### Useful summary functions
Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions: