# Tidy data

## Introduction

> "Happy families are all alike; every unhappy family is unhappy in its
> own way." -- Leo Tolstoy

> "Tidy datasets are all alike, but every messy dataset is messy in its
> own way." -- Hadley Wickham

In this chapter, you will learn a consistent way to organise your data in R, an organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

This chapter will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package. If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.

### Prerequisites

In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. We'll also need to use a little dplyr, as is common when tidying data.

```{r setup, message = FALSE}
library(tidyr)
library(dplyr)
```

## Tidy data

You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables, *country*, *year*, *population*, and *cases*, but each dataset organises the values in a different way.

```{r}
table1
table2
table3

# Spread across two tibbles
table4a  # cases
table4b  # population
```

These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier to work with inside the tidyverse. There are three interrelated rules which make a dataset tidy:

1.  Each variable has its own column.
1.  Each observation has its own row.
1.  Each value has its own cell.

Figure \@ref(fig:tidy-structure) shows this visually.

```{r tidy-structure, echo = FALSE, out.width = "100%", fig.cap = "Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells."}
knitr::include_graphics("images/tidy-1.png")
```

These three rules are interrelated because it's impossible to satisfy only two of the three. That interrelationship leads to an even simpler set of practical instructions:

1.  Put each dataset in a tibble.
1.  Put each variable in a column.

In this example, only `table1` is tidy. It's the only representation where each column is a variable.
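
One quick, informal check of the "each observation has its own row" rule is to confirm that no combination of the identifying variables appears more than once. The sketch below assumes that `country` and `year` together identify an observation, which holds for these example tables but is something you would need to verify for your own data.

```{r}
library(tidyr)
library(dplyr)

# In table1, no country-year pair should appear more than once
dup_table1 <- table1 %>% 
  count(country, year) %>% 
  filter(n > 1)
dup_table1

# In table2, every pair is repeated, because each observation
# is split over two rows
dup_table2 <- table2 %>% 
  count(country, year) %>% 
  filter(n > 1)
dup_table2
```

An empty result for `table1` is consistent with it being tidy; it does not prove it, because the check says nothing about whether each column is a variable.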

Why ensure that your data is tidy? There are two main advantages:

1.  There's a general advantage to picking one consistent way of storing
    data. If you have a consistent data structure, it's easier to learn the
    tools that work with it because they have an underlying uniformity.
    
1.  There's a specific advantage to placing variables in columns because
    it allows R's vectorised nature to shine. As you learned in
    [mutate](#mutate-funs) and [summary functions](#summary-funs), most 
    built-in R functions work with vectors of values. That makes transforming 
    tidy data feel particularly natural.

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with `table1`. Think about how you'd achieve the same result with the other representations.

```{r}
# Compute rate 
table1 %>% 
  mutate(rate = cases / population * 10000)

# Compute cases per year
table1 %>% 
  count(year, wt = cases)

# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))
```

### Exercises

1.  Using prose, describe how the variables and observations are organised in
    each of the sample tables.

1.  Compute the `rate` for `table2`, and `table4a` + `table4b`. 
    You will need to perform four operations:

    1.  Extract the number of TB cases per country per year.
    1.  Extract the matching population per country per year.
    1.  Divide cases by population, and multiply by 10000.
    1.  Store back in the appropriate place.
    
    Which representation is easiest to work with? Which is hardest? Why?

1.  Recreate the plot showing change in cases over time using `table2`
    instead of `table1`. What do you need to do first?

## Spreading and gathering

The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, while the principles are obvious in hindsight, it took Hadley over 5 years of struggling with many datasets to figure out these very simple principles. Most datasets that you will encounter in real life will not be tidy, either because the creator was not aware of the principles of tidy data, or because the data is stored in order to make data entry, not data analysis, easy.

The first step to tidying any dataset is to study it and figure out what the variables are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data. Once you've identified the variables, you can start wrangling the data so each variable forms a column.

There are two common problems that it's best to solve first:

1. One variable might be spread across multiple columns.

1. One observation might be spread across multiple rows.

To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`.

### Gathering

A common problem is a dataset where some of the column names are not names of a variable, but _values_ of a variable. Take `table4a`: the column names `1999` and `2000` represent values of the `year` variable. Each row represents two observations, not one.

```{r}
table4a
```

To tidy a dataset like this, we need to __gather__ those columns into a new pair of variables. To describe that operation we need three parameters:

* The set of columns that represent values, not variables. In this example, 
  those are the columns `1999` and `2000`.

* The name of the variable whose values form the column names. I call that
  the `key`, and here it is `year`.

* The name of the variable whose values are spread over the cells. I call 
  that `value`, and here it's the number of `cases`.
  
Together those parameters generate the call to `gather()`:

```{r}
table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
```

The columns to gather are specified with `dplyr::select()` style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see [select](#select).
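
As a sketch of that notation, the calls below select the same two columns in three ways: by name, as a range, and by excluding the one column we don't want to gather. For `table4a` all three should give the same result; the object names are only labels for this example.

```{r}
library(tidyr)
library(dplyr)

# List the columns individually
by_name <- table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")

# Select a range of adjacent columns
by_range <- table4a %>% 
  gather(`1999`:`2000`, key = "year", value = "cases")

# Gather everything except country
by_drop <- table4a %>% 
  gather(-country, key = "year", value = "cases")

all.equal(by_name, by_range)
all.equal(by_name, by_drop)
```

The exclusion form is convenient when there are many columns to gather and only a few to keep.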

In the final result, the gathered columns are dropped, and we get new `key` and `value` columns. Otherwise, the relationships between the original variables are preserved. Visually, this is shown in Figure \@ref(fig:tidy-gather). 

```{r tidy-gather, echo = FALSE, out.width = "100%", fig.cap = "Gathering `table4` into a tidy form."}
knitr::include_graphics("images/tidy-9.png")
```

We can use `gather()` to tidy `table4b` in a similar fashion. The only difference is the variable stored in the cell values:

```{r}
table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")
```

To combine the tidied versions of `table4a` and `table4b` into a single tibble, we need to use `dplyr::left_join()`, which you'll learn about in [relational data].

```{r}
tidy4a <- table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
```
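
Called without a `by` argument, `left_join()` joins on every column the two tibbles share and prints a message saying which ones it used. A slightly more defensive variant names the join columns explicitly. Choosing `country` and `year` as the keys is our reading of the data, not something the function can check for us.

```{r}
library(tidyr)
library(dplyr)

tidy4a <- table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")

# Name the key columns so the join does not rely on shared column names
tidy4 <- left_join(tidy4a, tidy4b, by = c("country", "year"))
tidy4
```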

### Spreading

Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows. For example, take `table2`: an observation is a country in a year, but each observation is spread across two rows.

```{r}
table2
```

To tidy this up, we first analyse the representation in a similar way to `gather()`. This time, however, we only need two parameters:

* The column that contains variable names, the `key` column. Here, it's 
  `type`.

* The column that contains values from multiple variables, the `value`
  column. Here it's `count`.

Once we've figured that out, we can use `spread()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-spread).

```{r}
spread(table2, key = type, value = count)
```

```{r tidy-spread, echo = FALSE, out.width = "100%", fig.cap = "Spreading `table2` makes it tidy"}
knitr::include_graphics("images/tidy-8.png")
```

As you might have guessed from the common `key` and `value` arguments, `spread()` and `gather()` are complements. `gather()` makes wide tables narrower and longer; `spread()` makes long tables shorter and wider.
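
The change in shape is easiest to see on a table with more than two columns to gather. The tibble below is made up for illustration (`id`, `a`, `b`, and `c` are hypothetical names); the sketch compares dimensions before and after each operation.

```{r}
library(tidyr)
library(dplyr)

# A small, made-up wide table: 2 rows, 4 columns
wide <- tibble(id = 1:2, a = c(1, 2), b = c(3, 4), c = c(5, 6))
dim(wide)
#> [1] 2 4

# Gathering three columns gives a longer, narrower table
long <- wide %>% 
  gather(a, b, c, key = "key", value = "value")
dim(long)
#> [1] 6 3

# Spreading brings back the original shape
wide_again <- long %>% 
  spread(key = key, value = value)
dim(wide_again)
#> [1] 2 4
```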

### Exercises

1.  Why are `gather()` and `spread()` not perfectly symmetrical?  
    Carefully consider the following example:
    
    ```{r, eval = FALSE}
    stocks <- tibble(
      year   = c(2015, 2015, 2016, 2016),
      half  = c(   1,    2,     1,    2),
      return = c(1.88, 0.59, 0.92, 0.17)
    )
    stocks %>% 
      spread(year, return) %>% 
      gather("year", "return", `2015`:`2016`)
    ```
    
    (Hint: look at the variable types and think about column _names_.)
    
    Both `spread()` and `gather()` have a `convert` argument. What does it 
    do?

1.  Why does this code fail?

    ```{r, error = TRUE}
    table4a %>% 
      gather(1999, 2000, key = "year", value = "cases")
    ```

1.  Why does spreading this tibble fail?

    ```{r}
    people <- tribble(
      ~name,             ~key,    ~value,
      #-----------------|--------|------
      "Phillip Woods",   "age",       45,
      "Phillip Woods",   "height",   186,
      "Phillip Woods",   "age",       50,
      "Jessica Cordero", "age",       37,
      "Jessica Cordero", "height",   156
    )
    ```

1.  Tidy the simple tibble below. Do you need to spread or gather it?
    What are the variables?

    ```{r}
    preg <- tribble(
      ~pregnant, ~male, ~female,
      "yes",     NA,    10,
      "no",      20,    12
    )
    ```

## Separating and uniting

So far you've learned how to tidy `table2` and `table4`, but not `table3`. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. You'll also learn about the inverse of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.

### Separate

`separate()` pulls apart one column into multiple variables, by separating wherever a separator character appears. Take `table3`:

```{r}
table3
```

The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables. `separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.

```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"))
```

```{r tidy-separate, echo = FALSE, out.width = "100%", fig.cap = "Separating `table3` makes it tidy"}
knitr::include_graphics("images/tidy-17.png")
```

By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as:

```{r eval = FALSE}
table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")
```

(Formally, `sep` is a regular expression, which you'll learn more about in [strings].)
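
To make the regular-expression point concrete: as far as I know, the documented default for `sep` is the pattern `"[^[:alnum:]]+"`, that is, one or more non-alphanumeric characters. Passing that pattern yourself should therefore reproduce the default behaviour. Treat this as a sketch and check `?separate` for the version of tidyr you have installed.

```{r}
library(tidyr)
library(dplyr)

# Rely on the default separator
default_split <- table3 %>% 
  separate(rate, into = c("cases", "population"))

# Spell out the pattern that the default is documented to use
regex_split <- table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "[^[:alnum:]]+")

all.equal(default_split, regex_split)
```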

Look carefully at the column types: you'll notice that `cases` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, that's not very useful because those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:

```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
```

You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. Positive values start at 1 on the far left of the strings; negative values start at -1 on the far right of the strings. When using integers to separate strings, the length of `sep` should be one less than the number of names in `into`. You can use this arrangement to separate the last two digits of each year.

```{r}
table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
```
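
Because negative positions count from the right, `sep = -2` should split four-digit years at the same place as `sep = 2`. This sketch assumes a reasonably recent version of tidyr; the handling of negative positions may differ in older releases, so check the result on your own installation.

```{r}
library(tidyr)
library(dplyr)

# Split after the second character, counting from the left
from_left <- table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)

# Split before the last two characters, counting from the right
from_right <- table3 %>% 
  separate(year, into = c("century", "year"), sep = -2)

all.equal(from_left, from_right)
```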

### Unite

`unite()` is the inverse of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.

![](images/tidy-18.png)

We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved as `tidyr::table5`. `unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:

```{r}
table5
table5 %>% 
  unite(new, century, year)
```

In this case we also need to use the `sep` argument. The default will place an underscore (`_`) between the values from different columns. Here we don't want any separator so we use `""`:

```{r}
table5 %>% 
  unite(new, century, year, sep = "")
```
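
Note that the united column is a character vector, because `unite()` pastes the values together as strings. If you need a number, one option (a sketch, not the only way) is to convert the column afterwards with `mutate()`:

```{r}
library(tidyr)
library(dplyr)

united <- table5 %>% 
  unite(new, century, year, sep = "") %>% 
  # Convert the pasted string, e.g. "1999", to an integer
  mutate(new = as.integer(new))
united
```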

### Exercises

1.  What do the `extra` and `fill` arguments do in `separate()`? 
    Experiment with the various options for the following two toy datasets.
    
    ```{r, eval = FALSE}
    tibble::tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
      separate(x, c("one", "two", "three"))
    
    tibble::tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
     separate(x, c("one", "two", "three"))
    ```

1.  Both `unite()` and `separate()` have a `remove` argument. What does it
    do? Why would you set it to `FALSE`?

1.  Compare and contrast `separate()` and `extract()`.  Why are there
    three variations of separation, but only one unite?

## Missing values

Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

* __Explicitly__, i.e. flagged with `NA`.
* __Implicitly__, i.e. simply not present in the data.

Let's illustrate this idea with a very simple data set:

```{r}
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
```

There are two missing values in this dataset:

* The return for the fourth quarter of 2015 is explicitly missing, because
  the cell where its value should be instead contains `NA`.
  
* The return for the first quarter of 2016 is implicitly missing, because it
  simply does not appear in the dataset.
  
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.
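
One rough way to count the two kinds of missingness, sketched here under the assumption that every year should have a row for every quarter, is to compare the number of rows you have with the number you would expect:

```{r}
library(dplyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Explicit missing values: cells that contain NA
n_explicit <- sum(is.na(stocks$return))
n_explicit
#> [1] 1

# Implicit missing values: expected year-quarter combinations with no row
n_expected <- n_distinct(stocks$year) * n_distinct(stocks$qtr)
n_implicit <- n_expected - nrow(stocks)
n_implicit
#> [1] 1
```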

The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:

```{r}
stocks %>% 
  spread(year, return)
```

Because these explicit missing values may not be important in other representations of the data, you can set `na.rm = TRUE` in `gather()` to turn explicit missing values implicit:

```{r}
stocks %>% 
  spread(year, return) %>% 
  gather(year, return, `2015`:`2016`, na.rm = TRUE)
```

Another important tool for making missing values explicit in tidy data is `complete()`:

```{r}
stocks %>% 
  complete(year, qtr)
```

`complete()` takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
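
Conceptually, you can think of `complete()` as building the full grid of combinations and then joining the original data onto it. The sketch below mimics that with `tidyr::expand()` and `dplyr::left_join()`. It is meant to illustrate the idea on this small example, not to describe how `complete()` is implemented.

```{r}
library(tidyr)
library(dplyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

completed <- stocks %>% 
  complete(year, qtr)

# Build every year-quarter combination, then attach the known returns
by_hand <- stocks %>% 
  tidyr::expand(year, qtr) %>% 
  left_join(stocks, by = c("year", "qtr"))

all.equal(completed, by_hand)
```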

There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

```{r}
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
```

You can fill in these missing values with `fill()`. It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

```{r}
treatment %>% 
  fill(person)
```

### Exercises

1.  Compare and contrast the `fill` arguments to `spread()` and `complete()`. 

1.  What does the direction argument to `fill()` do?

## Case Study

To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem. The `tidyr::who` dataset contains reported tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at <http://www.who.int/tb/country/data/download/en/>. 

There's a wealth of epidemiological information in this dataset, but it's challenging to work with the data in the form that it's provided:

```{r}
who
```

This is a very typical example of data you are likely to encounter in real life. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy, and we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll usually need to string together multiple verbs into a pipeline. 

The best place to start is almost always to gather together the columns that are not variables. Let's have a look at what we've got: 

* It looks like `country`, `iso2`, and `iso3` are three variables that 
  redundantly specify the country.
  
* `year` is clearly also a variable.

* We don't know what all the other columns are yet, but given the structure 
  in the variable names (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`) 
  these are likely to be values, not variables.

So we need to gather together all the columns from `new_sp_m014` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells represent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.

```{r}
who1 <- who %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
```

We can get some hint of the structure of the values in the new `key` column:

```{r}
who1 %>% 
  count(key)
```

You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:

1.  The first three letters of each column denote whether the column 
    contains new or old cases of TB. In this dataset, each column contains 
    new cases.

1.  The next two letters describe the type of TB:
    
    *   `rel` stands for cases of relapse
    *   `ep` stands for cases of extrapulmonary TB
    *   `sn` stands for cases of pulmonary TB that could not be diagnosed by 
        a pulmonary smear (smear negative)
    *   `sp` stands for cases of pulmonary TB that could be diagnosed by 
        a pulmonary smear (smear positive)

3.  The sixth letter gives the sex of TB patients. The dataset groups 
    cases by males (`m`) and females (`f`).

4.  The remaining numbers give the age group. The dataset groups cases into 
    seven age groups:
    
    * `014` = 0 -- 14 years old
    * `1524` = 15 -- 24 years old
    * `2534` = 25 -- 34 years old
    * `3544` = 35 -- 44 years old
    * `4554` = 45 -- 54 years old
    * `5564` = 55 -- 64 years old
    * `65` = 65 or older

We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of `new_rel_` we have `newrel` (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the string "newrel" with "new_rel". This makes all variable names consistent.

```{r}
who2 <- who1 %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
```
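
On a small character vector (two codes picked for illustration), the replacement looks like this. Codes that don't contain `"newrel"` are left unchanged:

```{r}
codes <- c("newrel_m014", "new_sp_m014")
fixed <- stringr::str_replace(codes, "newrel", "new_rel")
fixed
#> [1] "new_rel_m014" "new_sp_m014"
```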

We can separate the values in each code with two passes of `separate()`. The first pass will split the codes at each underscore.

```{r}
who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
```

Then we might as well drop the `new` column because it's constant in this dataset. While we're dropping columns, let's also drop `iso2` and `iso3` since they're redundant.

```{r}
who3 %>% 
  count(new)
who4 <- who3 %>% 
  select(-new, -iso2, -iso3)
```

Next we'll separate `sexage` into `sex` and `age` by splitting after the first character:

```{r}
who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who5
```

The `who` dataset is now tidy! It is far from clean (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R. 

I've shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn't how you'd work interactively. Instead, you'd gradually build up a complex pipe:

```{r}
who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
```
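
A simple sanity check on a pipeline like this is to confirm that the new columns contain only the codes the data dictionary describes. The sketch below saves the result under the name `who_tidy` (a name chosen here for illustration) and counts the distinct values in each new column:

```{r}
library(tidyr)
library(dplyr)

who_tidy <- who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

# Each of these should list only the codes from the data dictionary
who_tidy %>% count(var)
who_tidy %>% count(sex)
who_tidy %>% count(age)
```

If an unexpected code shows up, or `separate()` warns about missing or extra pieces, that usually points to an inconsistency in the original column names.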

### Exercises

1.  In this case study I set `na.rm = TRUE` just to make it easier to
    check that we had the correct values. Is this reasonable? Think about
    how missing values are represented in this dataset. Are there implicit
    missing values? What's the difference between an `NA` and zero? 

1.  What happens if you neglect the `mutate()` step?

1.  I claimed that `iso2` and `iso3` were redundant with `country`. 
    Confirm my claim by creating a table that uniquely maps from `country`
    to `iso2` and `iso3`.

1.  For each country, year, and sex compute the total number of cases of 
    TB. Make an informative visualisation of the data.

## Non-tidy data

Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well-founded data structures that are not tidy data. There are two main reasons to use other data structures:

* Alternative representations may have substantial performance or space 
  advantages.
  
* Specialised fields have evolved their own conventions for storing data
  that may be quite different to the conventions of tidy data.

Either of these reasons means you'll need something other than a tibble (or data frame). If your data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice. But there are good reasons to use other structures; tidy data is not the only way.

If you'd like to learn more about non-tidy data, I'd highly recommend this thoughtful blog post by Jeff Leek: <http://simplystatistics.org/2016/02/17/non-tidy-data/>
-												Need heading

											
										
										
											2015-07-29 03:21:36 +08:00
+								# Tidy data
-												Add missing intro subheads

											
										
										
											2016-07-24 22:16:08 +08:00
+								## Introduction
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								> "Happy families are all alike; every unhappy family is unhappy in its
 								> own way." --– Leo Tolstoy
-												Fix typos (#146)


											
										
										
											2016-07-23 00:24:48 +08:00
+								> "Tidy datasets are all alike, but every messy dataset is messy in its
-												Consistent quote form

											
										
										
											2016-07-12 06:32:36 +08:00
+								> own way." --– Hadley Wickham
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Final tidy polishing

											
										
										
											2016-07-27 21:23:28 +08:00
+								In this chapter, you will learn a consistent way to organise your data in R, a organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long-term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								This chapter will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package. If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Consistent chapter intro layout

											
										
										
											2016-07-19 21:01:50 +08:00
+								### Prerequisites
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidying the tidy chapter

											
										
										
											2016-07-27 06:44:17 +08:00
+								In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. We'll also need to use a little dplyr, as is common when tidying data.
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
-												Final tidy polishing

											
										
										
											2016-07-27 21:23:28 +08:00
+								```{r setup, message = FALSE}
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								library(tidyr)
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								library(dplyr)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Start moving towards Hadley style

											
										
										
											2015-07-29 03:15:28 +08:00
+								## Tidy data
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Update tidy.Rmd (#220)

UK English
											
										
										
											2016-08-02 22:13:31 +08:00
+								You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organises the values in different way.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
 								table1
 								table2
 								table3
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								# Spread across two tibbles
 								table4a  # cases
 								table4b  # population
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier work with inside the tidyverse. There are three interrelated rules which make a dataset tidy:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  Each variable has its own column.
 .  Each observation has its own row.
 .  Each value has its own cell.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Final tidy polishing

											
										
										
											2016-07-27 21:23:28 +08:00
+								Figure \@ref(fig:tidy-structure) shows this visually.
 								```{r tidy-structure, echo = FALSE, out.width = "100%", fig.cap = "Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells."}
-												Local bookdown working

											
										
										
											2015-12-12 03:28:10 +08:00
+								knitr::include_graphics("images/tidy-1.png")
 								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								These three rules are interrelated because it's impossible to only satisfy two of the three rules. That interrelationship leads to even simpler set of practical instructions:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  Put each dataset in a tibble.
 .  Put each variable in a column.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Replace `tabble1` with `table1` in Chapter 9 (#213)


											
										
										
											2016-08-01 19:45:15 +08:00
+								In this example, only `table1` is tidy. It's the only representation where each column is a variable.
-												Final tidy polishing

											
										
										
											2016-07-27 21:23:28 +08:00
Why ensure that your data is tidy? There are two main advantages:

1.  There's a general advantage to picking one consistent way of storing
    data. If you have a consistent data structure, it's easier to learn the
    tools that work with it because they have an underlying uniformity.

1.  There's a specific advantage to placing variables in columns because
    it allows R's vectorised nature to shine. As you learned in
    [mutate](#mutate-funs) and [summary functions](#summary-funs), most
    built-in R functions work with vectors of values. That makes transforming
    tidy data feel particularly natural.

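
To make the second advantage concrete, here is a minimal sketch of vectorised column arithmetic on `table1`. The name `rate_per_10k` is made up for this illustration and is not used elsewhere in the chapter.

```{r}
library(tidyr)  # provides the example tibble `table1`

# Each column of a tidy tibble is a vector, so a single arithmetic
# expression operates on every observation at once; no loop is needed.
# (`rate_per_10k` is an illustrative name.)
rate_per_10k <- table1$cases / table1$population * 10000

# There is one result per row of table1
length(rate_per_10k) == nrow(table1)
```

Because variables sit in columns, the same expression would work unchanged if `table1` had six rows or six million.
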
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with `table1`. Think about how you'd achieve the same result with the other representations.

```{r}
# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)

# Compute cases per year
table1 %>%
  count(year, wt = cases)

# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country))
```

### Exercises

1.  Using prose, describe how the variables and observations are organised in
    each of the sample tables.

1.  Compute the `rate` for `table2`, and `table4a` + `table4b`.
    You will need to perform four operations:

    1.  Extract the number of TB cases per country per year.
    1.  Extract the matching population per country per year.
    1.  Divide cases by population, and multiply by 10000.
    1.  Store back in the appropriate place.

    Which representation is easiest to work with? Which is hardest? Why?

1.  Recreate the plot showing change in cases over time using `table2`
    instead of `table1`. What do you need to do first?

## Spreading and gathering

The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, while the principles are obvious in hindsight, it took Hadley over 5 years of struggling with many datasets to figure out these very simple principles. Most datasets that you will encounter in real life will not be tidy, either because the creator was not aware of the principles of tidy data, or because the data is stored in order to make data entry, not data analysis, easy.

The first step to tidying any dataset is to study it and figure out what the variables are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data. Once you've identified the variables, you can start wrangling the data so each variable forms a column.

There are two common problems that it's best to solve first:

1. One variable might be spread across multiple columns.
1. One observation might be spread across multiple rows.

To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`.

### Gathering

A common problem is a dataset where some of the column names are not names of a variable, but _values_ of a variable. Take `table4a`: the column names `1999` and `2000` represent values of the `year` variable. Each row represents two observations, not one.

```{r}
table4a
```

To tidy a dataset like this, we need to __gather__ those columns into a new pair of variables. To describe that operation we need three parameters:

* The set of columns that represent values, not variables. In this example,
  those are the columns `1999` and `2000`.

* The name of the variable whose values form the column names. I call that
  the `key`, and here it is `year`.

* The name of the variable whose values are spread over the cells. I call
  that `value`, and here it's the number of `cases`.

Together those parameters generate the call to `gather()`:

```{r}
table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
```

The columns to gather are specified with `dplyr::select()` style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see [select](#select).

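
As a rough illustration of that `select()` style notation, the sketch below gathers the same two columns in two other ways: as a range of adjacent columns, and as "everything except `country`". The names `by_range` and `by_exclusion` are made up for this example.

```{r}
library(tidyr)

# A range of adjacent columns (backticks are still needed because the
# names are non-syntactic)
by_range <- table4a %>%
  gather(`1999`:`2000`, key = "year", value = "cases")

# Everything except the identifying column
by_exclusion <- table4a %>%
  gather(-country, key = "year", value = "cases")

# Both forms should select the same columns and give the same result
identical(by_range, by_exclusion)
```

The exclusion form tends to be convenient when a table has many value columns and only a few identifying ones.
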
In the final result, the gathered columns are dropped, and we get new `key` and `value` columns. Otherwise, the relationships between the original variables are preserved. Visually, this is shown in Figure \@ref(fig:tidy-gather).

```{r tidy-gather, echo = FALSE, out.width = "100%", fig.cap = "Gathering `table4a` into a tidy form."}
knitr::include_graphics("images/tidy-9.png")
```

We can use `gather()` to tidy `table4b` in a similar fashion. The only difference is the variable stored in the cell values:

```{r}
table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")
```

To combine the tidied versions of `table4a` and `table4b` into a single tibble, we need to use `dplyr::left_join()`, which you'll learn about in [relational data].

```{r}
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
```

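
When `by` is not supplied, `left_join()` matches on every column the two tibbles share and prints a message saying which ones it used. The sketch below spells the keys out instead. The result name `joined` is only for illustration.

```{r}
library(tidyr)
library(dplyr)

tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")

# Naming the key columns makes the join explicit: rows are matched on
# country and year, and nothing else.
joined <- left_join(tidy4a, tidy4b, by = c("country", "year"))
joined
```

Being explicit about the keys is a small safeguard: if the two tibbles ever gained another column with a shared name, the join would not silently start matching on it.
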
### Spreading

Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows. For example, take `table2`: an observation is a country in a year, but each observation is spread across two rows.

```{r}
table2
```

To tidy this up, we first analyse the representation in a similar way to `gather()`. This time, however, we only need two parameters:

* The column that contains variable names, the `key` column. Here, it's
  `type`.

* The column that contains values from multiple variables, the `value`
  column. Here it's `count`.

Once we've figured that out, we can use `spread()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-spread).

```{r}
spread(table2, key = type, value = count)
```

```{r tidy-spread, echo = FALSE, out.width = "100%", fig.cap = "Spreading `table2` makes it tidy"}
knitr::include_graphics("images/tidy-8.png")
```

As you might have guessed from the common `key` and `value` arguments, `spread()` and `gather()` are complements. `gather()` makes wide tables narrower and longer; `spread()` makes long tables shorter and wider.

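
A small sketch of that complementary relationship, starting from the tidy `table1`: gathering its two measurement columns should produce the long layout of `table2`, and spreading that result should bring the wide layout back. The key and value names `type` and `count` are chosen here only to mirror `table2`, and `long` is an illustrative name.

```{r}
library(tidyr)
library(dplyr)

# Wide to long: cases and population become key-value pairs
long <- table1 %>%
  gather(cases, population, key = "type", value = "count") %>%
  arrange(country, year, type)
long

# Long to wide: one column per variable again
long %>%
  spread(key = type, value = count)
```
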
### Exercises

1.  Why are `gather()` and `spread()` not perfectly symmetrical?
    Carefully consider the following example:

    ```{r, eval = FALSE}
    stocks <- data_frame(
      year   = c(2015, 2015, 2016, 2016),
      half   = c(   1,    2,    1,    2),
      return = c(1.88, 0.59, 0.92, 0.17)
    )
    stocks %>%
      spread(year, return) %>%
      gather("year", "return", `2015`:`2016`)
    ```

    (Hint: look at the variable types and think about column _names_.)

    Both `spread()` and `gather()` have a `convert` argument. What does it
    do?

1.  Why does this code fail?

    ```{r, error = TRUE}
    table4a %>%
      gather(1999, 2000, key = "year", value = "cases")
    ```

1.  Why does spreading this tibble fail?

    ```{r}
    people <- frame_data(
      ~name,             ~key,    ~value,
      #-----------------|--------|------
      "Phillip Woods",   "age",       45,
      "Phillip Woods",   "height",   186,
      "Phillip Woods",   "age",       50,
      "Jessica Cordero", "age",       37,
      "Jessica Cordero", "height",   156
    )
    ```

1.  Tidy the simple tibble below. Do you need to spread or gather it?
    What are the variables?

    ```{r}
    preg <- frame_data(
      ~pregnant, ~male, ~female,
      "yes",     NA,    10,
      "no",      20,    12
    )
    ```

## Separating and uniting

So far you've learned how to tidy `table2`, `table4a`, and `table4b`, but not `table3`. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. You'll also learn about the inverse of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.

### Separate

`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take `table3`:

```{r}
table3
```

The `rate` column contains both `cases` and `population` variables, and we need to split it into two variables. `separate()` takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure \@ref(fig:tidy-separate) and the code below.

```{r}
table3 %>%
  separate(rate, into = c("cases", "population"))
```

```{r tidy-separate, echo = FALSE, out.width = "100%", fig.cap = "Separating `table3` makes it tidy"}
knitr::include_graphics("images/tidy-17.png")
```

By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as:

```{r eval = FALSE}
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
```

(Formally, `sep` is a regular expression, which you'll learn more about in [strings].)

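
Because `sep` is a regular expression, one pattern can match several different delimiters. The sketch below uses a hypothetical toy tibble, `codes`, whose values are split by either a hyphen or an underscore. The tibble and its column names are invented for this illustration.

```{r}
library(tidyr)

# Hypothetical toy data: the two parts are joined by "-" or "_"
codes <- tibble::tibble(id = c("a-1", "b_2", "c-3"))

# The character class [-_] matches either delimiter
split_codes <- codes %>%
  separate(id, into = c("letter", "number"), sep = "[-_]")
split_codes
```

The flip side is that characters with a special meaning in regular expressions, such as `.`, need to be escaped (for example `sep = "\\."`) if you want to split on them literally.
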
Look carefully at the column types: you'll notice that `cases` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:

```{r}
table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)
```

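
If you would rather control the resulting types yourself, one alternative is to separate first and then convert each new column explicitly. This is a sketch of that approach rather than a recommendation over `convert = TRUE`; the name `typed` is made up for the example.

```{r}
library(tidyr)
library(dplyr)

# Separate into character columns, then convert each one explicitly.
# as.integer() is safe here only because every value fits in R's
# 32-bit integer range; larger counts would need as.numeric().
typed <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/") %>%
  mutate(
    cases = as.integer(cases),
    population = as.integer(population)
  )
typed
```
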
You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings. When using integers to separate strings, the length of `sep` should be one less than the number of names in `into`. You can use this arrangement to separate the last two digits of each year.

```{r}
table3 %>%
  separate(year, into = c("century", "year"), sep = 2)
```

### Unite

`unite()` is the inverse of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.

![](images/tidy-18.png)

We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved as `tidyr::table5`. `unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:

```{r}
table5
table5 %>%
  unite(new, century, year)
```

In this case we also need to use the `sep` argument. The default will place an underscore (`_`) between the values from different columns. Here we don't want any separator so we use `""`:

```{r}
table5 %>%
  unite(new, century, year, sep = "")
```

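
To see the inverse relationship in action, the sketch below unites `century` and `year` and then immediately splits the result again after the second character. The column name `full_year` and the object name `round_trip` are illustrative only.

```{r}
library(tidyr)

# Unite without a separator, then split back at position 2
round_trip <- table5 %>%
  unite(full_year, century, year, sep = "") %>%
  separate(full_year, into = c("century", "year"), sep = 2)

round_trip
```

The result should have the same columns as `table5`. The round trip only works this cleanly because every `century` value is exactly two characters long, which is what makes a positional split reliable.
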
### Exercises

1.  What do the `extra` and `fill` arguments do in `separate()`?
    Experiment with the various options for the following two toy datasets.

    ```{r, eval = FALSE}
    tibble::tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
      separate(x, c("one", "two", "three"))

    tibble::tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
      separate(x, c("one", "two", "three"))
    ```

1.  Both `unite()` and `separate()` have a `remove` argument. What does it
    do? Why would you set it to `FALSE`?

1.  Compare and contrast `separate()` and `extract()`.  Why are there
    three variations of separation, but only one unite?

## Missing values

Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

* __Explicitly__, i.e. flagged with `NA`.
* __Implicitly__, i.e. simply not present in the data.

Let's illustrate this idea with a very simple data set:

```{r}
stocks <- data_frame(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
```

There are two missing values in this dataset:

* The return for the fourth quarter of 2015 is explicitly missing, because
  the cell where its value should be instead contains `NA`.

* The return for the first quarter of 2016 is implicitly missing, because it
  simply does not appear in the dataset.

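
One rough way to count the two kinds of missing value is sketched below. It re-creates `stocks` with `tibble::tibble()` so that the chunk stands alone, and it assumes that every year should have every quarter and that no year-quarter pair is duplicated. The names `n_explicit`, `n_possible`, and `n_implicit` are made up for this illustration.

```{r}
stocks <- tibble::tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Explicit missing values are stored as NA, so they can be counted directly
n_explicit <- sum(is.na(stocks$return))

# Implicit missing values are absent rows: compare the number of possible
# year-quarter combinations with the number of rows actually present
n_possible <- length(unique(stocks$year)) * length(unique(stocks$qtr))
n_implicit <- n_possible - nrow(stocks)

c(explicit = n_explicit, implicit = n_implicit)
```
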
One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.

The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:

```{r}
stocks %>%
  spread(year, return)
```

Because these explicit missing values may not be important in other representations of the data, you can set `na.rm = TRUE` in `gather()` to turn explicit missing values implicit:

```{r}
stocks %>%
  spread(year, return) %>%
  gather(year, return, `2015`:`2016`, na.rm = TRUE)
```

-												Adding first draft of tidy data chapter

											
										
										
Another important tool for making missing values explicit in tidy data is `complete()`:

```{r}
stocks %>%
  complete(year, qtr)
```

`complete()` takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.

											
										
										
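To make the two steps that `complete()` performs more concrete, here is a minimal sketch that does them by hand. The `toy` tibble is hypothetical data invented for this illustration (it is not the `stocks` dataset), and building the grid with `expand()` followed by `left_join()` is an approximation of the idea, not a claim about how tidyr implements `complete()` internally.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: the (2016, 1) combination is implicitly missing,
# and the (2015, 2) return is explicitly missing.
toy <- tibble(
  year   = c(2015, 2015, 2016),
  qtr    = c(1, 2, 2),
  return = c(0.5, NA, 0.9)
)

# Step 1: build every combination of the unique values of year and qtr.
grid <- toy %>% expand(year, qtr)

# Step 2: join the observed data back on; combinations with no match get NA.
by_hand <- grid %>% left_join(toy, by = c("year", "qtr"))

# The one-step version, for comparison.
with_complete <- toy %>% complete(year, qtr)

by_hand
with_complete
```

Both results should have one row per combination of `year` and `qtr`, with the previously implicit missing value now showing up as an explicit `NA`.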

											
										
										
There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

```{r}
# tribble() is the current name for frame_data(), which is deprecated
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
```

											
										
										
You can fill in these missing values with `fill()`. It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

```{r}
treatment %>%
  fill(person)
```
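If it helps to see what "last observation carried forward" means mechanically, the following base R sketch reproduces the idea on a plain vector. `locf()` is a hypothetical helper written for this illustration; it is not part of tidyr and is not how `fill()` is implemented.

```r
# Carry the last non-missing value forward (a base R sketch of the idea).
locf <- function(x) {
  seen <- !is.na(x)
  # cumsum(seen) counts how many non-missing values have appeared so far,
  # which is the position of the most recent one; 0 means "none yet".
  idx <- cumsum(seen)
  # Prepend NA so that index 0 + 1 maps to NA for leading missing values.
  c(NA, x[seen])[idx + 1]
}

person <- c("Derrick Whitmore", NA, NA, "Katherine Burke")
locf(person)
```

Each `NA` is replaced by the nearest non-missing value above it, while a missing value at the very start of the vector stays `NA` because there is nothing to carry forward.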

											
										
										
### Exercises

1.  Compare and contrast the `fill` arguments to `spread()` and `complete()`.

1.  What does the direction argument to `fill()` do?

											
										
										
## Case Study

To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem. The `tidyr::who` dataset contains reported tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at <http://www.who.int/tb/country/data/download/en/>.

There's a wealth of epidemiological information in this dataset, but it's challenging to work with the data in the form that it's provided:

```{r}
who
```

											
										
										
This is a very typical example of data you are likely to encounter in real life. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy, and we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll usually need to string together multiple verbs into a pipeline.

The best place to start is almost always to gather together the columns that are not variables. Let's have a look at what we've got:

* It looks like `country`, `iso2`, and `iso3` are three variables that
  redundantly specify the country.
* `year` is clearly also a variable.
* We don't know what all the other columns are yet, but given the structure
  in the variable names (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`)
  these are likely to be values, not variables.

											
										
										
So we need to gather together all the columns from `new_sp_m014` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells represent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.

```{r}
who1 <- who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
```

We can get some hint of the structure of the values in the new `key` column:

```{r}
who1 %>%
  count(key)
```

You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:

											
										
										
1.  The first three letters of each column denote whether the column
    contains new or old cases of TB. In this dataset, each column contains
    new cases.

1.  The next two letters describe the type of TB:

    *   `rel` stands for cases of relapse
    *   `ep` stands for cases of extrapulmonary TB
    *   `sn` stands for cases of pulmonary TB that could not be diagnosed by
        a pulmonary smear (smear negative)
    *   `sp` stands for cases of pulmonary TB that could be diagnosed by
        a pulmonary smear (smear positive)

1.  The sixth letter gives the sex of TB patients. The dataset groups
    cases by males (`m`) and females (`f`).

1.  The remaining numbers give the age group. The dataset groups cases into
    seven age groups:

    * `014` = 0 -- 14 years old
    * `1524` = 15 -- 24 years old
    * `2534` = 25 -- 34 years old
    * `3544` = 35 -- 44 years old
    * `4554` = 45 -- 54 years old
    * `5564` = 55 -- 64 years old
    * `65` = 65 or older

											
										
										
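The structure described by the data dictionary can be checked on a single code with a few lines of base R. This is only a sketch: `parse_code()` is a hypothetical helper invented for this illustration, and it assumes the code has the three underscore-separated parts described above.

```r
# Split a column code such as "new_sp_m014" into its documented parts.
# parse_code() is a hypothetical helper, not part of tidyr.
parse_code <- function(code) {
  # Assumes the form <new>_<type>_<sex><age>, i.e. exactly two underscores.
  parts <- strsplit(code, "_", fixed = TRUE)[[1]]
  sexage <- parts[3]
  list(
    new  = parts[1],
    type = parts[2],
    sex  = substr(sexage, 1, 1),             # first character: m or f
    age  = substr(sexage, 2, nchar(sexage))  # remaining characters: age group
  )
}

str(parse_code("new_sp_m014"))
str(parse_code("new_ep_f65"))
```

Codes written as `newrel_...` contain only one underscore, so this sketch would not parse them correctly as they stand.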

											
										
										
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of `new_rel` we have `newrel` (it's hard to spot this here, but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the string "newrel" with "new_rel". This makes all variable names consistent.

```{r}
who2 <- who1 %>%
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
```

											
										
										
We can separate the values in each code with two passes of `separate()`. The first pass will split the codes at each underscore.

```{r}
who3 <- who2 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
```

											
										
										
Then we might as well drop the `new` column because it's constant in this dataset. While we're dropping columns, let's also drop `iso2` and `iso3` since they're redundant.

```{r}
who3 %>%
  count(new)
who4 <- who3 %>%
  select(-new, -iso2, -iso3)
```

											
										
										
Next we'll separate `sexage` into `sex` and `age` by splitting after the first character:

```{r}
who5 <- who4 %>%
  separate(sexage, c("sex", "age"), sep = 1)
who5
```

											
										
										
The `who` dataset is now tidy, and it will be much easier to work with in R.

I've shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn't how you'd work interactively. Instead, you'd gradually build up a complex pipe:

```{r}
who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>%
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage")) %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1)
```
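As a side note, in tidyr 1.0.0 and later `gather()` is superseded by `pivot_longer()`. The sketch below shows what the gathering step might look like with the newer function, assuming such a version of tidyr is installed; the name `who_long` is chosen here for illustration. The rows come out in a different order than with `gather()`, but the set of rows should be the same.

```r
library(tidyr)
library(dplyr)

# Gather the case-count columns into key/value pairs with pivot_longer().
# values_drop_na = TRUE plays the role of na.rm = TRUE in gather().
who_long <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  )

who_long
```

The remaining steps of the pipe (`mutate()`, `separate()`, and `select()`) can be applied to `who_long` unchanged.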

											
										
										
### Exercises

1.  In this case study I set `na.rm = TRUE` just to make it easier to
    check that we had the correct values. Is this reasonable? Think about
    how missing values are represented in this dataset. Are there implicit
    missing values? What's the difference between an `NA` and zero?

1.  What happens if you neglect the `mutate()` step?

1.  I claimed that `iso2` and `iso3` were redundant with `country`.
    Confirm my claim by creating a table that uniquely maps from `country`
    to `iso2` and `iso3`.

1.  For each country, year, and sex compute the total number of cases of
    TB. Make an informative visualisation of the data.

											
										
										
## Non-tidy data

Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well-founded data structures that are not tidy data. There are two main reasons to use other data structures:

* Alternative representations may have substantial performance or space
  advantages.
* Specialised fields have evolved their own conventions for storing data
  that may be quite different to the conventions of tidy data.

Either of these reasons means you'll need something other than a tibble (or data frame). If your data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice. But there are good reasons to use other structures; tidy data is not the only way.

If you'd like to learn more about non-tidy data, I'd highly recommend this thoughtful blog post by Jeff Leek: <http://simplystatistics.org/2016/02/17/non-tidy-data/>