Start moving towards Hadley style

hadley 2015-07-28 14:15:28 -05:00
parent 559b6e0795
commit 06311a38d8
3 changed files with 41 additions and 52 deletions


@ -1,3 +1,4 @@
<li><a href="intro.html">Introduction</a></li>
<li><a href="tidy.html">Tidy data</a></li>
<!--
<li class="dropdown-header">R for Data Science</li>

23
intro.Rmd Normal file

@ -0,0 +1,23 @@
---
layout: default
title: Welcome
output: bookdown::html_chapter
---
## Prerequisites
You will need to have R installed on your computer to run the code in this chapter, as well as the RStudio IDE, a free program that makes it easier to use R. You can learn how to install both in *Appendix A: Getting Started*.
You will also need to install the `tidyr`, `dplyr`, `devtools`, and `DSR` packages. To install `tidyr`, `dplyr`, and `devtools`, open RStudio and run the command:
```{r eval = FALSE}
install.packages(c("tidyr", "dplyr", "devtools"))
```
`DSR` is a collection of data sets that I have assembled for this book and saved online as a GitHub repository ([github.com/garrettgman/DSR](http://github.com/garrettgman/DSR)). To install `DSR`, run the command:
```{r eval = FALSE}
devtools::install_github("garrettgman/DSR")
```
To use the packages, load them with `library()`, i.e.


@ -13,8 +13,7 @@ In this chapter, you will learn the best way to organize your data for R, a task
Note that this chapter explains how to change the format, or layout, of tabular data. You will learn how to use different file formats with R in the next chapter, Import Data.
Outline
-------
## Outline
In *Section 4.1*, you will learn how the features of R determine the best way to lay out your data. This section introduces "tidy data," a way to organize your data that works particularly well with R.
@ -24,24 +23,7 @@ In *Section 4.1*, you will learn how the features of R determine the best way to
*Section 4.4* concludes the chapter, combining everything you've learned about `tidyr` to tidy a real data set on tuberculosis epidemiology collected by the *World Health Organization*.
Prerequisites
-------------
You will need to have R installed on your computer to run the code in this chapter, as well as the RStudio IDE, a free program that makes it easier to use R. You can learn how to install both in *Appendix A: Getting Started*.
You will also need to install the `tidyr`, `dplyr`, `devtools`, and `DSR` packages. To install `tidyr`, `dplyr`, and `devtools`, open RStudio and run the command:
```{r eval = FALSE}
install.packages(c("tidyr", "dplyr", "devtools"))
```
`DSR` is a collection of data sets that I have assembled for this book and saved online as a GitHub repository ([github.com/garrettgman/DSR](http://github.com/garrettgman/DSR)). To install `DSR`, run the command:
```{r eval = FALSE}
devtools::install_github("garrettgman/DSR")
```
To use the packages, load them with `library()`, i.e.
## Prerequisites
```{r message=FALSE}
library(tidyr)
@ -49,8 +31,7 @@ library(dplyr)
library(DSR)
```
2.1 Tidy data
-------------
## Tidy data
You can organize tabular data in many ways. For example, the data sets below show the same data organized in four different ways. Each data set shows the same values of four variables (*country*, *year*, *population*, and *cases*), but each data set organizes the values into a different layout. You can access the data sets in the `DSR` package.
@ -204,20 +185,15 @@ Keep in mind that this is a trivial calculation with a trivial data set. The ene
The next sections will show you how to transform untidy data sets into tidy data sets.
------------------------------------------------------------------------
Tidy data was popularized by Hadley Wickham, and it serves as the basis for many R packages and functions. You can learn more about tidy data by reading *Tidy Data*, a paper written by Hadley Wickham and published in the *Journal of Statistical Software*. *Tidy Data* is available online at [www.jstatsoft.org/v59/i10/paper](http://www.jstatsoft.org/v59/i10/paper).
------------------------------------------------------------------------
2.2 `spread()` and `gather()`
-----------------------------
## `spread()` and `gather()`
The `tidyr` package by Hadley Wickham is designed to help you tidy your data. It contains four functions that alter the layout of tabular data sets, while preserving the values and relationships contained in the data sets.
The two most important functions in `tidyr` are `gather()` and `spread()`. Each relies on the idea of a key value pair.
### 2.2.1 key value pairs
### key value pairs
A key value pair is a simple way to record information. A pair contains two parts: a *key* that explains what the information describes, and a *value* that contains the actual information. So for example, this would be a key value pair:
@ -258,7 +234,7 @@ In `table2`, the `key` column contains only keys (and not just because the colum
You can use the `spread()` function to tidy this layout.
### 2.2.2 `spread()`
### `spread()`
`spread()` turns a pair of key:value columns into a set of tidy columns. To use `spread()`, pass it the name of a data frame, then the name of the key column in the data frame, and then the name of the value column. Pass the column names as they are; do not use quotes.
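The calling pattern described above can be sketched on a small made-up data frame. This is only an illustration: `toy` and its values are hypothetical (it is not one of the `DSR` tables), and the sketch assumes a version of `tidyr` that still exports `spread()`.

```{r}
library(tidyr)

# Hypothetical toy data: `key` holds variable names, `value` holds their values
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  key     = c("cases", "population", "cases", "population"),
  value   = c(10, 1000, 20, 2000),
  stringsAsFactors = FALSE
)

# Data frame first, then the key column, then the value column (no quotes)
wide <- spread(toy, key, value)
wide
```

Each distinct entry in `key` becomes its own column, and each country ends up on a single row.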
@ -283,7 +259,7 @@ You can see that `spread()` maintains each of the relationships expressed in the
- **`drop`** - The `drop` argument controls how `spread()` handles factors in the key column. If you set `drop = FALSE`, spread will keep factor levels that do not appear in the key column, filling in the missing combinations with the value of `fill`.
### 2.2.3 `gather()`
### `gather()`
`gather()` does the reverse of `spread()`. `gather()` collects a set of column names and places them into a single "key" column. It also collects the field of cells associated with those columns and places them into a single value column. You can use `gather()` to tidy `table4`.
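As a rough sketch of the same idea on a made-up data frame (`toy_wide` and its values are hypothetical, not the `table4` data), the call below gathers two year columns into one key column and one value column. It assumes a version of `tidyr` that still exports `gather()`.

```{r}
library(tidyr)

# Hypothetical toy data: the column names "1999" and "2000" are really
# values of a year variable
toy_wide <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(15, 25),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Name the new key column, then the new value column, then the columns to gather
long <- gather(toy_wide, "year", "cases", 2:3)
long
```

The former column names now appear as entries in `year`, and the cells they contained appear in `cases`.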
@ -331,14 +307,13 @@ tidy5 <- gather(table5, "year", "population", 2:3)
left_join(tidy4, tidy5)
```
2.3 `separate()` and `unite()`
------------------------------
## `separate()` and `unite()`
You may have noticed that we skipped `table3` in the last section. `table3` is untidy too, but it cannot be tidied with `gather()` or `spread()`. To tidy `table3`, you will need two new functions, `separate()` and `unite()`.
`separate()` and `unite()` help you split and combine cells to place a single, complete value in each cell.
### 2.3.1 `separate()`
### `separate()`
`separate()` turns a single character column into multiple columns by splitting the values of the column wherever a separator character appears.
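A minimal sketch of this behavior, using a hypothetical data frame (`toy_rate`, with made-up values) whose `rate` column packs two values into each cell, separated by a slash:

```{r}
library(tidyr)

# Hypothetical toy data: each rate cell holds "cases/population"
toy_rate <- data.frame(
  country = c("A", "B"),
  rate    = c("10/1000", "20/2000"),
  stringsAsFactors = FALSE
)

# Split rate at the "/" separator into two new columns
split_rate <- separate(toy_rate, rate, into = c("cases", "population"), sep = "/")
split_rate
```

The new columns come back as character columns here because `convert` is left at its default.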
@ -376,11 +351,11 @@ You can futher customize `separate()` with the `remove`, `convert`, and `extra`
- **`convert`** - By default, `separate()` will return new columns as character columns. Set `convert = TRUE` to convert new columns to double (numeric), integer, logical, complex, and factor columns with `type.convert()`.
- **`extra`** - `extra` controls what happens when the number of new values in a cell does not match the number of new columns in `into`. If `extra = "error"` (the default), `separate()` will return an error. If `extra = "drop"`, `separate()` will drop extra values and supply `NA`s as necessary to fill the new columns. If `extra = "merge"`, `separate()` will split at most `length(into)` times.
### 2.3.2 `unite()`
### `unite()`
`unite()` does the opposite of `separate()`: it combines multiple columns into a single column.
[UNITE DESCRIPTION] ![](images/blank.png)
**TODO: UNITE DESCRIPTION**
We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved in the `DSR` package as `table6`.
@ -398,20 +373,14 @@ unite(table6, "new", century, year, sep = "")
You can also use integers or the syntax of the `dplyr::select()` function to specify columns to unite in a more concise way.
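For instance, the two calls in this sketch select the same columns, once by position and once with a `select()`-style range. The data frame `toy_year` is a hypothetical stand-in with made-up values, not the `table6` data.

```{r}
library(tidyr)

# Hypothetical toy data with a year split across two columns
toy_year <- data.frame(
  country = c("A", "B"),
  century = c("19", "20"),
  year    = c("99", "00"),
  stringsAsFactors = FALSE
)

# Select the columns to unite by integer position...
by_position <- unite(toy_year, "new", 2:3, sep = "")

# ...or with a select()-style range of column names
by_range <- unite(toy_year, "new", century:year, sep = "")

by_position
```

Both calls should return the same result, so the choice between them is mostly a matter of readability.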
2.4 Case Study
--------------
## Case Study
The `who` data set in the `DSR` package contains cases of tuberculosis (TB) reported between 1995 and 2013, broken down by country, age, and gender. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at [www.who.int/tb/country/data/download/en/](http://www.who.int/tb/country/data/download/en/). The data provides a wealth of epidemiological information, but it would be difficult to work with the data as it is.
To see the data in its raw form, load `DSR` with `library(DSR)`, then run:
```{r eval = FALSE}
```{r}
who
```
![](images/tidy-12.png)
*A subset of the `who` data frame displayed with `View()`.*
`who` provides a realistic example of tabular data in the wild. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy.
------------------------------------------------------------------------
@ -447,32 +416,28 @@ Notice that the `who` data set is untidy in multiple ways. First, the data appea
```{r}
who <- gather(who, "code", "value", 5:60)
who
```
![](images/tidy-13.png)
We can separate the values in each code with two passes of `separate()`. The first pass will split the codes at each underscore.
```{r}
who <- separate(who, code, c("new", "var", "sexage"))
who
```
![](images/tidy-14.png)
The second pass will split `sexage` after the first character to create two columns, a sex column and an age column.
```{r}
who <- separate(who, sexage, c("sex", "age"), sep = 1)
who
```
![](images/tidy-15.png)
The `rel`, `ep`, `sn`, and `sp` keys are all contained in the same column. We can now move the keys into their own column names with `spread()`.
```{r}
who <- spread(who, var, value)
who
```
![](images/tidy-16.png)
The `who` data set is now tidy. It is far from sparkling (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R.