Fix typos (#146)

behrman 2016-07-22 09:24:48 -07:00 committed by Hadley Wickham
parent b3e09d087b
commit 05a0efe27c
1 changed files with 16 additions and 16 deletions


@ -1,6 +1,6 @@
# Tidy data
> "Tidy datasets are all alike but every messy dataset is messy in its
> "Tidy datasets are all alike, but every messy dataset is messy in its
> own way." -- Hadley Wickham
Data science, at its heart, is a computer programming exercise. Data scientists use computers to store, transform, visualize, and model their data. Each computer program will expect your data to be organized in a predetermined way, which may vary from program to program. To be an effective data scientist, you will need to be able to reorganize your data to match the format required by your program.
@ -142,8 +142,8 @@ You'll need to perform an extra step to calculate the rate.
```{r eval = FALSE}
# Dataset two
case_rows <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)
pop_rows <- c(2, 4, 6, 8, 10, 12, 14, 16, 18)
case_rows <- c(1, 3, 5, 7, 9, 11)
pop_rows <- c(2, 4, 6, 8, 10, 12)
table2$value[case_rows] / table2$value[pop_rows] * 10000
```
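The hard-coded indices in this hunk only work when the table has exactly the expected number of rows, which is the mistake the commit corrects. As a hedged sketch of an alternative, the indices can be derived from the data instead. `toy2` below is a hypothetical stand-in for `table2`; its values are invented, and the column names `key` and `value` are assumptions.

```r
# Hypothetical stand-in for table2 (values invented for illustration)
toy2 <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(c(1999, 1999, 2000, 2000), times = 2),
  key     = rep(c("cases", "population"), times = 4),
  value   = c(10, 1000, 20, 2000, 30, 3000, 60, 4000),
  stringsAsFactors = FALSE
)

# Derive the row indices from the key column rather than typing them out
case_rows <- which(toy2$key == "cases")
pop_rows  <- which(toy2$key == "population")

# Cases per 10,000 people, paired by position
rate <- toy2$value[case_rows] / toy2$value[pop_rows] * 10000
rate
```

This still relies on each cases row being followed by its matching population row, so it shares the fragility of the original approach; it only removes the need to count rows by hand.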
@ -174,8 +174,8 @@ After you collect your input, you can calculate the rate.
```{r eval = FALSE}
# Dataset four
cases <- c(table4$1999, table4$2000, table4$2001)
population <- c(table5$1999, table5$2000, table5$2001)
cases <- c(table4$`1999`, table4$`2000`)
population <- c(table5$`1999`, table5$`2000`)
cases / population * 10000
```
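The backticks added in this hunk matter because `1999` and `2000` are not syntactic R names, so `table4$1999` fails to parse. A minimal sketch of the corrected pattern follows, using hypothetical `toy4` and `toy5` data frames with invented values in place of `table4` and `table5`.

```r
# Hypothetical stand-ins for table4 (cases) and table5 (population).
# check.names = FALSE keeps the numeric-looking column names as given.
toy4 <- data.frame(country = c("A", "B"),
                   `1999` = c(10, 30), `2000` = c(20, 60),
                   check.names = FALSE, stringsAsFactors = FALSE)
toy5 <- data.frame(country = c("A", "B"),
                   `1999` = c(1000, 3000), `2000` = c(2000, 4000),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Non-syntactic column names must be wrapped in backticks after `$`
cases      <- c(toy4$`1999`, toy4$`2000`)
population <- c(toy5$`1999`, toy5$`2000`)

rate <- cases / population * 10000
rate
```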
@ -191,17 +191,17 @@ Tidy data was popularized by Hadley Wickham, and it serves as the basis for many
The `tidyr` package by Hadley Wickham is designed to help you tidy your data. It contains four functions that alter the layout of tabular datasets, while preserving the values and relationships contained in the datasets.
The two most important functions in `tidyr` are `gather()` and `spread()`. Each relies on the idea of a key value pair.
The two most important functions in `tidyr` are `gather()` and `spread()`. Each relies on the idea of a key-value pair.
### key value pairs
### key-value pairs
A key value pair is a simple way to record information. A pair contains two parts: a *key* that explains what the information describes, and a *value* that contains the actual information. So for example, this would be a key value pair:
A key-value pair is a simple way to record information. A pair contains two parts: a *key* that explains what the information describes, and a *value* that contains the actual information. So, for example, this would be a key-value pair:
Password: 0123456789
`0123456789` is the value, and it is associated with the key `Password`.
Data values form natural key value pairs. The value is the value of the pair and the variable that the value describes is the key. So for example, you could decompose `table1` into a group of key value pairs, like this:
Data values form natural key-value pairs. The value is the value of the pair and the variable that the value describes is the key. So for example, you could decompose `table1` into a group of key-value pairs, like this:
Country: Afghanistan
Country: Brazil
@ -222,9 +222,9 @@ Data values form natural key value pairs. The value is the value of the pair and
Cases: 212258
Cases: 213766
However, the key value pairs would cease to be a useful dataset because you no longer know which values belong to the same observation.
However, the key-value pairs would cease to be a useful dataset because you no longer know which values belong to the same observation.
Every cell in a table of data contains one half of a key value pair, as does every column name. In tidy data, each cell will contain a value and each column name will contain a key, but this doesn't need to be the case for untidy data. Consider `table2`.
Every cell in a table of data contains one half of a key-value pair, as does every column name. In tidy data, each cell will contain a value and each column name will contain a key, but this doesn't need to be the case for untidy data. Consider `table2`.
```{r}
table2
@ -270,7 +270,7 @@ You can see that `spread()` maintains each of the relationships expressed in the
table4 # cases
```
To use `gather()`, pass it the name of a data frame to reshape. Then pass `gather()` a character string to use for the name of the "key" column that it will make, as well as a character string to use as the name of the value column that it will make. Finally, specify which columns `gather()` should collapse into the key value pair (here with integer notation).
To use `gather()`, pass it the name of a data frame to reshape. Then pass `gather()` a character string to use for the name of the "key" column that it will make, as well as a character string to use as the name of the value column that it will make. Finally, specify which columns `gather()` should collapse into the key-value pair (here with integer notation).
```{r}
gather(table4, "year", "cases", 2:3)
@ -278,7 +278,7 @@ gather(table4, "year", "cases", 2:3)
`gather()` returns a copy of the data frame with the specified columns removed. To this data frame, `gather()` has added two new columns: a "key" column that contains the former column names of the removed columns, and a value column that contains the former values of the removed columns. `gather()` repeats each of the former column names (as well as each of the original columns) to maintain each combination of values that appeared in the original dataset. `gather()` uses the first string that you supplied as the name of the new "key" column, and it uses the second string as the name of the new value column. In our example, these were the strings "year" and "cases."
We've placed "key" in quotation marks because you will usually use `gather()` to create tidy data. In this case, the "key" column will contain values, not keys. The values will only be keys in the sense that they were formally in the column names, a place where keys belong.
We've placed "key" in quotation marks because you will usually use `gather()` to create tidy data. In this case, the "key" column will contain values, not keys. The values will only be keys in the sense that they were formerly in the column names, a place where keys belong.
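To make the description of `gather()` concrete, here is a small sketch on a hypothetical `toy4` data frame (invented values, same shape as `table4`). It assumes a version of `tidyr` that still provides `gather()`.

```r
library(tidyr)

# Hypothetical stand-in for table4: one column per year
toy4 <- data.frame(country = c("A", "B"),
                   `1999` = c(10, 30), `2000` = c(20, 60),
                   check.names = FALSE, stringsAsFactors = FALSE)

# Collapse columns 2 and 3 into a "year" key column and a "cases" value column
long <- gather(toy4, "year", "cases", 2:3)
long
```

The former column names end up as values in `year`, and each `country` is repeated once per collapsed column, so every original combination of values is preserved.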
```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-9.png")
@ -354,7 +354,7 @@ You can further customize `separate()` with the `remove`, `convert`, and `extra`
- **`remove`** - Set `remove = FALSE` to retain the column of values that were separated in the final data frame.
- **`convert`** - By default, `separate()` will return new columns as character columns. Set `convert = TRUE` to convert new columns to double (numeric), integer, logical, complex, and factor columns with `type.convert()`.
- **`extra`** - `extra` controls what happens when the number of new values in a cell does not match the number of new columns in `into`. If `extra = error` (the default), `separate()` will return an error. If `extra = drop`, `separate()` will drop new values and supply `NA`s as necessary to fill the new columns. If `extra = merge`, `separate()` will split at most `length(into)` times.
- **`extra`** - `extra` controls what happens when the number of new values in a cell does not match the number of new columns in `into`. If `extra = error` (the default), `separate()` will return an error. If `extra = "drop"`, `separate()` will drop new values and supply `NA`s as necessary to fill the new columns. If `extra = "merge"`, `separate()` will split at most `length(into)` times.
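The quoting fix above reflects that `extra` takes a character string. As a hedged illustration, the sketch below applies `"merge"` and `"drop"` to a hypothetical one-column data frame in which one cell has more pieces than `into` has names. Defaults and warnings for `extra` may differ between `tidyr` versions, so only the two explicit settings are shown.

```r
library(tidyr)

# Hypothetical data: the second cell splits into three pieces, not two
toy <- data.frame(x = c("a/b", "c/d/e"), stringsAsFactors = FALSE)

# "merge": split at most length(into) times, keeping the remainder together
merged <- separate(toy, x, into = c("first", "second"),
                   sep = "/", extra = "merge")

# "drop": discard the pieces that do not fit into `into`
dropped <- separate(toy, x, into = c("first", "second"),
                    sep = "/", extra = "drop")

merged
dropped
```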
### `unite()`
@ -371,7 +371,7 @@ table6
Give `unite()` the name of the data frame to reshape, the name of the new column to create (as a character string), and the names of the columns to unite. `unite()` will place an underscore (\_) between values from separate columns. If you would like to use a different separator, or no separator at all, pass the separator as a character string to `sep`.
```{r}
unite(table6, "new", century, year, sep = "")
unite(table6, new, century, year, sep = "")
```
`unite()` returns a copy of the data frame that includes the new column, but not the columns used to build the new column. If you would like to retain these columns, add the argument `remove = FALSE`.
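A short sketch of both behaviors, on a hypothetical `toy6` data frame with invented values shaped like `table6`:

```r
library(tidyr)

# Hypothetical stand-in for table6: century and year stored separately
toy6 <- data.frame(country = c("A", "B"),
                   century = c("19", "20"),
                   year    = c("99", "00"),
                   stringsAsFactors = FALSE)

# Default: the source columns are dropped after being united
united <- unite(toy6, new, century, year, sep = "")

# remove = FALSE keeps century and year alongside the new column
kept <- unite(toy6, new, century, year, sep = "", remove = FALSE)

united
kept
```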
@ -420,7 +420,7 @@ The most unique feature of `who` is its coding system. Columns five through sixt
Notice that the `who` dataset is untidy in multiple ways. First, the data appears to contain values in its column names, coded values such as male, relapse, and 0 - 14 years of age. We can move the names into their own column with `gather()`. This will make it easy to separate the values combined in each column name.
```{r}
who <- gather(who, "code", "value", 5:60)
who <- gather(who, code, value, 5:60)
who
```