Better image scaling for tidy data

hadley 2015-12-08 06:20:11 -06:00
parent 939091bbec
commit e4b1d60743
1 changed files with 9 additions and 9 deletions

@@ -68,7 +68,7 @@ R follows a set of conventions that makes one layout of tabular data much easier
Data that satisfies these rules is known as *tidy data*. Notice that `table1` is tidy data.
-![](images/tidy-1.png)
+`r bookdown::embed_png("images/tidy-1.png", 220)`
*In `table1`, each variable is placed in its own column, each observation in its own row, and each value in its own cell.*
Tidy data builds on a premise of data science that data sets contain *both values and relationships*. Tidy data displays the relationships in a data set as consistently as it displays the values in a data set.
@@ -79,7 +79,7 @@ Tidy data works well with R because it takes advantage of R's traits as a vector
Tidy data arranges values so that the relationships between variables in a data set will parallel the relationship between vectors in R's storage objects. R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list. In tidy data, each variable in the data set is assigned to its own column, i.e., its own vector in the data frame.
-![](images/tidy-2.png)
+`r bookdown::embed_png("images/tidy-2.png", 220)`
*A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector.*
As a result, you can extract all of the values of a variable in a tidy data set by extracting the column vector that contains the variable. You can do this easily with R's list syntax, i.e.
@@ -111,7 +111,7 @@ table1$population / table1$cases
To create the output, R applies the function in element-wise fashion: R first applies the function (or operation) to the first elements of each vector involved. Then R applies the function (or operation) to the second elements of each vector involved, and so on until R reaches the end of the vectors. If one vector is shorter than the others, R will recycle its values as needed (according to a set of recycling rules).
-![](images/tidy-3.png)
+`r bookdown::embed_png("images/tidy-3.png", 220)`
If your data is tidy, element-wise execution will ensure that observations are preserved across functions and operations. Each value will only be paired with other values that appear in the same row of the data frame. In a tidy data frame, these values will be values of the same observation.
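
A minimal sketch of element-wise execution and recycling follows. The data frame `toy` is a hypothetical stand-in for `table1`, and its numbers are made up for illustration.

```r
# Hypothetical stand-in for `table1`; the values are invented.
toy <- data.frame(
  country    = c("A", "A", "B", "B"),
  year       = c(1999, 2000, 1999, 2000),
  cases      = c(10, 20, 30, 60),
  population = c(1000, 1000, 3000, 3000)
)

# Element-wise division: the i-th value of `cases` is paired with the
# i-th value of `population`, i.e. with a value from the same row.
rate <- toy$cases / toy$population * 10000
rate

# Recycling: the shorter vector c(1, 100) is reused until it matches
# the length of `cases`.
recycled <- toy$cases * c(1, 100)
recycled
```

Because each row of `toy` is one observation, every element of `rate` is computed from values that belong to the same country and year.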
@@ -129,7 +129,7 @@ If you use basic R syntax, your calculations will look like the code below. If y
#### Data set one
-![](images/tidy-4.png)
+`r bookdown::embed_png("images/tidy-4.png", 220)`
Since `table1` is organized in a tidy fashion, you can calculate the rate like this,
@@ -140,7 +140,7 @@ table1$cases / table1$population * 10000
#### Data set two
-![](images/tidy-5.png)
+`r bookdown::embed_png("images/tidy-5.png", 220)`
Data set two intermingles the values of *population* and *cases* in the same column, *value*. As a result, you will need to untangle the values whenever you want to work with each variable separately.
@@ -155,7 +155,7 @@ table2$value[case_rows] / table2$value[pop_rows] * 10000
#### Data set three
-![](images/tidy-6.png)
+`r bookdown::embed_png("images/tidy-6.png", 220)`
Data set three combines the values of cases and population into the same cells. It may seem that this would help you calculate the rate, but that is not so. You will need to separate the population values from the cases values if you wish to do math with them. This can be done, but not with "basic" R syntax.
@@ -166,7 +166,7 @@ Data set three combines the values of cases and population into the same cells.
#### Data set four
-![](images/tidy-7.png)
+`r bookdown::embed_png("images/tidy-7.png", 220)`
Data set four stores the values of each variable in a different format: as a column, a set of column names, or a field of cells. As a result, you will need to work with each variable differently. This makes code written for data set four hard to generalize. The code that extracts the values of *year*, `names(table4)[-1]`, cannot be generalized to extract the values of population, `` c(table5$`1999`, table5$`2000`, table5$`2001`) ``. Compare this to data set one. With `table1`, you can use the same code to extract the values of year, `table1$year`, that you use to extract the values of population. To do so, you only need to change the name of the variable that you will access: `table1$population`.
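
The contrast can be sketched with two small data frames. Both `wide` and `tidy` are hypothetical stand-ins (for `table4` and `table1` respectively), and their values are invented.

```r
# Hypothetical wide layout: years are stored in the column names.
wide <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 30),
  `2000`  = c(20, 60),
  check.names = FALSE  # keep "1999" and "2000" as column names
)

# Each variable needs its own extraction code.
years_wide <- names(wide)[-1]               # years come from the names
cases_wide <- c(wide$`1999`, wide$`2000`)   # cases come from the cells

# Hypothetical tidy layout: one column per variable.
tidy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1999, 2000, 1999, 2000),
  cases   = c(10, 20, 30, 60)
)

# The same pattern works for every variable; only the name changes.
years_tidy <- tidy$year
cases_tidy <- tidy$cases
```

Note that `years_wide` is a character vector, because column names are always strings, while `years_tidy` keeps its numeric type.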
@@ -248,7 +248,7 @@ spread(table2, key, value)
`spread()` returns a copy of your data set that has had the key and value columns removed. In their place, `spread()` adds a new column for each unique key in the key column. These unique keys will form the column names of the new columns. `spread()` distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
-![](images/tidy-8.png)
+`r bookdown::embed_png("images/tidy-8.png", 220)`
*`spread()` distributes a pair of key:value columns into a field of cells. The unique keys in the key column become the column names of the field of cells.*
You can see that `spread()` maintains each of the relationships expressed in the original data set. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the original observations. As a bonus, the layout of these relationships is now tidy.
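
As a rough illustration, here is `spread()` applied to a small key:value data frame. The data frame `long` is a hypothetical stand-in for `table2` with invented values, and the sketch assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical stand-in for `table2`: `key` names the variable and
# `value` holds its value.
long <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1999, 1999, 1999, 1999),
  key     = c("cases", "population", "cases", "population"),
  value   = c(10, 1000, 30, 3000),
  stringsAsFactors = FALSE
)

# Each unique key becomes a column; each country-year pair becomes one row.
wide <- spread(long, key, value)
wide
```

The result has one row per country and year, with *cases* and *population* in their own columns.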
@@ -279,7 +279,7 @@ gather(table4, "year", "cases", 2:3)
We've placed "key" in quotation marks because you will usually use `gather()` to create tidy data. In this case, the "key" column will contain values, not keys. The values will only be keys in the sense that they were formerly in the column names, a place where keys belong.
-![](images/tidy-9.png)
+`r bookdown::embed_png("images/tidy-9.png", 220)`
Just like `spread()`, `gather()` maintains each of the relationships in the original data set. This time `table4` only contained three variables, *country*, *year*, and *cases*. Each of these appears in the output of `gather()` in a tidy fashion.
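
A small sketch of the same call on made-up data follows. The data frame `wide` is a hypothetical stand-in for `table4`, and the sketch assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical stand-in for `table4`: years are stored as column names.
wide <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 30),
  `2000`  = c(20, 60),
  check.names = FALSE  # keep "1999" and "2000" as column names
)

# Collapse columns 2 and 3 into a "year" column (the former column names)
# and a "cases" column (the former cell values).
long <- gather(wide, "year", "cases", 2:3)
long
```

Each country now appears once per year, and the former column names are stored as character values in *year*.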