Use dataset instead of data set
This commit is contained in:
hadley 2016-07-11 10:40:44 -05:00
parent 7a285374de
commit 1822802696
10 changed files with 113 additions and 89 deletions
@@ -7,7 +7,7 @@ library(ggplot2)
# Communication with plots
The previous sections showed you how to make plots that you can use as tools for _exploration_. When you made these plots, you knew---even before you looked at them---which variables the plot would display and which data sets the variables would come from. You might have even known what to look for in the completed plots, assuming that you made each plot with a goal in mind. As a result, it was not very important to put a title or a useful set of labels on your plots.
The previous sections showed you how to make plots that you can use as tools for _exploration_. When you made these plots, you knew---even before you looked at them---which variables the plot would display and which datasets the variables would come from. You might have even known what to look for in the completed plots, assuming that you made each plot with a goal in mind. As a result, it was not very important to put a title or a useful set of labels on your plots.
The importance of titles and labels changes once you use your plots for _communication_. Your audience will not share your background knowledge. In fact, they may not know anything about your plots except what the plots themselves display. If you want your plots to communicate your findings effectively, you will need to make them as self-explanatory as possible.
@@ -14,14 +14,14 @@ library(lubridate)
## Parsing times
Time data normally comes as character strings, or numbers spread across columns, as in the `flights` data set from [Relational data].
Time data normally comes as character strings, or numbers spread across columns, as in the `flights` dataset from [Relational data].
```{r}
flights %>%
select(year, month, day, hour, minute)
```
Getting R to agree that your data set contains the dates and times that you think it does can be tricky. Lubridate simplifies that. To combine separate numbers into datetimes, use `make_datetime()`.
Getting R to agree that your dataset contains the dates and times that you think it does can be tricky. Lubridate simplifies that. To combine separate numbers into datetimes, use `make_datetime()`.
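As a sketch of what that call looks like (assuming the `flights` data comes from the nycflights13 package, and noting that `make_datetime()` takes the components positionally):

```{r}
library(lubridate)
library(dplyr)
library(nycflights13)

# Combine the separate year/month/day/hour/minute columns into one datetime
flights %>%
  mutate(departure = make_datetime(year, month, day, hour, minute)) %>%
  select(departure)
```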
```{r}
datetimes <- flights %>%
@@ -29,6 +29,6 @@ Visualization works because your brain processes visual information in a differe
You can also comprehend data by transforming it. You can easily attend to a small set of summary values, which lets you absorb important information about the data. This is why it feels natural to work with things like averages, maximums, minimums, medians, and so on.
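A quick sketch of that idea, using R's built-in `mtcars` data purely for illustration:

```{r}
# Four summary values stand in for 32 rows of raw measurements
c(
  mean   = mean(mtcars$mpg),
  median = median(mtcars$mpg),
  min    = min(mtcars$mpg),
  max    = max(mtcars$mpg)
)
```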
Another way to summarize your data is to replace it with a model, a function that describes the relationships between two or more variables. You can attend to the important parts of a model more easily than you can attend to the raw values in your data set.
Another way to summarize your data is to replace it with a model, a function that describes the relationships between two or more variables. You can attend to the important parts of a model more easily than you can attend to the raw values in your dataset.
The first problem in Data Science is a cognitive problem: how can you understand your own data? In this part of the book, you'll learn how to use R to discover and understand the information contained in your data.
@@ -75,7 +75,7 @@ However, we strongly believe that it's best to master one tool at a time. You wi
### Non-rectangular data
This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There are lots of data sets that do not naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.
This book focuses exclusively on structured datasets: collections of values that are each associated with a variable and an observation. There are lots of datasets that do not naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.
### Hypothesis confirmation
@@ -242,7 +242,7 @@ Now that we've given you a quick overview and intuition for these techniques, le
Both the bootstrap and cross-validation are built on top of a "resample" object. In modelr, you can access these low-level tools directly with the `resample_*` functions.
These functions return an object of class "resample", which represents the resample in a memory efficient way. Instead of storing the resampled data set itself, it instead stores the integer indices, and a "pointer" to the original dataset. This makes resamples take up much less memory.
These functions return an object of class "resample", which represents the resample in a memory efficient way. Instead of storing the resampled dataset itself, it instead stores the integer indices, and a "pointer" to the original dataset. This makes resamples take up much less memory.
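For example (a sketch; `mtcars` stands in for any data frame):

```{r}
library(modelr)

# The resample object records integer indices plus a pointer to `mtcars`,
# not a copy of the rows themselves
boot <- resample_bootstrap(mtcars)

# Materialize it as a data frame only when a computation needs the rows
nrow(as.data.frame(boot))
```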
```{r}
x <- resample_bootstrap(as_data_frame(mtcars))
@@ -1,3 +1,15 @@
```{r include=FALSE, cache=FALSE}
set.seed(1014)
options(digits = 3)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
cache = TRUE
)
options(dplyr.print_min = 6, dplyr.print_max = 6)
```
# Model
The goal of a fitted model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
@@ -29,7 +41,7 @@ We are going to focus on predictive models, how you can use simple fitted models
### Prerequisites
To access the functions and data sets that we will use in the chapter, load the following packages:
To access the functions and datasets that we will use in the chapter, load the following packages:
```{r setup, message = FALSE}
# Modelling functions
@@ -160,7 +172,7 @@ Have you heard that a relationship exists between your height and your income? I
Luckily, it is easy to measure someone's height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income; that's just a lucky accident.
A small sample of the full data set is included in modelr:
A small sample of the full dataset is included in modelr:
```{r}
heights
@@ -1,3 +1,15 @@
```{r include=FALSE, cache=FALSE}
set.seed(1014)
options(digits = 3)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
cache = TRUE
)
options(dplyr.print_min = 6, dplyr.print_max = 6)
```
# (PART) Model {-}
# Introduction
@@ -23,7 +35,7 @@ In this book we are going to focus on models primarily as tools for description.
In other words, in this book, we're typically going to think about a good model as one that captures the patterns that we see in the data. For now, a good model captures the majority of the patterns that are generated by the underlying mechanism of interest, and captures few patterns that are not generated by that mechanism. When you go on from this book and learn other ways of thinking about models, this will stand you in good stead: if you can't capture patterns in the data that you can see, it's unlikely you'll be able to make good predictions about data that you haven't seen.
It's not possible to do both on the same data set.
It's not possible to do both on the same dataset.
Doing correct inference is hard!
@@ -13,11 +13,11 @@ Note that this chapter explains how to change the format, or layout, of tabular
In *Section 4.1*, you will learn how the features of R determine the best way to lay out your data. This section introduces "tidy data," a way to organize your data that works particularly well with R.
*Section 4.2* teaches the basic method for making untidy data tidy. In this section, you will learn how to reorganize the values in your data set with the `spread()` and `gather()` functions of the `tidyr` package.
*Section 4.2* teaches the basic method for making untidy data tidy. In this section, you will learn how to reorganize the values in your dataset with the `spread()` and `gather()` functions of the `tidyr` package.
*Section 4.3* explains how to split apart and combine values in your data set to make them easier to access with R.
*Section 4.3* explains how to split apart and combine values in your dataset to make them easier to access with R.
*Section 4.4* concludes the chapter, combining everything you've learned about `tidyr` to tidy a real data set on tuberculosis epidemiology collected by the *World Health Organization*.
*Section 4.4* concludes the chapter, combining everything you've learned about `tidyr` to tidy a real dataset on tuberculosis epidemiology collected by the *World Health Organization*.
## Prerequisites
@@ -28,34 +28,34 @@ library(dplyr)
## Tidy data
You can organize tabular data in many ways. For example, the data sets below show the same data organized in four different ways. Each data set shows the same values of four variables *country*, *year*, *population*, and *cases*, but each data set organizes the values into a different layout. You can access the data sets in tidyr.
You can organize tabular data in many ways. For example, the datasets below show the same data organized in four different ways. Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organizes the values into a different layout. You can access the datasets in tidyr.
```{r}
# Data set one
# dataset one
table1
# Data set two
# dataset two
table2
# Data set three
# dataset three
table3
```
The last data set is a collection of two tables.
The last dataset is a collection of two tables.
```{r}
# Data set four
# dataset four
table4 # cases
table5 # population
```
You might think that these data sets are interchangeable since they display the same information, but one data set will be much easier to work with in R than the others.
You might think that these datasets are interchangeable since they display the same information, but one dataset will be much easier to work with in R than the others.
Why should that be?
R follows a set of conventions that makes one layout of tabular data much easier to work with than others. Your data will be easier to work with in R if it follows three rules:
1. Each variable in the data set is placed in its own column
1. Each variable in the dataset is placed in its own column
2. Each observation is placed in its own row
3. Each value is placed in its own cell\*
@@ -67,13 +67,13 @@ knitr::include_graphics("images/tidy-1.png")
*In `table1`, each variable is placed in its own column, each observation in its own row, and each value in its own cell.*
Tidy data builds on a premise of data science that data sets contain *both values and relationships*. Tidy data displays the relationships in a data set as consistently as it displays the values in a data set.
Tidy data builds on a premise of data science that datasets contain *both values and relationships*. Tidy data displays the relationships in a dataset as consistently as it displays the values in a dataset.
At this point, you might think that tidy data is so obvious that it is trivial. Surely, most data sets come in a tidy format, right? Wrong. In practice, raw data is rarely tidy and is much harder to work with as a result. *Section 2.4* provides a realistic example of data collected in the wild.
At this point, you might think that tidy data is so obvious that it is trivial. Surely, most datasets come in a tidy format, right? Wrong. In practice, raw data is rarely tidy and is much harder to work with as a result. *Section 2.4* provides a realistic example of data collected in the wild.
Tidy data works well with R because it takes advantage of R's traits as a vectorized programming language. Data structures in R are organized around vectors, and R's functions are optimized to work with vectors. Tidy data takes advantage of both of these traits.
Tidy data arranges values so that the relationships between variables in a data set will parallel the relationship between vectors in R's storage objects. R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list. In tidy data, each variable in the data set is assigned to its own column, i.e., its own vector in the data frame.
Tidy data arranges values so that the relationships between variables in a dataset will parallel the relationship between vectors in R's storage objects. R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list. In tidy data, each variable in the dataset is assigned to its own column, i.e., its own vector in the data frame.
```{r, echo = FALSE}
knitr::include_graphics("images/tidy-2.png")
@@ -81,7 +81,7 @@ knitr::include_graphics("images/tidy-2.png")
*A data frame is a list of vectors that R displays as a table. When your data is tidy, the values of each variable fall in their own column vector.*
As a result, you can extract all the values of a variable in a tidy data set by extracting the column vector that contains the variable. You can do this easily with R's list syntax, i.e.
As a result, you can extract all the values of a variable in a tidy dataset by extracting the column vector that contains the variable. You can do this easily with R's list syntax, i.e.
```{r}
table1$cases
@@ -116,9 +116,9 @@ knitr::include_graphics("images/tidy-3.png")
If your data is tidy, element-wise execution will ensure that observations are preserved across functions and operations. Each value will only be paired with other values that appear in the same row of the data frame. In a tidy data frame, these values will be values of the same observation.
Do these small advantages matter in the long run? Yes. Consider what it would be like to do a simple calculation with each of the data sets from the start of this section.
Do these small advantages matter in the long run? Yes. Consider what it would be like to do a simple calculation with each of the datasets from the start of this section.
Assume that in these data sets, `cases` refers to the number of people diagnosed with TB per country per year. To calculate the *rate* of TB cases per country per year (i.e., the number of people per 10,000 diagnosed with TB), you will need to do four operations with the data. You will need to:
Assume that in these datasets, `cases` refers to the number of people diagnosed with TB per country per year. To calculate the *rate* of TB cases per country per year (i.e., the number of people per 10,000 diagnosed with TB), you will need to do four operations with the data. You will need to:
1. Extract the number of TB cases per country per year
2. Extract the population per country per year (in the same order as
@@ -128,7 +128,7 @@ Assume that in these data sets, `cases` refers to the number of people diagnosed
If you use basic R syntax, your calculations will look like the code below. If you'd like to brush up on basic R syntax, see Appendix A: Getting Started.
#### Data set one
#### Dataset one
```{r, echo = FALSE}
knitr::include_graphics("images/tidy-4.png")
@@ -137,70 +137,70 @@ knitr::include_graphics("images/tidy-4.png")
Since `table1` is organized in a tidy fashion, you can calculate the rate like this,
```{r eval = FALSE}
# Data set one
# Dataset one
table1$cases / table1$population * 10000
```
#### Data set two
#### Dataset two
```{r, echo = FALSE}
knitr::include_graphics("images/tidy-5.png")
```
Data set two intermingles the values of *population* and *cases* in the same column, *value*. As a result, you will need to untangle the values whenever you want to work with each variable separately.
Dataset two intermingles the values of *population* and *cases* in the same column, *value*. As a result, you will need to untangle the values whenever you want to work with each variable separately.
You'll need to perform an extra step to calculate the rate.
```{r eval = FALSE}
# Data set two
# Dataset two
case_rows <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)
pop_rows <- c(2, 4, 6, 8, 10, 12, 14, 16, 18)
table2$value[case_rows] / table2$value[pop_rows] * 10000
```
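For comparison, a sketch of how `spread()` (covered later in this chapter) avoids the hard-coded row indices, using the `key`/`value` column names that `table2` has in this draft:

```{r eval = FALSE}
library(tidyr)

# Untangle cases and population into their own columns, then divide
tidied <- spread(table2, key, value)
tidied$cases / tidied$population * 10000
```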
#### Data set three
#### Dataset three
```{r, echo = FALSE}
knitr::include_graphics("images/tidy-6.png")
```
Data set three combines the values of cases and population into the same cells. It may seem that this would help you calculate the rate, but that is not so. You will need to separate the population values from the cases values if you wish to do math with them. This can be done, but not with "basic" R syntax.
Dataset three combines the values of cases and population into the same cells. It may seem that this would help you calculate the rate, but that is not so. You will need to separate the population values from the cases values if you wish to do math with them. This can be done, but not with "basic" R syntax.
```{r eval = FALSE}
# Data set three
# Dataset three
# No basic solution
```
#### Data set four
#### Dataset four
```{r, echo = FALSE}
knitr::include_graphics("images/tidy-7.png")
```
Data set four stores the values of each variable in a different format: as a column, a set of column names, or a field of cells. As a result, you will need to work with each variable differently. This makes code written for data set four hard to generalize. The code that extracts the values of *year*, `names(table4)[-1]`, cannot be generalized to extract the values of population, ``c(table5$`1999`, table5$`2000`, table5$`2001`)``. Compare this to data set one. With `table1`, you can use the same code to extract the values of year, `table1$year`, that you use to extract the values of population. To do so, you only need to change the name of the variable that you will access: `table1$population`.
Dataset four stores the values of each variable in a different format: as a column, a set of column names, or a field of cells. As a result, you will need to work with each variable differently. This makes code written for dataset four hard to generalize. The code that extracts the values of *year*, `names(table4)[-1]`, cannot be generalized to extract the values of population, ``c(table5$`1999`, table5$`2000`, table5$`2001`)``. Compare this to dataset one. With `table1`, you can use the same code to extract the values of year, `table1$year`, that you use to extract the values of population. To do so, you only need to change the name of the variable that you will access: `table1$population`.
The organization of data set four is inefficient in a second way as well. Data set four separates the values of some variables across two separate tables. This is inconvenient because you will need to extract information from two different places whenever you want to work with the data.
The organization of dataset four is inefficient in a second way as well. Dataset four separates the values of some variables across two separate tables. This is inconvenient because you will need to extract information from two different places whenever you want to work with the data.
After you collect your input, you can calculate the rate.
```{r eval = FALSE}
# Data set four
# Dataset four
cases <- c(table4$`1999`, table4$`2000`, table4$`2001`)
population <- c(table5$`1999`, table5$`2000`, table5$`2001`)
cases / population * 10000
```
Data set one, the tidy data set, is much easier to work with than data sets two, three, or four. To work with data sets two, three, and four, you need to take extra steps, which makes your code harder to write, harder to understand, and harder to debug.
Dataset one, the tidy dataset, is much easier to work with than datasets two, three, or four. To work with datasets two, three, and four, you need to take extra steps, which makes your code harder to write, harder to understand, and harder to debug.
Keep in mind that this is a trivial calculation with a trivial data set. The energy you must expend to manage a poor layout will increase with the size of your data. Extra steps will accumulate over the course of an analysis and allow errors to creep into your work. You can avoid these difficulties by converting your data into a tidy format at the start of your analysis.
Keep in mind that this is a trivial calculation with a trivial dataset. The energy you must expend to manage a poor layout will increase with the size of your data. Extra steps will accumulate over the course of an analysis and allow errors to creep into your work. You can avoid these difficulties by converting your data into a tidy format at the start of your analysis.
The next sections will show you how to transform untidy data sets into tidy data sets.
The next sections will show you how to transform untidy datasets into tidy datasets.
Tidy data was popularized by Hadley Wickham, and it serves as the basis for many R packages and functions. You can learn more about tidy data by reading *Tidy Data*, a paper written by Hadley Wickham and published in the Journal of Statistical Software. *Tidy Data* is available online at [www.jstatsoft.org/v59/i10/paper](http://www.jstatsoft.org/v59/i10/paper).
## `spread()` and `gather()`
The `tidyr` package by Hadley Wickham is designed to help you tidy your data. It contains four functions that alter the layout of tabular data sets, while preserving the values and relationships contained in the data sets.
The `tidyr` package by Hadley Wickham is designed to help you tidy your data. It contains four functions that alter the layout of tabular datasets, while preserving the values and relationships contained in the datasets.
The two most important functions in `tidyr` are `gather()` and `spread()`. Each relies on the idea of a key value pair.
@@ -233,7 +233,7 @@ Data values form natural key value pairs. The value is the value of the pair and
Cases: 212258
Cases: 213766
However, the key value pairs would cease to be a useful data set because you no longer know which values belong to the same observation.
However, the key value pairs would cease to be a useful dataset because you no longer know which values belong to the same observation.
Every cell in a table of data contains one half of a key value pair, as does every column name. In tidy data, each cell will contain a value and each column name will contain a key, but this doesn't need to be the case for untidy data. Consider `table2`.
@@ -255,7 +255,7 @@ To tidy `table2`, you would pass `spread()` the `key` column and then the `value
spread(table2, key, value)
```
`spread()` returns a copy of your data set that has had the key and value columns removed. In their place, `spread()` adds a new column for each unique key in the key column. These unique keys will form the column names of the new columns. `spread()` distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
`spread()` returns a copy of your dataset that has had the key and value columns removed. In their place, `spread()` adds a new column for each unique key in the key column. These unique keys will form the column names of the new columns. `spread()` distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.
```{r, echo = FALSE}
knitr::include_graphics("images/tidy-8.png")
@@ -263,11 +263,11 @@ knitr::include_graphics("images/tidy-8.png")
*`spread()` distributes a pair of key:value columns into a field of cells. The unique keys in the key column become the column names of the field of cells.*
You can see that `spread()` maintains each of the relationships expressed in the original data set. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the original observations. As a bonus, now the layout of these relationships is tidy.
You can see that `spread()` maintains each of the relationships expressed in the original dataset. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the original observations. As a bonus, now the layout of these relationships is tidy.
`spread()` takes three optional arguments in addition to `data`, `key`, and `value`:
- **`fill`** - If the tidy structure creates combinations of variables that do not exist in the original data set, `spread()` will place an `NA` in the resulting cells. `NA` is R's missing value symbol. You can change this behaviour by passing `fill` an alternative value to use.
- **`fill`** - If the tidy structure creates combinations of variables that do not exist in the original dataset, `spread()` will place an `NA` in the resulting cells. `NA` is R's missing value symbol. You can change this behaviour by passing `fill` an alternative value to use.
- **`convert`** - If a value column contains multiple types of data, its elements will be saved as a single type, usually character strings. As a result, the new columns created by `spread()` will also contain character strings. If you set `convert = TRUE`, `spread()` will run `type.convert()` on each new column, which will convert strings to doubles (numerics), integers, logicals, complex numbers, or factors if appropriate.
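A small sketch of the `convert` behaviour, using a made-up data frame (`df` is hypothetical):

```{r}
library(tidyr)

# A value column where numbers were stored as character strings
df <- data.frame(
  id    = c(1, 1, 2, 2),
  key   = c("height", "weight", "height", "weight"),
  value = c("175", "70", "160", "55"),
  stringsAsFactors = FALSE
)

spread(df, key, value)                  # new columns stay character
spread(df, key, value, convert = TRUE)  # type.convert() makes them integer
```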
@@ -287,7 +287,7 @@ To use `gather()`, pass it the name of a data frame to reshape. Then pass `gathe
gather(table4, "year", "cases", 2:3)
```
`gather()` returns a copy of the data frame with the specified columns removed. To this data frame, `gather()` has added two new columns: a "key" column that contains the former column names of the removed columns, and a value column that contains the former values of the removed columns. `gather()` repeats each of the former column names (as well as each of the original columns) to maintain each combination of values that appeared in the original data set. `gather()` uses the first string that you supplied as the name of the new "key" column, and it uses the second string as the name of the new value column. In our example, these were the strings "year" and "cases."
`gather()` returns a copy of the data frame with the specified columns removed. To this data frame, `gather()` has added two new columns: a "key" column that contains the former column names of the removed columns, and a value column that contains the former values of the removed columns. `gather()` repeats each of the former column names (as well as each of the original columns) to maintain each combination of values that appeared in the original dataset. `gather()` uses the first string that you supplied as the name of the new "key" column, and it uses the second string as the name of the new value column. In our example, these were the strings "year" and "cases."
We've placed "key" in quotation marks because you will usually use `gather()` to create tidy data. In this case, the "key" column will contain values, not keys. The values will only be keys in the sense that they were formerly in the column names, a place where keys belong.
@@ -295,9 +295,9 @@ We've placed "key" in quotation marks because you will usually use `gather()` to
knitr::include_graphics("images/tidy-9.png")
```
Just like `spread()`, `gather()` maintains each of the relationships in the original data set. This time `table4` only contained three variables, *country*, *year* and *cases*. Each of these appears in the output of `gather()` in a tidy fashion.
Just like `spread()`, `gather()` maintains each of the relationships in the original dataset. This time `table4` only contained three variables, *country*, *year* and *cases*. Each of these appears in the output of `gather()` in a tidy fashion.
`gather()` also maintains each of the observations in the original data set, organizing them in a tidy fashion.
`gather()` also maintains each of the observations in the original dataset, organizing them in a tidy fashion.
We can use `gather()` to tidy `table5` in a similar fashion.
@@ -315,7 +315,7 @@ gather(table5, "year", "population", -1)
You can also identify columns by name with the notation introduced by the `select` function in `dplyr`; see Section 3.1.
You can easily combine the new versions of `table4` and `table5` into a single data frame because the new versions are both tidy. To combine the data sets, use the `left_join()` function from Section 3.6.
You can easily combine the new versions of `table4` and `table5` into a single data frame because the new versions are both tidy. To combine the datasets, use the `left_join()` function from Section 3.6.
```{r}
tidy4 <- gather(table4, "year", "cases", 2:3)
@@ -391,7 +391,7 @@ You can also use integers or the syntax of the `dplyr::select()` function to spe
## Case Study
The `who` data set in tidyr contains cases of tuberculosis (TB) reported between 1995 and 2013 sorted by country, age, and gender. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at [www.who.int/tb/country/data/download/en/](http://www.who.int/tb/country/data/download/en/). The data provides a wealth of epidemiological information, but it would be difficult to work with the data as it is.
The `who` dataset in tidyr contains cases of tuberculosis (TB) reported between 1995 and 2013 sorted by country, age, and gender. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at [www.who.int/tb/country/data/download/en/](http://www.who.int/tb/country/data/download/en/). The data provides a wealth of epidemiological information, but it would be difficult to work with the data as it is.
```{r}
who
@@ -403,13 +403,13 @@ who
*TIP*
The `View()` function opens a data viewer in the RStudio IDE. Here you can examine the data set, search for values, and filter the display based on logical conditions. Notice that the `View()` function begins with a capital V.
The `View()` function opens a data viewer in the RStudio IDE. Here you can examine the dataset, search for values, and filter the display based on logical conditions. Notice that the `View()` function begins with a capital V.
------------------------------------------------------------------------
The most distinctive feature of `who` is its coding system. Columns five through sixty encode four separate pieces of information in their column names:
1. The first three letters of each column denote whether the column contains new or old cases of TB. In this data set, each column contains new cases.
1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
2. The next two letters describe the type of case being counted. We will treat each of these as a separate variable.
- `rel` stands for cases of relapse
@@ -417,9 +417,9 @@ The most unique feature of `who` is its coding system. Columns five through sixt
- `sn` stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
- `sp` stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
-3. The sixth letter describes the sex of TB patients. The data set groups cases by males (`m`) and females (`f`).
+3. The sixth letter describes the sex of TB patients. The dataset groups cases by males (`m`) and females (`f`).
-4. The remaining numbers describe the age group of TB patients. The data set groups cases into seven age groups:
+4. The remaining numbers describe the age group of TB patients. The dataset groups cases into seven age groups:
- `014` stands for patients that are 0 to 14 years old
- `1524` stands for patients that are 15 to 24 years old
- `2534` stands for patients that are 25 to 34 years old
@ -428,7 +428,7 @@ The most unique feature of `who` is its coding system. Columns five through sixt
- `5564` stands for patients that are 55 to 64 years old
- `65` stands for patients that are 65 years old or older
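To make the coding system above concrete, here is a small sketch of how a coded name such as `new_sp_m014` could be pulled apart with tidyr's `separate()`. The toy tibble and the split positions here are illustrative assumptions, not part of the `who` analysis itself.

```{r}
# Illustrative sketch only: split a coded name like "new_sp_m014"
# into its pieces (new/old, type, sex, age) with tidyr::separate().
library(tidyr)
library(tibble)

codes <- tibble(code = c("new_sp_m014", "new_rel_f65"))
codes %>%
  separate(code, into = c("new", "var", "sexage"), sep = "_") %>%
  separate(sexage, into = c("sex", "age"), sep = 1)
```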
-Notice that the `who` data set is untidy in multiple ways. First, the data appears to contain values in its column names, coded values such as male, relapse, and 0 - 14 years of age. We can move the names into their own column with `gather()`. This will make it easy to separate the values combined in each column name.
+Notice that the `who` dataset is untidy in multiple ways. First, the data appears to contain values in its column names, coded values such as male, relapse, and 0 - 14 years of age. We can move the names into their own column with `gather()`. This will make it easy to separate the values combined in each column name.
```{r}
who <- gather(who, "code", "value", 5:60)
@@ -456,4 +456,4 @@ who <- spread(who, var, value)
who
```
-The `who` data set is now tidy. It is far from sparkling (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R.
+The `who` dataset is now tidy. It is far from sparkling (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R.

View File

@@ -22,9 +22,9 @@ There is no formal way to do Exploratory Data Analysis because you must be free
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."---John Tukey
-Your goal during EDA is to develop a complete understanding of your data set and the information that it contains. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your data set and helps you decide which graphs or models to make.
+Your goal during EDA is to develop a complete understanding of your dataset and the information that it contains. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs or models to make.
-During EDA, the _quantity_ of questions that you ask matters more than the quality of the questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your data set. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.
+During EDA, the _quantity_ of questions that you ask matters more than the quality of the questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as
@@ -231,7 +231,7 @@ Visualize covariation between continuous and categorical variables with boxplots
knitr::include_graphics("images/EDA-boxplot.pdf")
```
-The chart below shows several boxplots, one for each level of the class variable in the mpg data set. Each boxplot represents the distribution of hwy values for points with the given level of class. To make boxplots, use `geom_boxplot()`.
+The chart below shows several boxplots, one for each level of the class variable in the mpg dataset. Each boxplot represents the distribution of hwy values for points with the given level of class. To make boxplots, use `geom_boxplot()`.
```{r fig.height = 3}
ggplot(data = mpg) +
@@ -271,7 +271,7 @@ ggplot(data = diamonds) +
geom_point(aes(x = carat, y = price))
```
-Scatterplots become less useful as the size of your data set grows, because points begin to pile up into areas of uniform black (as above). You can make patterns clear again with `geom_bin2d()`, `geom_hex()`, or `geom_density2d()`.
+Scatterplots become less useful as the size of your dataset grows, because points begin to pile up into areas of uniform black (as above). You can make patterns clear again with `geom_bin2d()`, `geom_hex()`, or `geom_density2d()`.
`geom_bin2d()` and `geom_hex()` divide the coordinate plane into two dimensional bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.
@@ -284,7 +284,7 @@ ggplot(data = diamonds) +
geom_hex(aes(x = carat, y = price))
```
-`geom_density2d()` fits a 2D kernel density estimation to the data and then uses contour lines to highlight areas of high density. It is very useful for overlaying on raw data even when your data set is not big.
+`geom_density2d()` fits a 2D kernel density estimation to the data and then uses contour lines to highlight areas of high density. It is very useful for overlaying on raw data even when your dataset is not big.
```{r}
@@ -382,7 +382,7 @@ Hierarchical clustering uses a simple algorithm to locate groups of points that
3. Treat the new cluster as a point
4. Repeat until all of the points are grouped into a single cluster
-You can visualize the results of the algorithm as a dendrogram, and you can use the dendrogram to divide your data into any number of clusters. The figure below demonstrates how the algorithm would proceed in a two dimensional data set.
+You can visualize the results of the algorithm as a dendrogram, and you can use the dendrogram to divide your data into any number of clusters. The figure below demonstrates how the algorithm would proceed in a two dimensional dataset.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-hclust.pdf")
@@ -399,7 +399,7 @@ iris_hclust <- small_iris %>%
hclust(method = "complete")
```
-Use `plot()` to visualize the results as a dendrogram. Each observation in the data set will appear at the bottom of the dendrogram labeled by its rowname. You can use the labels argument to set the labels to something more informative.
+Use `plot()` to visualize the results as a dendrogram. Each observation in the dataset will appear at the bottom of the dendrogram labeled by its rowname. You can use the labels argument to set the labels to something more informative.
```{r fig.height = 4}
plot(iris_hclust, labels = small_iris$Species)
@@ -409,7 +409,7 @@ To see how near two data points are to each other, trace the paths of the data p
You can split your data into any number of clusters by drawing a horizontal line across the tree. Each vertical branch that the line crosses will represent a cluster that contains all of the points downstream from the branch. Move the line up the y axis to intersect fewer branches (and create fewer clusters); move the line down the y axis to intersect more branches (and create more clusters).
-`cutree()` provides a useful way to split data points into clusters. Give cutree the output of `hclust()` as well as the number of clusters that you want to split the data into. `cutree()` will return a vector of cluster labels for your data set. To visualize the results, map the output of `cutree()` to an aesthetic.
+`cutree()` provides a useful way to split data points into clusters. Give cutree the output of `hclust()` as well as the number of clusters that you want to split the data into. `cutree()` will return a vector of cluster labels for your dataset. To visualize the results, map the output of `cutree()` to an aesthetic.
```{r}
(clusters <- cutree(iris_hclust, 3))
@@ -465,7 +465,7 @@ iris_kmeans <- small_iris %>%
iris_kmeans$cluster
```
-Unlike `hclust()`, the k means algorithm does not provide an intuitive visual interface. Instead, `kmeans()` returns a kmeans class object. Subset the object with `$cluster` to access a list of cluster assignments for your data set, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.
+Unlike `hclust()`, the k means algorithm does not provide an intuitive visual interface. Instead, `kmeans()` returns a kmeans class object. Subset the object with `$cluster` to access a list of cluster assignments for your dataset, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.
```{r}
ggplot(small_iris, aes(x = Sepal.Width, y = Sepal.Length)) +
@@ -535,9 +535,9 @@ I'll postpone teaching you how to fit and interpret models with R until Part 4.
## Exploring further
-> Every data set contains more variables and observations than it displays.
+> Every dataset contains more variables and observations than it displays.
-You now know how to explore the variables displayed in your data set, but you should know that these are not the only variables in your data. Nor are the observations that are displayed in your data the only observations. You can use the values in your data to compute new variables or to measure new (group-level) observations. These new variables and observations provide a further source of insights that you can explore with visualizations, clustering algorithms, and models.
+You now know how to explore the variables displayed in your dataset, but you should know that these are not the only variables in your data. Nor are the observations that are displayed in your data the only observations. You can use the values in your data to compute new variables or to measure new (group-level) observations. These new variables and observations provide a further source of insights that you can explore with visualizations, clustering algorithms, and models.
### To make new variables
@@ -555,7 +555,7 @@ If you are statistically trained, you can use R to extract potential variables w
### To make new observations
-If your data set contains subgroups, you can derive from your data a new data set of observations that describe the subgroups. To do this, first use dplyr's `group_by()` function to group the data into subgroups. Then use dplyr's `summarise()` function to calculate group level statistics. The measures of location, rank and spread listed in Chapter 3 are particularly useful for describing subgroups.
+If your dataset contains subgroups, you can derive from your data a new dataset of observations that describe the subgroups. To do this, first use dplyr's `group_by()` function to group the data into subgroups. Then use dplyr's `summarise()` function to calculate group level statistics. The measures of location, rank and spread listed in Chapter 3 are particularly useful for describing subgroups.
```{r}
mpg %>%
@@ -575,7 +575,7 @@ Due to a quirk of the human cognitive system, the easiest way to spot signal ami
As a term, "data science" has been used in different ways by many people. This fluidity is necessary for a term that describes a wide breadth of activity, as data science does. Nonetheless, you can use the principles in this chapter to build a general model of data science. The model requires one limit to the definition of data science: data science must rely in some way on human judgement applied to data.
-To judge or interpret the information in a data set, you must first comprehend that information, which is difficult to do. The easiest way to comprehend data is to visualize, transform, and model it, a process that we have referred to as Exploratory Data Analysis.
+To judge or interpret the information in a dataset, you must first comprehend that information, which is difficult to do. The easiest way to comprehend data is to visualize, transform, and model it, a process that we have referred to as Exploratory Data Analysis.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-1.png")

View File

@@ -6,7 +6,7 @@ This chapter will teach you how to visualize your data with R and the `ggplot2`
### Prerequisites
-To access the data sets, help pages, and functions that we will use in this chapter, load the `ggplot2` package:
+To access the datasets, help pages, and functions that we will use in this chapter, load the `ggplot2` package:
```{r echo = FALSE, message = FALSE, warning = FALSE}
library(ggplot2)
@@ -21,7 +21,7 @@ library(ggplot2)
Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?
-You can test your answer with the `mpg` data set in the `ggplot2` package. The data set contains observations collected by the EPA on 38 models of car. Among the variables in `mpg` are
+You can test your answer with the `mpg` dataset in the `ggplot2` package. The dataset contains observations collected by the EPA on 38 models of car. Among the variables in `mpg` are
1. `displ` - a car's engine size in litres, and
2. `hwy` - a car's fuel efficiency on the highway in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
@@ -44,13 +44,13 @@ ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
-With `ggplot2`, you begin a plot with the function `ggplot()`. `ggplot()` creates a coordinate system that you can add layers to. The first argument of `ggplot()` is the data set to use in the graph. So `ggplot(data = mpg)` creates an empty graph that will use the `mpg` data set.
+With `ggplot2`, you begin a plot with the function `ggplot()`. `ggplot()` creates a coordinate system that you can add layers to. The first argument of `ggplot()` is the dataset to use in the graph. So `ggplot(data = mpg)` creates an empty graph that will use the `mpg` dataset.
You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to your plot, which creates a scatterplot. `ggplot2` comes with many geom functions that each add a different type of layer to a plot.
-Each geom function in `ggplot2` takes a mapping argument. The mapping argument of your geom function explains where your points should go. You must set `mapping` to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of your plot. `ggplot()` will look for those variables in your data set, `mpg`.
+Each geom function in `ggplot2` takes a mapping argument. The mapping argument of your geom function explains where your points should go. You must set `mapping` to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of your plot. `ggplot()` will look for those variables in your dataset, `mpg`.
-Let's turn this code into a reusable template for making graphs with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a data set, a geom function, or a set of mappings.
+Let's turn this code into a reusable template for making graphs with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a set of mappings.
```{r eval = FALSE}
ggplot(data = <DATA>) +
@@ -69,7 +69,7 @@ In the plot below, one group of points seems to fall outside of the linear trend
knitr::include_graphics("images/visualization-1.png")
```
-Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` data set classifies cars into groups such as compact, midsize, and suv. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and suvs became popular).
+Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` dataset classifies cars into groups such as compact, midsize, and suv. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and suvs became popular).
You can add a third variable, like `class`, to a two dimensional scatterplot by mapping it to an _aesthetic_.
@@ -79,7 +79,7 @@ An aesthetic is a visual property of the objects in your plot. Aesthetics includ
knitr::include_graphics("images/visualization-2.png")
```
-You can convey information about your data by mapping the aesthetics in your plot to the variables in your data set. For example, you can map the colors of your points to the `class` variable to reveal the class of each car.
+You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the `class` variable to reveal the class of each car.
```{r}
ggplot(data = mpg) +
@@ -169,7 +169,7 @@ If you get an odd result, double check that you are calling the aesthetic as its
### Exercises
-Now that you know how to use aesthetics, take a moment to experiment with the `mpg` data set.
+Now that you know how to use aesthetics, take a moment to experiment with the `mpg` dataset.
1. Map a discrete variable to `color`, `size`, `alpha`, and `shape`. Then map a continuous variable to each. Does `ggplot2` behave differently for discrete vs. continuous variables?
+ The discrete variables in `mpg` are: `manufacturer`, `model`, `trans`, `drv`, `fl`, `class`
@@ -179,7 +179,7 @@ Now that you know how to use aesthetics, take a moment to experiment with the `m
***
-**Tip** - See the help page for `geom_point()` (by running `?geom_point`) to learn which aesthetics are available to use in a scatterplot. See the help page for the `mpg` data set (`?mpg`) to learn which variables are in the data set.
+**Tip** - See the help page for `geom_point()` (by running `?geom_point`) to learn which aesthetics are available to use in a scatterplot. See the help page for the `mpg` dataset (`?mpg`) to learn which variables are in the dataset.
***
@@ -295,7 +295,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth()
```
-You can use the same system to specify individual data sets for each layer. Here, our smooth line displays just a subset of the `mpg` data set, the subcompact cars. The local data argument in `geom_smooth()` overrides the global data argument in `ggplot()` for the smooth layer only.
+You can use the same system to specify individual datasets for each layer. Here, our smooth line displays just a subset of the `mpg` dataset, the subcompact cars. The local data argument in `geom_smooth()` overrides the global data argument in `ggplot()` for the smooth layer only.
```{r, message = FALSE, warning = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
@@ -335,7 +335,7 @@ ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
-The chart above displays the total number of diamonds in the `diamonds` data set, grouped by `cut`. The `diamonds` data set comes in `ggplot2` and contains information about 53,940 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
+The chart above displays the total number of diamonds in the `diamonds` dataset, grouped by `cut`. The `diamonds` dataset comes in `ggplot2` and contains information about 53,940 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
A bar has different visual properties than a point, which can create some surprises. For example, how would you create this simple chart? If you have an R session open, give it a try.
@@ -429,7 +429,7 @@ ggplot(data = diamonds) +
### Position = "jitter"
-The last type of position adjustment does not make sense for bar charts, but it can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the data set?
+The last type of position adjustment does not make sense for bar charts, but it can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?
```{r echo = FALSE}
ggplot(data = mpg) +
@@ -461,7 +461,7 @@ ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
-On the x axis, the chart displays `cut`, a variable in the `diamonds` data set. On the y axis, it displays count; but count is not a variable in the diamonds data set:
+On the x axis, the chart displays `cut`, a variable in the `diamonds` dataset. On the y axis, it displays count; but count is not a variable in the diamonds dataset:
```{r}
head(diamonds)
@@ -469,7 +469,7 @@ head(diamonds)
Where does count come from?
-Some graphs, like scatterplots, plot the raw values of your data set. Other graphs, like bar charts, calculate new values to plot.
+Some graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot.
* **bar charts** and **histograms** bin your data and then plot bin counts, the number of points that fall in each bin.
* **smooth lines** fit a model to your data and then plot the model line.
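One way to see that a bar chart computes new values is to do the computation yourself. The sketch below, which assumes the precomputed counts match the default "count" stat, tabulates the diamonds with dplyr and plots the table directly:

```{r}
# A sketch: compute the bin counts yourself, then plot them as-is with
# stat = "identity". The result should match the default geom_bar()
# chart, where the "count" stat does this tabulation for you.
library(ggplot2)
library(dplyr)

cut_counts <- diamonds %>% count(cut)
ggplot(data = cut_counts) +
  geom_bar(mapping = aes(x = cut, y = n), stat = "identity")
```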
@@ -525,7 +525,7 @@ ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = clarity))
```
-The help page of `?stat_sum` reveals that the sum stat creates two variables, n (count) and prop. By default, `geom_count()` uses the n variable to create the size of each bubble. To tell `geom_count()` to use the prop variable, map $size$ to `..prop..`. The two dots that surround prop notify `ggplot2` that the prop variable appears in the transformed data set that is created by the stat, and not in the raw data set. Be sure to include these dots whenever you refer to a variable that is created by a stat.
+The help page of `?stat_sum` reveals that the sum stat creates two variables, n (count) and prop. By default, `geom_count()` uses the n variable to create the size of each bubble. To tell `geom_count()` to use the prop variable, map $size$ to `..prop..`. The two dots that surround prop notify `ggplot2` that the prop variable appears in the transformed dataset that is created by the stat, and not in the raw dataset. Be sure to include these dots whenever you refer to a variable that is created by a stat.
```{r}
ggplot(data = diamonds) +
@@ -612,7 +612,7 @@ ggplot(data = diamonds) +
facet_grid(color ~ clarity)
```
-Here the first subplot displays all of the points that have an `I1` code for `clarity` _and_ a `D` code for `color`. Don't be confused by the word color here; `color` is a variable name in the `diamonds` data set. It contains the codes `D`, `E`, `F`, `G`, `H`, `I`, and `J`. `facet_grid(color ~ clarity)` is not invoking the color aesthetic.
+Here the first subplot displays all of the points that have an `I1` code for `clarity` _and_ a `D` code for `color`. Don't be confused by the word color here; `color` is a variable name in the `diamonds` dataset. It contains the codes `D`, `E`, `F`, `G`, `H`, `I`, and `J`. `facet_grid(color ~ clarity)` is not invoking the color aesthetic.
If you prefer to not facet on the rows or columns dimension, place a `.` instead of a variable name before or after the `~`, e.g. `+ facet_grid(. ~ clarity)`.
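As a quick sketch of the dot syntax (the choice of `clarity` here is just for illustration), this facets the diamonds scatterplot on columns only:

```{r}
# A dot before the ~ leaves the rows dimension unfaceted, so each
# subplot shows one clarity level and the plot has a single row.
library(ggplot2)

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  facet_grid(. ~ clarity)
```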
@@ -656,9 +656,9 @@ ggplot(data = <DATA>) +
Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because `ggplot2` will provide useful defaults for everything except the data, the mappings, and the geom function.
-The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe _any_ plot as a combination of a data set, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.
+The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe _any_ plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.
-To see how this works, consider how you could build a basic plot from scratch: you could start with a data set and then transform it into the information that you want to display (with a stat).
+To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat).
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-grammar-1.png")
@@ -670,7 +670,7 @@ Next, you could choose a geometric object to represent each observation in the t
knitr::include_graphics("images/visualization-grammar-2.png")
```
-You'd then select a coordinate system to place the geoms into. You'd use the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (facetting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a data set, a geom, a set of mappings, a stat, and a position adjustment.
+You'd then select a coordinate system to place the geoms into. You'd use the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (facetting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-grammar-3.png")