Complete pass over EDA chapter

This commit is contained in:
hadley 2016-07-18 08:47:52 -05:00
parent 5aa1ec38a8
commit dfc4e95c60
9 changed files with 83 additions and 97 deletions

(Binary image changes, not shown: images/EDA-boxplot.png added; one image updated; several images removed.)

@ -62,15 +62,13 @@ The rest of this chapter will look at these two questions. I'll explain what var
each associated with a different variable. I'll sometimes refer to
an observation as a data point.
* _Tabular data_ is a set of values, each associated with a variable and an
observation. Tabular data is _tidy_ if each value is placed in its own
"cell", each variable in its own column, and each observation in its own
row.
For now, assume all the data you see in this book is tidy. You'll encounter lots of other data in practice, so we'll come back to these ideas again in [tidy data] where you'll learn how to tidy messy data.
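For instance, a minimal tidy dataset (a hypothetical illustration, not from the chapter) looks like this, with one variable per column, one observation per row, and one value per cell:

```{r}
library(tibble)

# Three observations (rows) of three variables (columns);
# every cell holds exactly one value.
tibble(
  name   = c("a", "b", "c"),
  height = c(1.62, 1.75, 1.80),
  mass   = c(55, 72, 80)
)
```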
## Variation
> "What type of variation occurs within my variables?"
@ -220,8 +218,12 @@ When you discover an outlier it's a good idea to trace it back as far as possibl
1. Explore the distribution of `price`. Do you discover anything unusual
   or surprising? (Hint: carefully think about the `binwidth` and make sure
   you try a wide range of values.)
1. Explore the distribution of `carat`. What do you think drives the pattern?
1. How many diamonds have 0.99 carats? Why?
1. Compare and contrast `coord_cartesian()` vs `xlim()`/`ylim()` when
zooming in on a histogram. What happens if you leave `binwidth` unset?
What happens if you try and zoom so only half a bar shows?
@ -259,7 +261,7 @@ ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
To suppress that warning, set `na.rm = TRUE`:
```{r, eval = FALSE}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
```
@ -278,12 +280,14 @@ nycflights13::flights %>%
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
```
However this plot isn't great because there are many more non-cancelled flights than cancelled flights. In the next section we'll explore some techniques for improving this comparison.
### Exercises
1. What happens to missing values in a histogram? What happens to missing
   values in a bar chart? Why is there a difference?
1. What does `na.rm = TRUE` do in `mean()` and `sum()`?
## Covariation
@ -291,7 +295,7 @@ However this plot isn't great because there are many more non-cancelled flights
If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a correlated way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.
### Categorical + continuous
It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous histogram. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
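The chapter's own code is elided from this diff; a minimal sketch of such a comparison (frequency polygons of price, coloured by cut, assuming ggplot2 is loaded as elsewhere in the chapter) might be:

```{r}
# One frequency polygon per cut; heights are raw counts,
# so small groups are hard to compare against large ones.
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
```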
@ -369,12 +373,16 @@ ggplot(data = mpg) +
#### Exercises
1. Use what you've learned to improve the visualisation of the departure times
of cancelled vs. non-cancelled flights.
1. What variable in the diamonds dataset is most important for predicting
   the price of a diamond? How is that variable correlated with cut?
   Why does that combination lead to lower quality diamonds being more
   expensive?
1. Install the ggstance package, and create a horizontal boxplot.
How does this compare to using `coord_flip()`?
1. One problem with boxplots is that they were developed in an era of
   much smaller datasets and tend to display a prohibitively large
@ -392,7 +400,7 @@ ggplot(data = mpg) +
The ggbeeswarm package provides a number of methods similar to
`geom_jitter()`. List them and briefly describe what each one does.
### Categorical x2
There are two basic techniques for visualising covariation between categorical variables. One is to count the number of observations at each location and display the count with the size of a point. That's the job of `geom_count()`:
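The example itself is elided from this diff; a minimal sketch with the diamonds data might be:

```{r}
# Larger points mark cut/colour combinations with more diamonds.
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
```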
@ -420,26 +428,36 @@ If the categorical variables are unordered, you might want to use the seriation
#### Exercises
1. How could you rescale the count dataset above to more clearly see
   the differences across colours or across cuts?

1. Use `geom_raster()` together with dplyr to explore how average flight
   delays vary by destination and month of year.

1. Why is it slightly better to use `aes(x = color, y = cut)` rather
   than `aes(x = cut, y = color)` in the example above?

### Continuous x2

You've already seen one great way to visualise the covariation between two continuous variables: draw a scatterplot with `geom_point()`. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.
```{r}
ggplot(data = diamonds) +
geom_point(aes(x = carat, y = price))
```
Scatterplots become less useful as the size of your dataset grows, because points begin to pile up into areas of uniform black (as above). This problem is known as __overplotting__. A similar problem arises if you use a scatterplot to show the distribution of price by cut:
```{r}
ggplot(data = diamonds, mapping = aes(x = price, y = cut)) +
geom_point()
```
And we can fix it in the same way: by using binning. Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension. Now you'll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.
`geom_bin2d()` and `geom_hex()` divide the coordinate plane into two dimensional bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.
```{r fig.show='hold', fig.asp = 1, out.width = "50%"}
ggplot(data = smaller) +
geom_bin2d(aes(x = carat, y = price))
@ -448,43 +466,51 @@ ggplot(data = smaller) +
geom_hex(aes(x = carat, y = price))
```
Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a discrete and a continuous variable that you learned about. For example, you could bin `carat` and then for each group display a boxplot:
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1)))
```
`cut_width(x, width)`, as used above, divides `x` into bins of width `width`. By default, boxplots look roughly the same (apart from the number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points, with `varwidth = TRUE`.
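As a minimal sketch, reusing the binned boxplot and the `smaller` dataset from above:

```{r}
# varwidth = TRUE scales each box's width with the number of
# observations it summarises.
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)
```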
Another approach is to display approximately the same number of points in each bin. That's the job of `cut_number()`:
```{r}
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_number(carat, 20)))
```
#### Exercises
1. Instead of summarising the conditional distribution with a boxplot, you
could use a frequency polygon. What do you need to consider when using
`cut_width()` vs `cut_number()`? How does that impact a visualisation of
the 2d distribution of `carat` and `price`?
1. Visualise the distribution of carat, partitioned by price.

1. How does the price distribution of very large diamonds compare to small
   diamonds? Is it as you expect, or does it surprise you?

1. Combine two of the techniques you've learned to visualise the
   combined distribution of cut, carat, and price.
1. Two dimensional plots reveal outliers that are not visible in one
dimensional plots. For example, some points in the plot below have an
unusual combination of $x$ and $y$ values, which makes the points outliers
even though their $x$ and $y$ values appear normal when examined separately.
```{r}
ggplot(data = diamonds) +
geom_point(aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
```
Why is a scatterplot a better display than a binned plot for this case?
## Patterns and models
Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
@ -500,18 +526,14 @@ Patterns in your data provide clues about relationships. If a systematic relatio
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.
```{r fig.height = 2}
ggplot(data = faithful) +
  geom_point(aes(x = eruptions, y = waiting))
```
Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
Models are a rich tool for extracting patterns out of data. For example, consider the diamonds data. It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.
```{r}
library(modelr)
@ -531,58 +553,22 @@ ggplot(data = diamonds2, mapping = aes(x = cut, y = resid)) +
geom_boxplot()
```
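The model-fitting code is elided from this diff; a minimal sketch of the approach (assuming a log-log linear model of price on carat, consistent with the residual boxplot above) might be:

```{r}
library(dplyr)
library(modelr)

# diamonds comes from ggplot2, loaded earlier in the chapter.
# Fit a model that captures the strong carat-price relationship,
# then work with what it doesn't explain: the residuals.
mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>%
  add_residuals(mod) %>%
  mutate(resid = exp(resid))  # back-transform to the dollar scale
```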
## What's next?

__Part 1__ (this part) of the book has given you the basic tools to do data science. Just by knowing how to transform and visualise data, there are a tremendous number of insights that you can discover. And somewhat counterintuitively, these tools scale really well to big data: the bigger the data, the more important simple tools like binning and counting become.
To see what's coming up in the rest of the book, it's useful to refer back to my model of data science:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science.png")
```
The main tool that you are missing is modelling. Modelling is important because once you have recognised a pattern, a model allows you to make that pattern quantitative and precise, and partition it out from what remains. That supports a powerful iterative approach where you identify a pattern with visualisation, then subtract it out with a model, allowing you to see the subtler trends that remain. I deliberately chose not to teach modelling yet, because understanding what models are and how they work is easiest once you have some other tools in hand: data wrangling, and programming.

__Part 2__, up next, covers data wrangling. So far we've focussed on datasets that are already in the right form in R. In real life, you'll need tools to get your data into R (import it), organise it into a consistent format (tidy it), and then use specialised tools for specialised types of data (like strings and dates).
__Part 3__ teaches you more about programming. All of this work will involve a computer; you cannot do it in your head, nor with paper and pencil. To work efficiently, you will need to know how to program in a computer language, such as R.
Now we can return to modelling in __Part 4__. You'll use your new tools of data wrangling and programming to fit many models and understand how they work. The focus of this book is on exploration, not confirmation or formal inference. But you'll learn a few basic tools that help you understand the variation within your models.
By the successful completion of a data science project, you will have built up a good understanding of what is going on with the data. But it doesn't matter how brilliant your understanding is unless you can communicate it to others: you will need to share your work in a way that your audience can understand. Your audience might be fellow scientists who will want to reproduce the work, non-scientists who will want to understand your findings in plain terms, or yourself (in the future) who will be thankful if you make your work easy to re-learn and recreate. __Part 5__ discusses communication, and how you can use R Markdown to generate reproducible artefacts that combine prose and code.