EDA changes

hadley 2016-07-22 16:00:11 -05:00
parent 77f5dde93d
commit 959cac4d08
1 changed files with 66 additions and 60 deletions

EDA.Rmd

@ -2,26 +2,23 @@
## Introduction
This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call Exploratory Data Analysis, or EDA for short. EDA is an iterative cycle that involves:
This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:
1. Forming questions about your data.
1. Generate questions about your data.
1. Searching for answers by visualizing, transforming, and modeling your data.
1. Search for answers by visualizing, transforming, and modeling your data.
1. Using what you discover to refine your questions about the data, or
to choose new questions to investigate
1. Use what you learn to refine your questions and/or generate new questions.
EDA is not a formal process with a strict set of rules: you must be free to investigate every idea that occurs to you. Instead, EDA is a loose set of tactics that are more likely to lead to useful insights. This chapter will teach you a basic toolkit of these useful EDA techniques. Our discussion will lead to a model of data science itself, the model that I've built this book around.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues you will home in on a few particularly productive areas that you'll eventually write up and communicate to others.
This chapter will point you towards many other interesting packages, more so than any other chapter in the book.
Also recommend the ggplot2 book <https://amzn.com/331924275X>. The 2nd edition was recently published so it's up-to-date. Contains a lot more details on visualisation. Unfortunately it's not free, but if you're at a university you can get electronic version for free through SpringerLink. This book doesn't contain as much visualisation as it probably should because you can use ggplot2 book as a reference as well.
EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether or not your data meets your expectations. To do data cleaning, you'll need to deploy all the tools of EDA: visualisation, transformation, and modelling.
### Prerequisites
In this chapter we'll combine what you've learned about dplyr and ggplot2 to iteratively ask questions, answer them with data, and then ask new questions.
```{r setup}
```{r setup, message = FALSE}
library(ggplot2)
library(dplyr)
```
@ -33,9 +30,9 @@ library(dplyr)
> "Far better an approximate answer to the right question, which is often
> vague, than an exact answer to the wrong question, which can always be made
> precise." ---John Tukey
> precise." --- John Tukey
Your goal during EDA is to develop your understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transforamtions to make.
Your goal during EDA is to develop your understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.
EDA is fundamentally a creative process. And like most creative processes, the key to asking _quality_ questions is to generate a large _quantity_ of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.
@ -58,18 +55,16 @@ The rest of this chapter will look at these two questions. I'll explain what var
each associated with a different variable. I'll sometimes refer to
an observation as a data point.
* _Tabular data_ is a set of values, each associated with a variable and an
* __Tabular data__ is a set of values, each associated with a variable and an
observation. Tabular data is _tidy_ if each value is placed in its own
"cell", each variable in its own column, and each observation in its own
row.
For now, assume all the data you see in this book is tidy. You'll encounter lots of other data in practice, so we'll come back to these ideas again in [tidy data] where you'll learn how to tidy messy data.
All the data you've seen so far has been tidy. In real life, most data isn't tidy, so we'll come back to these ideas again in [tidy data].
## Variation
> "What type of variation occurs within my variables?"
**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice---and precisely enough, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light (below). Each of your measurements will include a small amount of error that varies from measurement to measurement.
**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light (below). Each of your measurements will include a small amount of error that varies from measurement to measurement.
```{r, variation, echo = FALSE}
old <- options(digits = 7)
@ -82,33 +77,39 @@ knitr::kable(
options(old)
```
Discrete and categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the values that you observe for the variable.
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the variable's values.
### Visualizing distributions
How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors, integers, or character strings. To examine the distribution of a categorical variable, use a bar chart.
How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart:
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```
The height of the bars displays how many observations occurred with each x value. You can compute these values manually with `dplyr::count()`.
The height of the bars displays how many observations occurred with each x value. You can compute these values manually with `dplyr::count()`:
```{r}
diamonds %>% count(cut)
```
A variable is **continuous** if you can arrange its values in order _and_ an infinite number of unique values can exist between any two values of the variable. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram.
A variable is **continuous** if you can arrange its values in order _and_ an infinite number of unique values can exist between any two values of the variable. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
```{r}
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.5)
```
A histogram divides the x axis into equally spaced intervals and then uses a bar to display how many observations fall into each interval. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.
You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_width()`:
```{r}
diamonds %>% count(cut_width(carat, 0.5))
```
A histogram divides the x axis into equally spaced bins and then uses the height of each bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.
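To make the binning concrete, here is a minimal base R sketch that counts a handful of values by hand with `cut()` and `table()`. The vector `carat_sample` and the break points are assumptions made up for illustration; they are not taken from `diamonds`. The breaks mimic bins 0.5 wide with edges at 0.25, 0.75, and so on, matching the bar edges described above.

```r
# Hypothetical carat values, invented for this sketch.
carat_sample <- c(0.23, 0.31, 0.7, 0.9, 1.01, 1.2, 1.5, 2.0)

# Bin edges 0.5 apart: -0.25, 0.25, 0.75, 1.25, 1.75, 2.25.
breaks <- seq(-0.25, 2.25, by = 0.5)

# Assign each value to its bin, then count the values per bin.
# These counts are what a histogram would draw as bar heights.
bins <- cut(carat_sample, breaks = breaks)
bin_counts <- table(bins)
bin_counts
```

Changing `by` in `seq()` plays the same role as changing `binwidth` in `geom_histogram()`: narrower bins spread the same observations over more, shorter bars.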
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the $x$ variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
@ -130,7 +131,7 @@ Now that you can visualize variation, what should you look for in your plots? An
### Typical values
In both bar charts and histograms, tall bars reveal common values of a variable. Shorter bars reveal rarer values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:
In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:
* Which values are the most common? Why?
@ -169,20 +170,18 @@ ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
```
Many of the questions above will prompt you to explore a relationship *between* variables, for example, to see if the values of one variable can explain the behavior of another variable.
Many of the questions above will prompt you to explore a relationship *between* variables, for example, to see if the values of one variable can explain the behavior of another variable. We'll get to that shortly.
### Unusual values
Outliers are observations that are unusual; data points that don't seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram.
For example, take this distribution of the `y` variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.
Outliers are observations that are unusual; data points that don't seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the `y` variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.
```{r}
ggplot(diamonds) +
  geom_histogram(aes(x = y), binwidth = 0.5)
```
This is because there are so many observations in the common bins that the rare bins are so short that you can't see them (although maybe if you stare intently at 0 you'll spot something). To make it easy to see the unusual values, we need to zoom in to small values of the y-axis with `coord_cartesian()`:
There are so many observations in the common bins that the rare bins are so short that you can't see them (although maybe if you stare intently at 0 you'll spot something). To make it easy to see the unusual values, we need to zoom in to small values of the y-axis with `coord_cartesian()`:
```{r}
ggplot(diamonds) +
@ -201,9 +200,9 @@ unusual <- diamonds %>%
unusual
```
The y variable measures one of the three dimensions of these diamonds in mm. We know that diamonds can't have a 0 measurement. So these must be invalid measurements. We might also suspect that measureents of 32mm and 59mm are implausible: those diamonds are over an inch long, but don't cost hundreds of thousands of dollars!
The `y` variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can't have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don't cost hundreds of thousands of dollars!
When you discover an outlier it's a good idea to trace it back as far as possible. You'll be in a much stronger analytical position if you can figure out why it happened. If you can't figure it out, and want to just move on with your analysis, it's a good idea to replace it with a missing value, which we'll discuss in the next section.
When you discover an outlier it's a good idea to trace it back as far as possible. You'll be in a much stronger analytical position if you can figure out why it happened. If you can't figure it out, and want to just move on with your analysis, replace it with a missing value, which we'll discuss in the next section.
### Exercises
@ -215,8 +214,6 @@ When you discover an outlier it's a good idea to trace it back as far as possibl
or surprising? (Hint: carefully think about the `binwidth` and make sure
you)
1. Explore the distribution of `carat`. What do you think drives the pattern?
1. How many diamonds have 0.99 carats? Why?
1. Compare and contrast `coord_cartesian()` vs `xlim()`/`ylim()` when
@ -235,8 +232,8 @@ If you've encountered unusual values in your dataset, and simply want to move on
I don't recommend this option: just because one measurement
is invalid doesn't mean all the measurements are. Additionally, if you
have very noisy data, you might find by time that you've applied this
approach to every variable that you don't have any data left!
have low quality data, by the time that you've applied this approach to every
variable you might find that you don't have any data left!
1. Instead, I recommend replacing the unusual values with missing values.
The easiest way to do this is use `mutate()` to replace the variable
@ -248,7 +245,9 @@ If you've encountered unusual values in your dataset, and simply want to move on
mutate(y = ifelse(y < 3 | y > 20, NA, y))
```
ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but does warn that they've been removed:
`ifelse()` has three arguments. The first argument, `test`, should be a logical vector. The result will contain the value of the second argument, `yes`, when `test` is `TRUE`, and the value of the third argument, `no`, when it is `FALSE`.
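As a quick illustration of those three arguments, here is a minimal sketch on a made-up vector. The values in `y` are invented for this example and are not taken from `diamonds`; the thresholds simply mirror the ones used in the `mutate()` call above.

```r
# Hypothetical measurements in mm; 0 and 59 stand in for implausible values.
y <- c(4.2, 0, 5.1, 59, 3.9)

# test: y < 3 | y > 20 (a logical vector)
# yes:  NA, used wherever test is TRUE
# no:   the original value of y, used wherever test is FALSE
cleaned <- ifelse(y < 3 | y > 20, NA, y)
cleaned
```

Only the second and fourth elements become `NA`; the others pass through unchanged, which is why this approach keeps the rest of each row intact.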
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but does warn that they've been removed:
```{r, dev = "png"}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
@ -269,7 +268,7 @@ nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
@ -287,13 +286,11 @@ However this plot isn't great because there are many more non-cancelled flights
## Covariation
> "What type of covariation occurs between my variables?"
If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.
If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a correlated way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.
### A categorical and continuous variable
### Categorical + continuous
It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous histogram. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
```{r}
ggplot(data = diamonds, mapping = aes(x = price)) +
@ -302,7 +299,7 @@ ggplot(data = diamonds, mapping = aes(x = price)) +
It's hard to see the difference in distribution because the overall counts differ so much:
```{r}
```{r, fig.width = 4, out.width = "50%"}
ggplot(diamonds, aes(cut)) +
  geom_bar()
```
@ -352,7 +349,7 @@ ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
```
Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median hwy value of each group.
Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, reorder the $x$ variable with `reorder()`. This code reorders `class` based on the median value of `hwy` in each group.
```{r fig.height = 3}
ggplot(data = mpg) +
@ -374,8 +371,8 @@ ggplot(data = mpg) +
1. What variable in the diamonds dataset is most important for predicting
the price of a diamond? How is that variable correlated with cut?
Why does that combination lead to lower quality diamonds being more
expensive.
Why does the combination of those two relationships lead to lower quality
diamonds being more expensive?
1. Install the ggstance package, and create a horizontal boxplot.
How does this compare to using `coord_flip()`?
@ -388,7 +385,7 @@ ggplot(data = mpg) +
do you learn? How do you interpret the plots?
1. Compare and contrast `geom_violin()` with a facetted `geom_histogram()`,
or coloured `geom_freqpoly()`. What are the pros and cons of each
or a coloured `geom_freqpoly()`. What are the pros and cons of each
method?
1. If you have a small dataset, it's sometimes useful to use `geom_jitter()`
@ -396,44 +393,47 @@ ggplot(data = mpg) +
The ggbeeswarm package provides a number of methods similar to
`geom_jitter()`. List them and briefly describe what each one does.
### Categorical x2
### Two categorical variables
There are two basic techniques for visulaising covariation between categorical variables. One is to count the number of observations at each location and display the count with the size of a point. That's the job of `geom_count()`:
To visualise the covariation between categorical variables, you'll need to count the number of observations for each combination. One way to do that is to rely on the built-in `geom_count()`:
```{r}
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
```
The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values. As with bar charts, you can calculate the specific values with `count()`.
The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.
Another approach is to compute the count with dplyr:
```{r}
diamonds %>% count(color, cut)
```
This allows you to reproduce `geom_count()` by hand, or instead of mapping count to `size`, you could instead use `geom_raster()` and map count to `fill`:
Then visualise with `geom_tile()` and the fill aesthetic:
```{r}
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_raster(aes(fill = n))
geom_tile(aes(fill = n))
```
If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
#### Exercises
1. How could you rescale the count dataset above to more clearly see
the differences across colours or across cuts?
1. How could you rescale the count dataset above to more clearly show
the distribution of cut within colour, or colour within cut?
1. Use `geom_raster()` together with dplyr to explore how average flight
delays vary by destination and month of year.
1. Use `geom_tile()` together with dplyr to explore how average flight
delays vary by destination and month of year. What makes the
plot difficult to read? How could you improve it?
1. Why is it slightly better to use `aes(x = color, y = cut)` rather
than `aes(x = cut, y = color)` in the example above?
### Continuous x2
### Two continuous variables
You've already seen one great way to visualise the covariation between two continuous variables: draw a scatterplot with `geom_point()`. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.
@ -529,7 +529,7 @@ ggplot(data = faithful) +
Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
Models are a rich tool for extracting patterns out of data. For example, consider the diamonds data. It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.
Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain:
```{r, dev = "png"}
library(modelr)
@ -544,12 +544,14 @@ ggplot(data = diamonds2, mapping = aes(x = carat, y = resid)) +
geom_point()
```
Once you've removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
```{r}
ggplot(data = diamonds2, mapping = aes(x = cut, y = resid)) +
  geom_boxplot()
```
Modelling is important because once you have recognised a pattern, a model allows you to make that pattern quantitative and precise, and partition it out from what remains. That supports a powerful iterative approach where you identify a pattern with visualisation, then subtract it with a model, allowing you to see the subtler trends that remain. I deliberately chose not to teach modelling yet, because understanding what models are and how they work is easiest once you have some other tools in hand: data wrangling and programming.
You haven't learned more about modelling yet because understanding what models are and how they work is easiest once you have the tools of data wrangling and programming in hand.
## ggplot2 calls
@ -560,7 +562,9 @@ ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)
```
But the first couple of arguments to a function are typically so important that you should know them by heart. The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`. In the remainder of the book, we won't supply those names. That saves typing and by reducing the amount of boilerplate makes it easier to see what's different between plots (that's a rely important programming concern that we'll come back in [functions]).
Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`. In the remainder of the book, we won't supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots. That's a really important programming concern that we'll come back to in [functions].
Rewriting the previous plot more concisely yields:
```{r, eval = FALSE}
ggplot(faithful, aes(eruptions)) +
@ -575,3 +579,5 @@ diamonds %>%
ggplot(aes(clarity, cut, fill = n)) +
geom_tile()
```
If you want to learn more about ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>. It's been recently updated, so it includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation. Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.