Merge branch 'master' of github.com:hadley/r4ds

# Conflicts:
#	model-vis.Rmd
This commit is contained in:
hadley 2016-05-26 09:44:58 -05:00
commit 3e8447c62d
21 changed files with 1114 additions and 444 deletions

View File

@ -11,6 +11,7 @@ Imports:
bookdown,
broom,
dplyr,
DSR,
ggplot2,
hexbin,
htmltools,
@ -19,7 +20,6 @@ Imports:
jsonlite,
Lahman,
lubridate,
modelr,
knitr,
maps,
microbenchmark,
@ -35,10 +35,9 @@ Imports:
Remotes:
gaborcsardi/rcorpora,
garrettgman/DSR,
hadley/modelr,
hadley/purrr,
hadley/tidyr,
hadley/stringr,
hadley/ggplot2,
hadley/nycflights13,
yihui/knitr,
rstudio/bookdown

View File

@ -1,11 +1,12 @@
# Communication with plots
*Section 2* will show you how to prepare your plots for communication. You'll learn how to make your plots more legible with titles, labels, zooming, and default visual themes.
```{r echo = FALSE, message = FALSE, warning = FALSE}
library(ggplot2)
```
The previous sections showed you how to make plots that you can use as tools for _exploration_. When you made these plots, you knew---even before you looked at them---which variables the plot would display and which data sets the variables would come from. You might have even known what to look for in the completed plots, assuming that you made each plot with a goal in mind. As a result, it was not very important to put a title or a useful set of labels on your plots.
The importance of titles and labels changes once you use your plots for _communication_. Your audience will not share your background knowledge. In fact, they may not know anything about your plots except what the plots themselves display. If you want your plots to communicate your findings effectively, you will need to make them as self-explanatory as possible.

View File

@ -398,7 +398,7 @@ datetimes %>%
### Setting dates
You can also use each accessor function to set the components of a date or datetime.
```{r}
datetime

View File

@ -1,4 +1,9 @@
# Dynamic Documents with R Markdown
...
## Output formats
## Slide syntax
## Customizing output
## Tables
## Citations and bibliographies
## Interactive documents
## Templates

View File

@ -2,4 +2,33 @@
# Introduction
Data poses a cognitive problem; data comprehension is a skill.
```{r, include = FALSE}
library(ggplot2)
library(dplyr)
```
If you are like most humans, your brain is not designed to work with raw data. The working memory can only attend to a few values at a time, which makes it difficult to discover patterns in raw data. For example, can you spot the striking relationship between $X$ and $Y$ in the table below?
```{r data, echo=FALSE}
# x values spread between 0 and 2, with a little jitter, plus two fixed endpoints
x <- rep(seq(0.2, 1.8, length = 5), 2) + runif(10, -0.15, 0.15)
X <- c(0.02, x, 1.94)
# matching y values on a circle of radius 1 centred at (1, -1);
# the first half are flipped onto the lower semicircle
Y <- sqrt(1 - (X - 1)^2)
Y[1:6] <- -1 * Y[1:6]
Y <- Y - 1
# shuffle the rows so the pattern isn't obvious in the printed table
order <- sample(1:10)
knitr::kable(round(data.frame(X = X[order], Y = Y[order]), 2))
```
While we may stumble over raw data, we can easily process visual information. Within your mind is a visual processing system that has been fine-tuned by thousands of years of evolution. As a result, the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values above fall on a circle.
```{r echo=FALSE, dependson=data}
ggplot2::qplot(X, Y) + ggplot2::coord_fixed(ylim = c(-2.5, 2.5), xlim = c(-2.5, 2.5))
```
Visualization works because your brain processes visual information in a different (and much wider) channel than it processes symbolic information, like words and numbers. However, visualization is not the only way to comprehend data.
You can also comprehend data by transforming it. You can easily attend to a small set of summary values, which lets you absorb important information about the data. This is why it feels natural to work with things like averages, maximums, minimums, medians, and so on.
Another way to summarize your data is to replace it with a model, a function that describes the relationships between two or more variables. You can attend to the important parts of a model more easily than you can attend to the raw values in your data set.
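To make that concrete, here's a minimal sketch of both ideas; the `mtcars` data set and the `mpg ~ wt` model are used only as an illustration, not as part of this chapter's running example:
```{r}
# A handful of summary values is far easier to attend to than the raw rows
mtcars %>%
  summarise(
    mean_mpg   = mean(mpg),
    median_mpg = median(mpg),
    min_mpg    = min(mpg),
    max_mpg    = max(mpg)
  )

# A model is another kind of summary: a function describing how mpg relates to weight
lm(mpg ~ wt, data = mtcars)
```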
The first problem in Data Science is a cognitive problem: how can you understand your own data? In this part of the book, you'll learn how to use R to discover and understand the information contained in your data.

(Binary image files added, including images/EDA-boxplot.pdf and images/EDA-plotly.png; several other binary images not shown.)

View File

@ -17,11 +17,11 @@ In [functions], we talked about how important it is to reduce duplication in you
1. You're likely to have fewer bugs because each line of code is
used in more places.
One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into independent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.)
In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming, and the machinery each provides. On the imperative side you have things like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and include quite a bit of book-keeping code that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors.
Some people will tell you to avoid for loops because they are slow. They're wrong! (Well, at least they're rather out of date, as for loops haven't been slow for many years.) The chief benefit of using FP functions like `lapply()` or `purrr::map()` is that they are more expressive and make code both easier to write and easier to read.
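Here's a minimal sketch of the same computation written both ways (the list `x` here is made up purely for illustration):
```{r}
library(purrr)

x <- list(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# Imperative style: a for loop with explicit book-keeping for the output
out <- vector("double", length(x))
for (i in seq_along(x)) {
  out[[i]] <- mean(x[[i]])
}
out

# Functional style: the pattern "apply mean to every element" is a single call
map_dbl(x, mean)
```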
In later chapters you'll learn how to apply these iterating ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you'll learn in this chapter will be invaluable.
@ -116,7 +116,7 @@ That's all there is to the for loop! Now is a good time to practice creating som
1. Compute the mean of every column in `mtcars`.
1. Determine the type of each column in `nycflights13::flights`.
1. Compute the number of unique values in each column of `iris`.
1. Generate 10 random normals for each of $\mu = -10$, $0$, $10$, and $100$.
Think about output, sequence, and body, __before__ you start writing
the loop.
@ -248,7 +248,7 @@ for (i in seq_along(x)) {
### Unknown output length
Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:
```{r}
means <- c(0, 1, 2)
@ -261,7 +261,7 @@ for (i in seq_along(means)) {
str(output)
```
But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. In technical terms you get "quadratic" ($O(n^2)$) behaviour, which means that a loop with three times as many elements would take nine times ($3^2$) as long to run.
A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:
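The code that originally follows this sentence isn't shown in this hunk; a rough sketch of the idea, reusing `means` from the previous chunk, might look like this:
```{r}
out <- vector("list", length(means))  # pre-allocate a list, one element per mean
for (i in seq_along(means)) {
  n <- sample(100, 1)                 # a random length for this iteration
  out[[i]] <- rnorm(n, means[[i]])
}
str(unlist(out))                      # flatten the list into a single vector at the end
```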
@ -375,7 +375,7 @@ I mention while loops briefly, because I hardly ever use them. They're most ofte
}
```
## For loops vs. functionals
For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it's possible to wrap up for loops in a function, and call that function instead of using the for loop directly.
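For example, here's a rough sketch of wrapping a for loop in a reusable function (a simplified cousin of the `col_summary()` mentioned below): it computes the mean of every column of a data frame.
```{r}
col_mean <- function(df) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[[i]] <- mean(df[[i]])
  }
  out
}
col_mean(mtcars)
```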
@ -529,7 +529,7 @@ There are a few differences between `map_*()` and `col_summary()`:
### Shortcuts
There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits up the `mtcars` dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:
```{r}
models <- mtcars %>%

View File

@ -85,10 +85,3 @@ ggplot(, aes(mse)) +
geom_histogram(binwidth = 0.25) +
geom_vline(xintercept = base_mse, colour = "red")
```
### Exercises
1. Given a list of formulas, use purrr to fit them.
1. Given a list of model functions and parameters, use purrr to fit them.

View File

@ -550,3 +550,538 @@ delays %>%
```
In this chapter we will explore model visualisation from two different sides:
1. Use a model to make it easier to see important patterns in our data.
1. Use visualisation to understand what a model is telling us about our data.
We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you understand what your model is telling you.
<!-- residuals vs. predictions -->
Our approach centres on looking at residuals and looking at predictions. You'll see those techniques applied here to linear models (and some minor variations), but they're flexible, since every model can generate predictions and residuals.
We'll attack the problem from two directions: building up from a simple model, and subtracting patterns off the full dataset.
The goal is to transition from implicit knowledge in your head and in the data to explicit knowledge in the model. In other words, you want to make your knowledge of the data explicit and capture it in a model. This makes it easier to apply to new domains, and easier for others to use. But you must always remember that your knowledge is incomplete. Subtract patterns from the data, and add patterns to the model.
<!-- purpose of modelling -->
What is a good model? We'll think about that more in the next chapter. For now, a good model captures the majority of the patterns that are generated by the underlying mechanism of interest, and captures few patterns that are not generated by that mechanism. Another way to frame that is that you want your model to be good at inference, not just description. Inference is one of the most important parts of a model - you want to make statements not just about the data you have observed, but also about data that you have not observed (like things that will happen in the future).
Focus on constructing models that help you better understand the data. This will generally lead to models that predict better. But you have to beware of overfitting the data - in the next section we'll discuss some formal methods. But a healthy dose of scepticism is also powerful: do you believe that a pattern you see in your sample is going to generalise to a wider population?
<!-- When do you stop? -->
For very large and complex datasets this is going to be a lot of work. There are certainly alternative approaches - a more machine learning approach is simply to focus on improving the predictive ability of the model, being careful to fairly assess it (i.e. not assessing the model on the data that was used to train it). These approaches tend to produce black boxes - i.e. the model does a really good job, but you don't know why. This is fine, but the main problem is that you can't apply your real world knowledge to the model to think about whether or not it's likely to work in the long-term, as fundamentals change. For most real models, I'd expect you to use some combination of this approach and an ML model building approach. If prediction is important, get to a good point, and then use visualisation to understand the most important parts of the model.
> A long time ago in art class, my teacher told me "An artist needs to know
> when a piece is done. You can't tweak something into perfection - wrap it up.
> If you don't like it, do it over again. Otherwise begin something new". Later
> in life, I heard "A poor seamstress makes many mistakes. A good seamstress
> works hard to correct those mistakes. A great seamstress isn't afraid to
> throw out the garment and start over."
-- Reddit user Broseidon241, https://www.reddit.com/r/datascience/comments/4irajq/mistakes_made_by_beginningaspiring_data_scientists/
In the next chapter, you'll also learn how to visualise the model-level summaries and the model parameters.
To do this we're going to use some helper functions from the modelr package. This package provides some wrappers around the traditional base R modelling functions that make them easier to use in data manipulation pipelines. Currently at <https://github.com/hadley/modelr> but will need to be on CRAN before the book is published.
```{r}
library(modelr)
options(
contrasts = c("contr.treatment", "contr.treatment"),
  na.action = na.exclude
)
```
In the course of modelling, you'll often discover data quality problems. Maybe a missing value is recorded as 999. Whenever you discover a problem like this, you'll need to review and update your import scripts. You'll often discover a problem with one variable, but you'll need to think about it for all variables. This is often frustrating, but it's typical.
## Residuals
To motivate the use of models we're going to start with an interesting pattern from the NYC flights dataset: the number of flights per day.
```{r}
library(nycflights13)
library(lubridate)
library(dplyr)
daily <- flights %>%
mutate(date = make_datetime(year, month, day)) %>%
group_by(date) %>%
summarise(n = n())
ggplot(daily, aes(date, n)) +
geom_line()
```
Understanding this pattern is challenging because there's a very strong day-of-week effect that dominates the subtler patterns:
```{r}
daily <- daily %>%
mutate(wday = wday(date, label = TRUE))
ggplot(daily, aes(wday, n)) +
geom_boxplot()
```
There are fewer flights on weekends because a very large proportion of travel is for business. You might sometimes have to leave on Sunday for an early flight, but it's very rare that you'd leave on Saturday: you'd much rather be home with your family.
One way to remove this strong pattern is to fit a model that "explains" (i.e. attempts to predict) the day of week effect, and then look at the residuals:
```{r}
mod <- lm(n ~ wday, data = daily)
daily <- daily %>% add_residuals(n_resid = mod)
daily %>%
ggplot(aes(date, n_resid)) +
geom_hline(yintercept = 0, size = 2, colour = "white") +
geom_line()
```
Note the change in the y-axis: now we are seeing the deviation from the expected number of flights, given the day of week. This plot is interesting because now that we've removed much of the large day-of-week effect, we can see some of the subtler patterns that remain:
1. Our day of week adjustment seems to fail starting around June: you can
still see a strong regular pattern that our model hasn't removed. Drawing
a plot with one line for each day of the week makes the cause easier
to see:
```{r}
ggplot(daily, aes(date, n_resid, colour = wday)) +
geom_hline(yintercept = 0, size = 2, colour = "white") +
geom_line()
```
The problem appears to be Saturdays: it seems like during summer there are
more flights on Saturdays than we expect, and during Fall there are fewer.
I suspect this is because of summer holidays: many people go on holiday
in the summer, and people don't mind travelling on Saturdays for vacation.
(This doesn't, however, explain why there are more Saturday flights in
spring than fall).
1. There are some days with far fewer flights than expected:
```{r}
daily %>% filter(n_resid < -100)
```
If you're familiar with American public holidays, you might spot New Year's
day, July 4th, Thanksgiving and Christmas. There are some others that don't
seem to correspond immediately to public holidays. You'll work on those
in the exercise below.
1. There seems to be some smoother long term trend over the course of a year.
We can highlight that trend with `geom_smooth()`:
```{r}
daily %>%
ggplot(aes(date, n_resid)) +
geom_hline(yintercept = 0, size = 2, colour = "white") +
geom_line(colour = "grey50") +
geom_smooth(se = FALSE, span = 0.20)
```
There are fewer flights in January (and December), and more in summer
(May-Sep). We can't do much more with this trend than brainstorm possible
explanations because we only have a single year's worth of data.
We'll tackle the day of week effect first. Let's zoom in on Saturdays, going back to raw numbers:
```{r}
daily %>%
filter(wday == "Sat") %>%
ggplot(aes(date, n)) +
geom_line() +
geom_point(alpha = 1/3) +
scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b")
```
So it looks like summer holidays are from early June to late August. That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break is Jun 26 - Sep 9. Few families travel in the fall because of the big Thanksgiving and Christmas holidays. So let's add a "term" variable to attempt to control for that.
```{r}
term <- function(date) {
cut(date,
breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)),
labels = c("spring", "summer", "fall")
)
}
daily <- daily %>% mutate(term = term(date))
daily %>%
filter(wday == "Sat") %>%
ggplot(aes(date, n, colour = term)) +
geom_point(alpha = 1/3) +
geom_line() +
scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b")
```
(I manually tweaked the dates to get nice breaks in the plot. Using a visualisation to help you understand what your function is doing is a really powerful and general technique.)
It's useful to see how this new variable affects the other days of the week:
```{r}
daily %>%
ggplot(aes(wday, n, colour = term)) +
geom_boxplot()
```
It looks like there is significant variation across the terms, so fitting a separate day of week effect for each term is reasonable. This improves our model, but not as much as we might hope:
```{r}
mod2 <- lm(n ~ wday * term, data = daily)
daily$n_resid2 <- resid(mod2)
ggplot(daily, aes(date)) +
geom_line(aes(y = n_resid, colour = "mod1")) +
geom_line(aes(y = n_resid2, colour = "mod2")) +
scale_colour_manual(values = c(mod1 = "grey50", mod2 = "black"))
```
That's because this model is basically calculating an average for each combination of wday and school term. We have a lot of big outliers, so they tend to drag the mean far away from the typical value.
```{r}
middles <- daily %>%
group_by(wday, term) %>%
summarise(
mean = mean(n),
median = median(n)
)
middles %>%
ggplot(aes(wday)) +
geom_linerange(aes(ymin = mean, ymax = median), colour = "grey70") +
geom_point(aes(y = mean, colour = "mean")) +
geom_point(aes(y = median, colour = "median")) +
facet_wrap(~ term)
```
We can reduce this problem by switching to a robust model fitted by `MASS::rlm()`. A robust model is a variation of the linear model which you can think of as fitting medians instead of means (it's a bit more complicated than that, but that's a reasonable intuition). This greatly reduces the impact of the outliers on our estimates, and gives a result that does a good job of removing the day of week pattern:
```{r, warning = FALSE}
mod3 <- MASS::rlm(n ~ wday * term, data = daily)
daily <- daily %>% add_residuals(n_resid3 = mod3)
ggplot(daily, aes(date, n_resid3)) +
geom_hline(yintercept = 0, size = 2, colour = "white") +
geom_line()
```
It's now much easier to see the long-term trend, and the positive and negative outliers.
It's very common to use residual plots when figuring out whether a model is OK. But it's easy to get the impression that there's just one type of residual plot you should make, when in fact there are infinitely many.
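For example, here are two more (sketched) views of the same residuals, beyond plotting them against the date (`fitted3` is just an illustrative name):
```{r}
# Residuals by day of week: are any days systematically mis-fit?
ggplot(daily, aes(wday, n_resid3)) +
  geom_boxplot()

# Residuals against the fitted values from the robust model
daily %>%
  mutate(fitted3 = predict(mod3)) %>%
  ggplot(aes(fitted3, n_resid3)) +
  geom_point()
```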
### Exercises
1. Use your google sleuthing skills to brainstorm why there were fewer than
expected flights on Jan 20, May 26, and Sep 9. (Hint: they all have the
same explanation.) How would these days generalise to another year?
1. What do the three days with high positive residuals represent?
How would these days generalise to another year?
```{r}
daily %>% filter(n_resid2 > 80)
```
1. Create a new variable that splits the `wday` variable into terms, but only
for Saturdays, i.e. it should have `Thurs`, `Fri`, but `Sat-summer`,
`Sat-spring`, `Sat-fall`. How does this model compare with the model with
every combination of `wday` and `term`?
1. Create a new wday variable that combines the day of week, term
(for Saturdays), and public holidays. What do the residuals of
that model look like?
1. What happens if you fit a day of week effect that varies by month?
Why is this not very helpful?
1. Above we made the hypothesis that people leaving on Sundays are more
likely to be business travellers who need to be somewhere on Monday.
Explore that hypothesis by seeing how it breaks down based on distance:
if it's true, you'd expect to see more Sunday flights to places that
are far away.
1. It's a little frustrating that Sunday and Saturday are on separate ends
of the plot. Write a small function to set the levels of the factor so
that the week starts on Monday.
## Predictions
Focus on predictions from a model because this works for any type of model. Visualising parameters can also be useful, but tends to be most useful when you have many similar models. Visualising predictions works regardless of the model family.
Visualising high-dimensional models is challenging. You'll need to partition off a useable slice at a time.
Let's start by exploring the difference between the `lm()` and `rlm()` predictions for the day of week effects. We'll first re-fit the models, just so we have them handy:
```{r}
mod1 <- lm(n ~ wday * term, data = daily)
mod2 <- MASS::rlm(n ~ wday * term, data = daily)
```
Next, we need to generate a grid of values to compute predictions for. The easiest way to do that is to use `tidyr::expand()`. Its first argument is a data frame, and for each subsequent argument it finds the unique variables and then generates all combinations:
```{r}
grid <-
daily %>%
tidyr::expand(wday, term)
grid
```
Next we add predictions. We'll use `modelr::add_predictions()`, which works in exactly the same way as `add_residuals()`, but just computes predictions (so it doesn't need a data frame that contains the response variable):
```{r}
grid <-
grid %>%
add_predictions(linear = mod1, robust = mod2)
grid
```
And then we plot the predictions. Plotting predictions is usually the hardest bit and you'll need to try a few times before you get a plot that is most informative. Depending on your model it's quite possible that you'll need multiple plots to fully convey what the model is telling you about the data. Here's my attempt - it took me a few tries before I got something that I was happy with.
```{r}
grid %>%
ggplot(aes(wday)) +
geom_linerange(aes(ymin = linear, ymax = robust), colour = "grey70") +
geom_point(aes(y = linear, colour = "linear")) +
geom_point(aes(y = robust, colour = "robust")) +
facet_wrap(~ term)
```
### Exercises
1. How does the plot of model coefficients compare to the plot of means
and medians computed "by hand" in the previous chapter? Create a plot
that highlights the differences and similarities.
## Generating prediction grids
### Continuous variables
When you have a continuous variable in the model, rather than using the unique values that you've seen, it's often more useful to generate an evenly spaced grid. One convenient way to do this is with `modelr::seq_range()`, which takes a continuous variable, calculates its range, and then generates evenly spaced points between the minimum and maximum.
```{r}
mod <- MASS::rlm(n ~ wday * yday(date), data = daily)
grid <- daily %>%
tidyr::expand(wday, date = seq_range(date, n = 13)) %>%
add_predictions(mod = mod)
ggplot(grid, aes(date, mod, colour = wday)) +
geom_line() +
geom_point()
```
(Why use `yday(date)` instead of `date`? That's saying we think that the pattern depends only on the day of the year, so we might expect it to be the same in other years. Again, since we only have a single year of data we can't test that hypothesis.)
We're going to be using this pattern for a few examples, so let's wrap it up into a function. This is a rather inflexible function, but it serves its purpose here. In my experience, it's challenging to write functions that apply to wide swathes of your modelling process because you tend to do a lot of experimentation and your models can differ quite radically in form, and hence in how you need to visualise them.
```{r}
vis_flights <- function(mod) {
daily %>%
tidyr::expand(wday, date = seq_range(date, n = 13)) %>%
add_predictions(mod = mod) %>%
ggplot(aes(date, mod, colour = wday)) +
geom_line() +
geom_point()
}
```
This is more useful if you have a model that includes non-linear components. You might wonder how you can use a linear model to handle non-linear patterns! But basically we're going to do something like a Taylor series expansion - we're going to approximate a non-linear pattern with a series of transformations. Technically, it's still "linear" in the linear model sense because each parameter multiplies an input variable. This is different from true non-linear models where the coefficients appear in other places, like $\alpha_1 + (\alpha_2 - \alpha_1) \exp(-\exp(\alpha_3) x)$, the asymptotic regression model.
One way to get that is to include non-linear terms like `I(x ^ 2)`, `I(x ^ 3)` etc. You can't just use `x ^ 2` because of the way the modelling algebra works: `x ^ 2` is equivalent to `x * x`, which in the modelling algebra is equivalent to `x + x + x:x`, which is the same as `x`. This is useful because `(x + y + z)^2` fits all major terms and second order interactions of x, y, and z.
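Here's a minimal sketch of that difference on a made-up data frame (`quad` and its columns are purely illustrative):
```{r}
quad <- data_frame(x = runif(50, -2, 2), y = 5 + 2 * x ^ 2 + rnorm(50))

# In the modelling algebra x ^ 2 collapses back to x, so this fits a straight line:
coef(lm(y ~ x ^ 2, data = quad))

# Wrapping the term in I() insulates it, giving a genuine quadratic term:
coef(lm(y ~ x + I(x ^ 2), data = quad))
```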
But rather than using this laborious formulation (which also doesn't work here because our `x` is a date, and you can't raise a date to a power), a better solution is to use `poly(x, n)`, which generates `n` polynomials. (They are Legendre polynomials, which means that each is uncorrelated with any of the previous ones, which also makes model fitting a bit easier.)
```{r}
MASS::rlm(n ~ wday * poly(yday(date), 5), data = daily) %>% vis_flights()
```
(Note the warning message that the model failed to converge - this typically suggests a model misspecification problem. We should be very cautious about interpreting this model. You'll explore the model more in the exercises.)
One problem with polynomials is that they have bad tail behaviour - outside of the range of the data they will rapidly shoot towards either positive or negative infinity. One solution to this is splines. Let's look at this with a simpler example:
```{r}
df <- data_frame(
x = rep(1:10, each = 3),
y = 30 - (x - 5) ^ 2 + rnorm(30, sd = 6)
)
ggplot(df, aes(x, y)) + geom_point()
```
Then we fit the models and predict both to the data.
```{r}
library(splines)
mod_poly <- lm(y ~ poly(x, 2), data = df)
mod_ns <- lm(y ~ ns(x, 2), data = df)
preds <- df %>%
expand(x = seq_range(x, 20)) %>%
add_predictions(poly = mod_poly, ns = mod_ns)
```
They look equally good in the range of the data:
```{r}
ggplot(df, aes(x, y)) +
geom_point() +
geom_line(data = preds, aes(y = poly, colour = "poly")) +
geom_line(data = preds, aes(y = ns, colour = "ns"))
```
But when we extend past the existing range of the data, we see:
```{r}
preds <- df %>%
expand(x = seq(-5, 15, length = 30)) %>%
add_predictions(poly = mod_poly, ns = mod_ns)
ggplot(df, aes(x, y)) +
geom_point() +
geom_line(data = preds, aes(y = poly, colour = "poly")) +
geom_line(data = preds, aes(y = ns, colour = "ns"))
```
The natural splines are designed to only linearly interpolate outside the range of the data. In general, this leads to somewhat safer behaviour, but in either case you should be cautious about extrapolating. These are not generative models - they're just pragmatic fits to patterns. We shouldn't expect them to do a good job outside the data we've observed.
In general, splines are a useful tool when you have patterns that aren't linear but that you don't have a good explicit model for. Interestingly, using a natural spline instead of poly also fixes our convergence problem, while yielding much the same results:
```{r}
library(splines)
MASS::rlm(n ~ wday * ns(date, 5), data = daily) %>% vis_flights()
```
In particular, we see the strong pattern in Saturdays that we identified when coming at the problem from the opposite direction. It's always a good sign when you see the same signal from multiple approaches. (But note our previous model was explanatory - this one is purely predictive.)
How many degrees of freedom should you use? Either pick manually to extract the shape of the data, or use one of the model assessment techniques in the following chapter to pick algorithmically. Here we're most interested in explanation, so picking by hand (with a little thought and plenty of scepticism) is typically fine.
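For example, a quick way to compare a few hand-picked choices is simply to refit and replot (this sketch reuses `vis_flights()` from above; the values 2, 5, and 10 are arbitrary):
```{r}
library(splines)
MASS::rlm(n ~ wday * ns(date, 2), data = daily) %>% vis_flights()
MASS::rlm(n ~ wday * ns(date, 5), data = daily) %>% vis_flights()
MASS::rlm(n ~ wday * ns(date, 10), data = daily) %>% vis_flights()
```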
Other useful arguments to `seq_range()`:
* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
nice to the human eye. This is useful if you want to produce tables of
output:
```{r}
seq_range(c(0.0123, 0.923423), n = 5)
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variable has a long-tailed distribution and you want to focus on generating
values near the center:
```{r}
x <- rcauchy(100)
seq_range(x, n = 5)
seq_range(x, n = 5, trim = 0.10)
seq_range(x, n = 5, trim = 0.25)
seq_range(x, n = 5, trim = 0.50)
```
### Computed variables
If you're experimenting with many models and many visualisations, it's a good idea to bundle the creation of variables up into a function so there's no chance of accidentally applying a different transformation in different places.
```{r}
compute_vars <- function(data) {
data %>% mutate(
term = term(date),
wday = wday(date, label = TRUE)
)
}
```
Another option is to wrap it into the model formula:
```{r}
wday2 <- function(x) wday(x, label = TRUE)
mod3 <- lm(n ~ wday2(date) * term(date), data = daily)
daily %>%
expand(date) %>%
add_predictions(pred = mod3)
```
I think this is fine to do provided that you've carefully checked that the functions do what you think they do (i.e. with a visualisation). There are two disadvantages:
1. You may need to add the variables back in anyway if you want to use
them in a visualisation.
1. When looking at the coefficients, their values are longer and harder to
read. (But this is a general problem with the way that linear models report
categorical coefficients in R, not a specific problem with this case.)
### Nested variables
Another case that occasionally crops up is nested variables: you have an identifier that is locally unique, not globally unique. For example you might have this data about students in schools:
```{r}
students <- tibble::frame_data(
~student_id, ~school_id,
1, 1,
2, 1,
1, 2,
1, 3,
2, 3,
3, 3
)
```
The student id only makes sense in the context of the school: it doesn't make sense to generate every combination of student and school. You can use `nesting()` for this case:
```{r}
students %>% expand(nesting(school_id, student_id))
```
### Interpolation vs. extrapolation
One danger with prediction plots is that it's easy to make predictions that are far away from the original data. This is dangerous because it's quite possible that the model (which is a simplification of reality) will no longer apply far away from observed values.
To help avoid this problem, it's good practice to include "nearby" observed data points in any prediction plot. These help you see if you're interpolating, making predictions "in between" existing data points, or extrapolating, making predictions about previously unobserved slices of the data.
One way to do this is to use `condvis::visualweight()`.
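A simpler sketch of the same idea (not using condvis) is to overlay the observed points on the prediction lines and shade the x-range the data actually covers, reusing `df` and `preds` from the splines example above:
```{r}
ggplot(df, aes(x, y)) +
  # shade the region covered by the observed data
  annotate("rect", xmin = min(df$x), xmax = max(df$x),
    ymin = -Inf, ymax = Inf, alpha = 0.1) +
  geom_point() +
  geom_line(data = preds, aes(y = poly, colour = "poly")) +
  geom_line(data = preds, aes(y = ns, colour = "ns"))
```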
### Exercises
1. In the use of `rlm` with `poly()`, the model didn't converge. Carefully
read the documentation for `rlm` and figure out which parameter controls
the number of iterations. Can you increase the iterations so that the
model converges? If so, how does the model differ from the model that
didn't converge?
## Case study: predicting flight delays
We'll finish off with a somewhat more realistic case study, where we combine the techniques of visualising predictions and residuals to attack the problem of predicting flight delays.
We can't predict delays for next year. Why not? Instead we'll focus on predicting the amount that your flight will be delayed if it's leaving soon.
We'll start with some exploratory analysis, and then work on the model:
* time of day
* weather
```{r}
delays <- flights %>%
mutate(date = make_datetime(year, month, day)) %>%
group_by(date) %>%
summarise(delay = mean(arr_delay, na.rm = TRUE), cancelled = mean(is.na(dep_time)), n = n())
# delays %>%
# ggplot(aes(wday(date, label = TRUE), delay)) +
# geom_boxplot()
delays %>%
ggplot(aes(n, delay)) +
geom_point() +
geom_smooth(se = FALSE)
```

View File

@ -1,3 +1,7 @@
# Reproducible Research with R Markdown
...
## What is R Markdown?
## Using R Markdown in the RStudio IDE
## Write text
## Embed code
## Set Parameters

View File

@ -815,7 +815,7 @@ There are a few other functions in base R that accept regular expressions:
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, that have been carefully picked to handle the most common string manipulation functions. stringi on the other hand is designed to be comprehensive. It contains almost every function you might ever need. stringi has `r length(ls(getNamespace("stringi")))` functions to stringr's `r length(ls("package:stringr"))`.
So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The use of the two packages is very similar because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs. `stri_`.
### Encoding

View File

@ -681,7 +681,7 @@ ggplot(delays, aes(n, delay)) +
geom_point()
```
Not surprisingly, there is much more variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or many other summaries) vs. the number of observations, you'll see that the variation decreases as the sample size increases.
When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, and also shows you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from `%>%` to `+`, but once you get the hang of it, it's quite convenient.
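The code referred to here isn't shown in this hunk, but the pattern looks roughly like this (the cutoff of 25 flights is an arbitrary choice, and `delays` is the summary plotted above):
```{r}
delays %>%
  filter(n > 25) %>%            # drop the groups with very few observations
  ggplot(aes(n, delay)) +       # note the switch from %>% to +
    geom_point()
```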

View File

@ -1,41 +1,38 @@
# Exploratory Data Analysis (EDA)
```{r include = FALSE}
library(ggplot2)
library(dplyr)
library(broom)
knitr::opts_chunk$set(fig.height = 2)
```
If you are like most humans, your brain isn't built to process tables of raw data. You can understand your raw data better if you first visualize it or transform it. This chapter will show you the best ways to visualize and transform your data to make discoveries, a process known as Exploratory Data Analysis (EDA).
## The challenge of data
This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call Exploratory Data Analysis, or EDA for short. EDA involves iteratively
1. forming questions about your data
2. searching for answers by visualizing and transforming your data
3. using what you discover to refine your questions about the data, or to choose new questions to investigate
The human working memory can only attend to a few values at a time. This makes it difficult to discover patterns in raw data because patterns involve many values. To discover even a simple pattern, you must consider many values _at the same time_, which is difficult to do. For example, a simple pattern exists between $X$ and $Y$ in the table below, but it is very difficult to spot.
```{r data, echo=FALSE}
x <- rep(seq(0.2, 1.8, length = 5), 2) + runif(10, -0.15, 0.15)
X <- c(0.02, x, 1.94)
Y <- sqrt(1 - (X - 1)^2)
Y[1:6] <- -1 * Y[1:6]
Y <- Y - 1
order <- sample(1:10)
knitr::kable(round(data.frame(X = X[order], Y = Y[order]), 2))
```
While your mind may stumble over raw data, you can easily process visual information. Within your mind is a visual processing system that has been fine-tuned by thousands of years of evolution. As a result, the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values above fall on a circle.
```{r echo=FALSE}
ggplot2::qplot(X, Y) + ggplot2::coord_fixed(ylim = c(-2.5, 0.5), xlim = c(-0.5, 2.5))
```
Visualization works because it bypasses the bottleneck in your working memory. Your brain processes visual information in a different (and much wider) channel than it processes symbolic information, like words and numbers. However, visualization is not the only way to comprehend data.
You can also comprehend your data if you reduce it to a small set of summary values. Your working memory can easily attend to just a few values, which lets you absorb important information about the data. This is why it feels natural to work with things like averages, e.g. how tall is the average basketball player? How educated is the average politician? An average is a single number that you can attend to. Although averages are quite popular, you can also compare data sets on other summary values, such as maximums, minimums, medians, and so on. Another way to summarize your data is to replace it with a model, a function that describes the relationship between two or more variables.
These two tactics, visualizing and summarizing your data, are the main tools of Exploratory Data Analysis. Before we begin to use the tools, let's consider what types of information you can hope to find in your data.
There is no formal way to do Exploratory Data Analysis because you must be free to investigate every idea that occurs to you. However, some tactics will reliably lead to insights. This chapter will teach you a basic toolkit of these useful EDA techniques. Our discussion will lead to a model of data science itself, the model that I've built this book around.
## Questions
> "There are no routine statistical questions, only questionable statistical routines."---Sir David Cox
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."---John Tukey
Your goal during EDA is to develop a complete understanding of your data set and the information that it contains. The easiest way to do this is to ask questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your data set and helps you decide which graphs or models to make.
During EDA, the _quantity_ of questions that you ask matters more than the quality of the questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your data set. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as
1. What type of **variation** occurs **within** my variables? and
2. What type of **covariation** occurs **between** my variables?
The rest of this chapter will look at these two questions. I'll explain what variation and covariation are, and I'll show you several ways to answer each question. Data carries two types of useful information: information about _variation_ and information about _covariation_. These concepts will be easier to describe if we first define some basic terms:
* A _variable_ is a quantity, quality, or property that you can measure.
@ -43,9 +40,13 @@ Data carries two types of useful information: information about _variation_ and
* An _observation_ is a set of measurements that you make under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I'll sometimes refer to an observation as a data point.
* _tabular data_ is a set of values, each associated with a variable and an observation. Tabular data is _tidy_ if each value is placed in its own "cell", each variable in its own column, and each observation in its own row. Throughout the rest of this chapter I will use the word data to mean tidy tabular data.
## Variation
> "What type of variation occurs within my variables?"
**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice---and precisely enough, you will get two different results. This is true even if you measure quantities that should be constant, like the speed of light (below). Each of your measurements will include a small amount of error that varies from measurement to measurement.
```{r, variation, echo = FALSE}
@ -54,398 +55,539 @@ mat <- as.data.frame(matrix(morley$Speed + 299000, ncol = 10))
knitr::kable(mat, caption = "*The speed of light is a universal constant, but variation due to measurement error obscures its value. In 1879, Albert Michelson measured the speed of light 100 times and observed 30 different values (in km/sec).*", col.names = rep("", ncol(mat)))
```
Discrete and categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Variation is a source of uncertainty in data science. Since values vary from measurement to measurement, you cannot assume that what you measure in one context will be true in another context. However, variation can also be a tool. Every variable exhibits a pattern of variation. If you comprehend the pattern, you can determine which values of the variable are likely to occur, which are unlikely to occur, and which are impossible. You can also use the pattern to quickly spot outliers, data points that behave differently from other observations of a variable, and clusters, groups of data points that share similar values.
## Covariation
The second type of information contained in data is information about covariation. _Covariation_ occurs when two or more variables vary together in a systematic way.
You can understand covariation if you picture the growth charts that doctors use with young children (below). The ages and heights of children covary since a child is likely to be born small and to grow taller over time. As a result, you can expect a large value of age to be accompanied by a large value of height and a small value of age to be accompanied by a small value of height. In fact, the covariation between age and height is so regular that a doctor can tell if something has gone wrong by comparing a child's height to his or her age.
Systems of covariation can be very complex. Multiple variables can covary together, as do age, height, weight, bone density, etc., and two variables can covary in an inverse relationship, as do unemployment and presidential approval ratings (presidential approval ratings are reliably low at times when unemployment is high, and vice versa). Covariation can also occur between categorical variables. In that case observations in one category will be linked to certain values more often than observations in other categories.
Covariation provides a way to reduce the uncertainty created by variation. You can make an accurate guess about the value of an unobserved variable if you observe the values of variables that it covaries with. The covariation that occurs between variables will also occur between the values of those variables whenever the values belong in the same observation.
Covariation also provides clues about causal relationships. If one variable causes the value of a second, the two variables will covary. The inverse does not logically follow; i.e. if two variables covary, one does not necessarily cause the other. However, you can use patterns of covariation to direct your attention as you search for causal relationships.
Now that you have a sense of what to look for in data, how do you find it?
## Exploratory Data Analysis
Exploratory Data Analysis is a loosely defined task that deepens your understanding of a data set. EDA involves iteratively
* forming questions about your data
* searching for answers by visualizing and summarizing your data
* refining your questions about the data, or choosing new questions to investigate, based on what you discover
There is no formal way to do Exploratory Data Analysis. You must be free to investigate every insight that occurs to you. The remainder of the chapter will teach you ways to visualize and summarise your data that you can use in any part of your exploration. I've organized these methods into two groups: methods for
1. Understanding variation. These methods elucidate the question, "What type of uncertainty exists in the processes that my data describe?"
2. Understanding covariation. These methods elucidate the question, "How can the data help me reduce the uncertainty that exists in the processes that my data describe?"
As you use these methods, it is important to keep an eye out for any signals that can lead to better questions about your data, and then scrutinize them. Things like outliers, missing values, gaps in your data coverage, and patterns can all tip you off to important aspects of your data set. Often the discoveries that you make during Exploratory Data Analysis will be the most valuable results of your data analysis. Many useful scientific discoveries, like the discovery of the hole in the ozone layer, were made by simply exploring data.
## Understanding variation - Distributions
The most useful tool for understanding the variation associated with a variable is the variable's _empirical distribution_, the pattern of values that emerges as you take many measurements of the variable.
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the values that you have observed for the variable.
### Visualizing distributions
How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values, like the cut rating of a diamond. In R, categorical variables are usually saved as factors, integers, or character strings. To examine the distribution of a categorical variable, use a bar chart.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
By default, `geom_bar()` counts the number of observations that are associated with each value of a variable, and it displays the results as a series of bars. You do not need to provide a $y$ aesthetic to `geom_bar()`.
The height of each bar in a bar graph reveals the number of observations that are associated with the $x$ value of the bar. You can use the heights to estimate the frequency that different values will appear as you measure the variable. $x$ values that have a tall bar occur often, and $x$ values with small bars occur rarely.
***
*Tip*: The height of the bars displays how many observations occurred at each $x$ value. You can compute these values quickly with R's `table()` function; these are the numbers that `geom_bar()` visualizes.
```{r}
table(diamonds$cut)
```
***
To compare the distributions of different subgroups in a bar chart,
1. set the fill and grouping aesthetics to a grouping variable
2. set the position adjustment to "dodge" for side by side bars
3. set $y$ to `..prop..` to compare proportions instead of raw counts (as raw counts will depend on group sizes which may differ)
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = carat > 1, fill = carat > 1), position = "dodge")
```
What if your variable is not categorical but continuous? A variable is **continuous** if you can arrange its values in order _and_ an infinite number of unique values can exist between any two values of the variable. Numbers and date-times are two examples of continuous variables.
The strategy of counting the number of observations at each value breaks down for continuous data, because if your data is truly continuous, then no two observations will have the same value.
To get around this, you can divide the range of a continuous variable into equally spaced intervals, a process called _binning_.
```{r, echo = FALSE}
# knitr::include_graphics("images/visualization-17.png")
```
Then count how many observations fall into each bin.
```{r, echo = FALSE}
# knitr::include_graphics("images/visualization-18.png")
```
And display the count as a bar.
```{r, echo = FALSE}
# knitr::include_graphics("images/visualization-19.png")
```
The result is called a *histogram*. The height of each bar reveals how many observations fall within the width of the bar. Tall bars reveal common values of the variable, short bars reveal uncommon values, and the absence of bars suggests rare or impossible values. As with bar charts, you can use this information to estimate the probabilities that different values will appear in future measurements of the variable.
To make a histogram of a continuous variable, like the carat size of a diamond, use `geom_histogram()`. As with `geom_bar()`, you do not need to supply a $y$ variable.
A variable is **continuous** if you can arrange its values in order _and_ an infinite number of unique values can exist between any two values of the variable. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram.
```{r message = FALSE}
ggplot(data = diamonds) +
geom_histogram(aes(x = carat), binwidth = 0.5)
```
Binning is a temperamental process because the appearance of a distribution can change dramatically if the bin size changes. As no bin size is "correct," you should explore several bin sizes when examining data.
A histogram divides the x axis into equally spaced intervals and then uses a bar to display how many observations fall into each interval. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.
By default, `geom_histogram()` will divide the range of your variable into 30 equal length bins. The quickest way to change this behavior is to set the binwidth argument.
```{r message = FALSE}
ggplot(data = diamonds) +
geom_histogram(aes(x = carat), binwidth = 1)
```
Different binwidths reveal different information. For example, the plot above shows that the availability of diamonds decreases quickly as carat size increases. The plot below shows that there are more diamonds than you would expect at whole carat sizes (and common fractions of carat sizes). Moreover, for each popular size, there are more diamonds that are slightly larger than the size than there are diamonds that are slightly smaller than the size.
For example, here is how the same data looks with a binwidth of 0.01.
```{r message = FALSE}
ggplot(data = diamonds) +
geom_histogram(aes(x = carat), binwidth = 0.01)
```
Histograms give you a quick sense of the variation in your variable. Often you can immediately tell what the typical value of a variable is and what range of values you can expect (the wider the range, the more uncertainty you will encounter when making predictions about the variable). However, histograms have a downside.
It is difficult to compare multiple histograms. The solid bars of a histogram will occlude other histograms when you arrange them in layers. You could stack histograms on top of each other, but this invites error because you cannot compare the bars against a common baseline.
If you wish to overlay multiple distributions in the same plot, I recommend using `geom_freqpoly()` or `geom_density()` instead of `geom_histogram()`. `geom_freqpoly()` makes a frequency polygon, a line that connects the tops of the bars that would appear in a histogram. Like `geom_histogram()`, `geom_freqpoly()` accepts a `binwidth` argument.
`geom_density()` plots a one dimensional kernel density estimate of a variable's distribution. The result is a smooth version of the information contained in a histogram or a frequency polygon. You can control the smoothness of the density with `adjust`. `geom_density()` displays _density_---not count---on the y axis; the area under each curve is normalized to one, no matter how many total observations occur in the subgroup, which makes it easier to compare subgroups.
```{r message = FALSE, fig.show='hold', fig.width=3}
zoom <- coord_cartesian(xlim = c(55, 70))

ggplot(data = diamonds) +
  geom_freqpoly(aes(x = depth, color = cut), binwidth = 0.2) +
  zoom

ggplot(data = diamonds) +
  geom_density(aes(x = depth, color = cut), adjust = 3) +
  zoom
```
To see how a frequency polygon corresponds to a histogram, compare the two plots below, which display the same distribution of carat sizes.
```{r message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(data = diamonds) +
  geom_freqpoly(aes(x = carat))

ggplot(data = diamonds) +
  geom_histogram(aes(x = carat))
```
And here is the kernel density estimate of the same distribution.
```{r}
ggplot(data = diamonds) +
geom_density(aes(x = carat))
```
Because `geom_density()` displays _density_ rather than count on the y axis, the area under each curve is normalized to one, which makes it easy to compare the shape of the distributions of multiple subgroups. To achieve the same effect with `geom_freqpoly()`, set the $y$ variable to `..density..`.
```{r message = FALSE, fig.show='hold', fig.width=4, fig.height=4}
ggplot(data = diamonds) +
geom_freqpoly(aes(x = price, y = ..density.., color = cut ))
ggplot(data = diamonds) +
geom_density(aes(x = price, color = cut))
```
`geom_density()` does not use the binwidth argument. You can control the smoothness of the density with `adjust`, and you can select the kernel used to estimate the density with `kernel`. Set `kernel` to one of "gaussian" (the default), "epanechnikov", "rectangular", "triangular", "biweight", "cosine", or "optcosine".
```{r}
ggplot(data = diamonds) +
geom_density(aes(x = carat, color = cut), kernel = "gaussian", adjust = 4)
```
### Summarizing distributions
You can also make sense of a distribution by reducing it to a few summary statistics, numbers that summarize important information about the distribution. Summary statistics have an advantage over plots and raw data; it is easy to talk and write about summary statistics.
Two types of summary statistics are more useful than the rest:
* statistics that describe the typical value of a variable. These include the measures of location from Chapter 2:
+ `mean()` - the average value of a variable
+ `median()` - the 50th percentile value of a variable
* statistics that describe the range of a variable's values. These include the measures of spread from Chapter 2:
+ `sd()` - the standard deviation of a variable's distribution, which measures how far a typical value lies from the center of the distribution.
+ `var()` - the variance of the distribution, which is the standard deviation squared.
+ `IQR()` - the interquartile range of the distribution, which is the distance between the 25th and 75th percentiles of a distribution.
+ `range()` - the minimum and maximum value of a distribution; use with `diff(range())` to compute the distance between the minimum and maximum values (a quick sketch follows below).
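For example, a quick sketch of the spread measures applied to diamond prices:
```{r}
# interquartile range and full range of diamond prices
IQR(diamonds$price)
diff(range(diamonds$price))
```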
Use dplyr's `summarise()` function to calculate any of these statistics for a variable.
```{r}
diamonds %>%
  summarise(mean = mean(price, na.rm = TRUE),
            sd = sd(price, na.rm = TRUE))
```
Combine `summarise()` with `group_by()` to calculate the statistics on a groupwise basis.
```{r}
diamonds %>%
  group_by(cut) %>%
  summarise(mean = mean(price, na.rm = TRUE),
            sd = sd(price, na.rm = TRUE))
```
### Asking questions about variation
Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
* *Typical values*
In both bar charts and histograms, tall bars reveal common values of a variable. Shorter bars reveal less common or rare values. Places that do not have bars reveal seemingly impossible values. To turn this information into a useful question, look for anything unexpected:
+ Which values are the most common? Why might that be?
+ Which values are the most rare? Why might that be?
+ Is there an unusual pattern in the distribution? Why might that be?
+ Do the typical values change if you look at subgroups of the data?
As an example, the histogram below suggests several interesting questions: Why are there more diamonds at whole carats and common fractions of carats? Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
```{r echo = FALSE, message = FALSE, warning = FALSE, fig.height = 2}
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01) + xlim(0, 3)
```
* *Range of values*
The range of values, or spread, of the distribution reveals how certain you can be when you make predictions about a variable. If the variable only takes a narrow set of values, like below, you are unlikely to be far off if you make a prediction about a future observation. Even if the observation takes a value at the distant extreme of the distribution, the value will not be far from your guess.
```{r echo = FALSE, message = FALSE, fig.height = 2}
mpg$hwy2 <- mpg$hwy / 10 + 22
ggplot(mpg) + geom_histogram(aes(x = hwy2), binwidth = 1) + xlim(10, 45)
```
If the variable takes on a wide set of values, like below, the possibility that your guess will be far off the mark is much greater. The extreme possibilities are farther away.
```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(mpg) + geom_histogram(aes(x = hwy), binwidth = 1) + xlim(10, 45)
```
As a quick rule, wide distributions imply less certainty when making predictions about a variable; narrow distributions imply more certainty. A distribution with only a single repeated value implies complete certainty: your variable is a constant. Ask yourself
+ Do your data show a surprising amount of certainty or uncertainty? Why?
+ Does the range of the distribution change if you look at individual subgroups of the data?
* *Outliers*
Outliers are data points that do not seem to fit the overall pattern of variation, like the diamond on the far right of the histogram below. This diamond has a y dimension of `r diamonds$y[which(diamonds$y > 50)]` mm, which is much larger than the other diamonds.
```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(diamonds[24000:24500, ]) + geom_histogram(aes(x = y), binwidth = 0.25)
```
An outlier is a signal that something unique happened to the observation. Whenever you spot an outlier, ask yourself
+ What can explain the unusual value?
If you can figure out what happened, a discovery might follow. In the case above, the unique event was a measurement error.
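A quick way to isolate a point like this for inspection is to filter for it (a small sketch with dplyr):
```{r}
# pull out the diamond with the implausible y dimension
diamonds %>%
  filter(y > 50) %>%
  select(carat, cut, x, y, z)
```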
* *Clusters*
Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
+ How are the observations within each cluster similar to each other?
+ How are the observations in separate clusters different from each other?
+ How can you explain or describe the clusters?
+ Why might the appearance of clusters be misleading?
The histogram below displays two distinct clusters. It shows the length in minutes of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park; Old Faithful appears to oscillate between short and long eruptions.
```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(faithful) + geom_histogram(aes(x = eruptions))
```
Many of the questions above will prompt you to explore a relationship *between* variables, for example, to see if the values of one variable can explain the behavior of another variable. Questions about relationships are examples of the second general question that I proposed for EDA. Let's look at that question now.
## Covariation
> "What type of covariation occurs between my variables?"
Distributions provide useful information about variables, but the information is general. By itself, a distribution cannot tell you how the value of a variable in one set of circumstances will differ from the value of the same variable in a different set of circumstances. _Covariation_ can provide more specific information.
If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a correlated way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.
To see how covariation works, consider two variables: the $volume$ of an object and its $temperature$. If the $volume$ of the object usually increases when the $temperature$ of the object increases, then you could use the value of $temperature$ to help predict the value of $volume$.
You've probably heard that "correlation (covariation) does not prove causation." This is true: two variables can covary without one causing the other. However, covariation is often the first clue that two variables have a causal relationship.
How you look for covariation will depend on the structural relationship between the two variables.
### Visualizing functions between two variables
The simplest structure occurs when two continuous variables have a functional relationship, where each value of one variable corresponds to a single value of the second variable. In this scenario, covariation will appear as a pattern in the relationship. If two variables do not covary, their functional relationship will look like a random walk.
The variables `date` and `unemploy` in the `economics` data set have a functional relationship. The `economics` data set comes with `ggplot2` and contains various economic indicators for the United States between 1967 and 2007. The `unemploy` variable measures the number of unemployed individuals in the United States in thousands.
A scatterplot of the data reveals the functional relationship between `date` and `unemploy`.
```{r}
ggplot(data = economics) +
geom_point(aes(x = date, y = unemploy))
```
`geom_line()` makes the relationship clear. `geom_line()` creates a line chart, one of the most used---and most efficient---devices for visualizing a function.
```{r}
ggplot(data = economics) +
geom_line(aes(x = date, y = unemploy))
```
Use `geom_step()` to turn a line chart into a step function. Here, the result will be easier to see with a subset of data.
```{r}
ggplot(data = economics[1:150, ]) +
geom_step(aes(x = date, y = unemploy))
```
Control the step direction by giving `geom_step()` a direction argument. `direction = "hv"` will make stairs that move horizontally then vertically to connect points. `direction = "vh"` will do the opposite.
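Here is a quick sketch of the alternative direction, using the same subset of `economics`:
```{r}
# direction = "vh" draws the vertical segment before the horizontal one
ggplot(data = economics[1:150, ]) +
  geom_step(aes(x = date, y = unemploy), direction = "vh")
```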
`geom_area()` creates a line chart with a filled area under the line.
```{r}
ggplot(data = economics) +
geom_area(aes(x = date, y = unemploy))
```
### Visualizing correlations between two variables
Many variables do not have a functional relationship. As a result, a single value of one variable can correspond to multiple values of another variable.
Height and weight are two variables that are often related, but do not have a functional relationship. You could examine a classroom of students and notice that three different students, with three different weights all have the same height, 5'4". In this case, there is not a one to one relationship between height and weight.
The easiest way to plot the relationship between two variables is with a scatterplot, i.e. `geom_point()`. If the variables covary, a pattern will appear in the points. If they do not, the points will look like a random cloud of points.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
The jitter adjustment is so useful for scatterplots that `ggplot2` provides the `geom_jitter()`, which is identical to `geom_point()` but comes with `position = "jitter"` by default.
```{r}
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy))
```
### Visualizing two categorical variables
Visualize covariation between categorical variables with `geom_count()`, which displays the joint distribution of two discrete variables.
```{r}
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
```
The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values. As with bar charts, you can calculate the specific values with `table()`.
```{r}
table(diamonds$color, diamonds$cut)
```
### Visualizing one categorical variable and one continuous variable
Visualize covariation between continuous and categorical variables with boxplots. A **boxplot** is a type of visual shorthand for a distribution that is popular among statisticians. Each boxplot consists of:
* A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the Inter-Quartile Range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not it is symmetric about the median or skewed to one side.
* Points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points have a strong chance of being outliers, so they are included in the boxplot for inspection.
* A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-boxplot.pdf")
```
The chart below shows several boxplots, one for each level of the class variable in the `mpg` data set. Each boxplot represents the distribution of `hwy` values for points with the given level of `class`. To make boxplots, use `geom_boxplot()`.
```{r}
ggplot(data = mpg) +
  geom_boxplot(aes(x = class, y = hwy))
```
Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median `hwy` value of each group.
```{r}
ggplot(data = mpg) +
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))
```
`geom_boxplot()` works best when the categorical variable is on the x axis. You can invert the axes with `coord_flip()`.
```{r}
ggplot(data = mpg) +
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
```
If you wish to add more information to your boxplots, use `geom_violin()`. In a violin plot, the width of the "box" displays a kernel density estimate of the shape of the distribution.
```{r}
ggplot(data = mpg) +
  geom_violin(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
```
Use `geom_rug()` to visualize the distribution of each variable in a scatterplot. `geom_rug()` adds a tickmark along each axis for each value observed in the data. `geom_rug()` works best as a second layer in the plot (see Section 3 for more info on layers).
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_rug(mapping = aes(x = displ, y = hwy), position = "jitter")
```
Use the `sides` argument to control which axes to place a "rug" on (see the sketch after this list).
* `sides = "bl"` - (default) Places a rug on each axis
* `sides = "b"` - Places a rug on the bottom axis
* `sides = "l"` - Places a rug on the left axis
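For example, a minimal sketch that draws a rug along the bottom axis only, so the tick marks summarize just the $x$ variable:
```{r}
# sides = "b" limits the rug to the bottom (x) axis
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_rug(mapping = aes(x = displ), sides = "b")
```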
### Visualizing two continuous variables
Visualize covariation between two continuous variables with a scatterplot, i.e. `geom_point()`. Covariation will appear as a structure or pattern in the data points. For example, an exponential relationship seems to exist between the carat size and price of a diamond.
```{r}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
```
Scatterplots become less useful as the size of your data set grows, because points begin to pile up into areas of uniform black (as above) and occlude each other. When that happens, you cannot tell where the mass of the data lies: does a black region contain a single layer of points, or hundreds of points stacked on top of each other? The `diamonds` data set contains just 53,940 points, but they already overplot each other in a way that you cannot fix with jittering. For large data, it is more useful to plot summary information that describes the raw data than it is to plot the raw data itself. You can make patterns clear again with `geom_bin2d()`, `geom_hex()`, or `geom_density2d()`.
`geom_bin2d()` and `geom_hex()` divide the coordinate plane into two dimensional bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.
```{r fig.show='hold', fig.width=3}
ggplot(data = diamonds) +
geom_bin2d(aes(x = carat, y = price))
# install.packages("hexbin")
ggplot(data = diamonds) +
geom_hex(aes(x = carat, y = price))
```
`geom_density2d()` fits a 2D kernel density estimate to the data and then uses contour lines to highlight areas of high density. It is very useful for overlaying on raw data when your data set is not big.
```{r}
ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  geom_density2d()
```
The simplest way to summarise covariation between two variables is with a model line. The model line displays the trend of the relationship between the variables. Use `geom_smooth()` to display a model line between any two variables. As with `geom_rug()`, `geom_smooth()` works well as a second layer for a plot (see Section 3 for details).
```{r}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  geom_smooth(mapping = aes(x = carat, y = price))
```
`geom_smooth()` will add a loess line to your data if the data contains less than 1000 points; otherwise it will fit a general additive model to your data with a cubic regression spline and plot the resulting model line. In either case, `geom_smooth()` will display a message in the console to tell you what it is doing. This is not a warning message; you do not need to worry when you see it.
`geom_smooth()` will also plot a standard error band around the model line. You can remove the standard error band by setting the `se` argument of `geom_smooth()` to `FALSE`.
### Asking questions about covariation
When you explore plots of covariation, look for the following sources of insight:
* *Outliers*
Two dimensional plots can reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of $x$ and $y$ values, which makes the points outliers even though their $x$ and $y$ values appear normal when examined separately.
```{r echo = FALSE}
ggplot(data = diamonds) +
geom_point(aes(x = x, y = y)) +
coord_cartesian(xlim = c(3, 12), ylim = c(3, 12))
```
* *Clusters*
Plots of covariation can also reveal clusters that may not be visible in plots of variation. For example, the two dimensional pattern in the plot below reveals two clusters, a separation that is not visible in the distribution of either variable by itself, as verified with a rug geom.
```{r echo = FALSE}
ggplot(data = iris, aes(y = Sepal.Length, x = Sepal.Width)) +
geom_jitter() +
geom_density2d(h= c(1,1)) +
geom_rug(position = "jitter")
```
* *Patterns*
Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
+ Could this pattern be due to coincidence (i.e. random chance)?
+ How can you describe the relationship implied by the pattern?
+ How strong is the relationship implied by the pattern?
+ What other variables might affect the relationship?
+ Does the relationship change if you look at individual subgroups of the data?
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also reveals the two clusters that we noticed above.
```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting))
```
Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
### Visualizing three or more variables
In general, outliers, clusters, and patterns become easier to spot as you look at the interaction of more and more variables. However, as you include more variables in your plot, data becomes harder to visualize.
You can extend scatterplots into three dimensions with the plotly, rgl, rglwidget, and threejs packages (among others). Each creates a "three dimensional" graph that you can rotate with your mouse. Below is an example from plotly, displayed as a static image.
```{r eval = FALSE}
library(plotly)
plot_ly(data = iris, x = Sepal.Length, y = Sepal.Width, z = Petal.Width, color = Species, type = "scatter3d", mode = "markers")
```
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-plotly.png")
```
You can extend this approach into n-dimensional hyperspace with the ggobi package, but you will soon notice a weakness of multidimensional graphs. You can only visualize multidimensional space by projecting it onto your two dimensional retinas. In the case of 3D graphics, you can combine 2D projections with rotation to create an intuitive illusion of 3D space, but the illusion ceases to be intuitive as soon as you add a fourth dimension.
This doesn't mean that you should ignore complex interactions in your data. You can explore multivariate relationships in several ways. You can
* visualize each combination of variables in a multivariate relationship, two at a time
* use aesthetics and facetting to add additional variables to a 2D plot (see the sketch after this list)
* use a clustering algorithm to spot clusters in multivariate space
* use a modeling algorithm to spot patterns and outliers in multivariate space
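For example, here is a minimal sketch of the second option above, which folds a third and a fourth variable into a scatterplot with an aesthetic and facets:
```{r}
# color adds a third variable (class); facets add a fourth (cyl)
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_wrap(~ cyl)
```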
## Clusters
A clustering algorithm computes the distances between data points in n-dimensional space. It then groups points into clusters based on how near or far they are from each other. Base R provides two easy to use clustering algorithms: hierarchical clustering and k-means clustering.
### Hierarchical clustering
The hierarchical clustering algorithm groups points together based on how near they are to each other in n-dimensional space. The algorithm proceeds in stages until every point has been grouped into a single cluster, the data set. You can visualize the results of the algorithm as a dendrogram, and you can use the dendrogram to divide your data into any number of clusters.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-plotly.png")
```
You can only apply hierarchical clustering to numeric data, so begin by selecting the numeric columns from your data set. Then apply the `dist()` function to the data and pass the results to `hclust()`. `dist()` computes the distances between your points in the n dimensional space defined by your numeric vectors. `hclust()` performs the clustering algorithm.
```{r}
iris_hclust <- iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  dist() %>%
  hclust(method = "complete")
```
Use the `method` argument of `geom_smooth()` to add a specific type of model line to your data. `method` takes the name of an R modeling function. `geom_smooth()` will use the function to calculate the model line. For example, the code below uses R's `lm()` function to fit a linear model line to the data.
```{r}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  geom_smooth(mapping = aes(x = carat, y = price), method = lm)
```
Use `plot()` to visualize the results as a dendrogram. Each observation in the data set will appear at the bottom of the dendrogram labeled by its rowname. You can use the labels argument to set the labels to something more informative.
```{r fig.height = 4}
plot(iris_hclust, labels = iris$Species)
```
To see how near two data points are to each other, trace the paths of the data points up through the tree until they intersect. The y value of the intersection displays how far apart the points are in n-dimensional space. Points that are close to each other will intersect at a small y value, points that are far from each other will intersect at a large y value. Groups of points that are near each other will look like "leaves" that all grow on the same "branch."
The ordering of the x axis in the dendrogram is somewhat arbitrary (think of the tree as a mobile, each horizontal branch can spin around meaninglessly).
Use the `identify()` function to see easily which group of points is downstream from a branch. `identify()` will plot the dendrogram in an interactive format. When you click on a branch, R will draw a red rectangle around the downstream points. Click escape when you are finished.
```{r eval = FALSE}
identify(iris_hclust)
```
You can split your data into any number of clusters by drawing a horizontal line across the tree. Each vertical branch that the line crosses will represent a cluster that contains all of the points downstream from the branch. Move the line up the y axis to intersect fewer branches (and create fewer clusters); move the line down the y axis to intersect more branches (and create more clusters).
`cutree()` provides a useful way to split data points into clusters. Give `cutree()` the output of `hclust()` as well as the number of clusters that you want to split the data into. `cutree()` will return a vector of cluster labels for your data set. To visualize the results, map the output of `cutree()` to an aesthetic.
```{r}
(clusters <- cutree(iris_hclust, 3))

ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = factor(clusters), shape = Species))
```
By default, `geom_smooth()` will use the formula `y ~ x` to model your data. You can modify this formula by setting the `formula` argument to a different formula. If you do this, be sure to refer to the variable on your $x$ axis as `x` and the variable on your $y$ axis as `y`, e.g.
```{r}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  geom_smooth(mapping = aes(x = carat, y = price),
              method = lm, formula = y ~ poly(x, 4))
```
Be careful: `geom_smooth()` will overlay a trend line on every data set, even if the underlying data is uncorrelated. You can avoid being fooled by also inspecting the raw data or by calculating the correlation between your variables, e.g. `cor(diamonds$carat, diamonds$price)`.
`geom_quantile()` fits a different type of model to your data. Use it to display the results of a quantile regression (see `?rq` for details). Like `geom_smooth()`, `geom_quantile()` takes a formula argument that describes the relationship between $x$ and $y$.
```{r message = FALSE}
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price)) +
  geom_quantile(mapping = aes(x = carat, y = price),
                quantiles = c(0.05, 0.5, 0.95),
                formula = y ~ poly(x, 2))
```
You can modify the hierarchical clustering algorithm by setting the method argument of `hclust()` to one of "complete", "single", "average", or "centroid". The method determines how to measure the distance between two clusters, or between a lone point and a cluster, a measurement that affects the outcome of the algorithm.
* *complete* - Measures the greatest distance between any two points. Tends to create distinct clusters and subclusters.
* *single* - Measures the smallest distance between any two points in the clusters. Tends to add points one at a time to existing clusters, creating ambiguously defined clusters.
* *average* - Measures the average distance between all combinations of points in the separate clusters. Tends to add points one at a time to existing clusters.
* *centroid* - Measures the distance between the average location of the points in each cluster.
```{r fig.height = 4}
iris %>%
select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
dist() %>%
hclust(method = "single") %>%
plot(labels = iris$Species)
```
### K-means clustering
K-means clustering provides a simulation based alternative to hierarchical clustering. It identifies the "best" way to group your data into a predefined number of clusters.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-plotly.png")
```
Use `kmeans()` to perform k-means clustering with R. As with hierarchical clustering, you can only apply k-means clustering to numerical data. Pass your numerical data to the `kmeans()` function, then set `centers` to the number of clusters to search for ($k$) and `nstart` to the number of simulations to run. Since the results of k-means clustering depend on the initial assignment of points to groups, which is random, R will run `nstart` k-means simulations and then return the best results (as measured by the minimum sum of squared distances between each point and the centroid of the group it is assigned to).
Finally, set the maximum number of iterations to let each simulation run in case the simulation cannot quickly find a stable grouping.
```{r}
iris_kmeans <- iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  kmeans(centers = 3, nstart = 20, iter.max = 50)

iris_kmeans$cluster
```
`geom_smooth()` and `geom_quantile()` summarise the relationship between two variables as a function, but you can also summarise the relationship as a bivariate distribution.
`geom_bin2d()` divides the coordinate plane into a two dimensional grid and then displays the number of observations that fall into each bin in the grid. This technique lets you see where the mass of the data lies; bins with a light fill color contain more data than bins with a dark fill color. Bins with no fill contain no data at all.
```{r}
ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price), binwidth = c(0.1, 500))
```
Unlike `hclust()`, the k-means algorithm does not provide an intuitive visual interface. Instead, `kmeans()` returns a kmeans class object. Subset the object with `$cluster` to access the list of cluster assignments for your data set, like `cutree()`, e.g. `iris_kmeans$cluster`. You can visualize the results by mapping them to an aesthetic, or you can apply the results by passing them to dplyr's `group_by()` function.
```{r}
ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point(aes(color = factor(iris_kmeans$cluster), shape = Species))

iris %>%
  group_by(iris_kmeans$cluster) %>%
  summarise(n_obs = n(), avg_width = mean(Sepal.Width), avg_length = mean(Sepal.Length))
```
`geom_hex()` works similarly to `geom_bin2d()`, but it divides the coordinate plane into hexagon shaped bins. This can reduce visual artifacts that are introduced by the aligning edges of rectangular bins.
```{r}
ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price), binwidth = c(0.1, 500))
```
`geom_hex()` requires the `hexbin` package, which you can install with `install.packages("hexbin")`.
`geom_density2d()` uses density contours to display similar information. It is the two dimensional equivalent of `geom_density()`. Interpret a two dimensional density plot the same way you would interpret a contour map: each line connects points of equal density, which makes changes of slope easy to see. As with other summary geoms, `geom_density2d()` makes a useful second layer, as in the `faithful` example above.
### Asking questions about clustering
Both algorithms _will always_ return a set of clusters, whether your data appears clustered or not. As a result, you should always be skeptical about clustering algorithms. Ask yourself:
* Do the clusters seem to identify real differences between your points? How can you tell?
* Are the points within each cluster similar in some way?
* Are the points in separate clusters different in some way?
* Might there be a mismatch between the number of clusters that you found and the number that exist in real life? Are only a couple of the clusters meaningful? Are there more clusters in the data than you found?
* How stable are the clusters if you re-run the algorithm?
Remember to use the results of clustering as a tool for exploration. They can be quite insightful, but there is no reason to treat them as a fact without doing further research.
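For example, a minimal sketch of a stability check, which re-runs `kmeans()` with a different random seed and compares the two groupings (the object name `iris_kmeans2` is only for illustration):
```{r}
# re-run the algorithm with a different random seed
set.seed(2016)
iris_kmeans2 <- iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  kmeans(centers = 3, nstart = 20, iter.max = 50)

# a contingency table shows how often the two runs agree
# (cluster labels may be permuted between runs)
table(iris_kmeans$cluster, iris_kmeans2$cluster)
```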
## Models
A model is a type of summary that describes the relationships in your data. You can use a model to reveal patterns and outliers that only appear in n-dimensional space. To see how this works, consider the simple linear model below. I've fit it to a two dimensional pattern so we can visualize the results.
```{r echo = FALSE}
diamonds2 <- filter(diamonds, x > 3, y > 3, y < 12)
diamond_mod <- lm(y ~ x, data = diamonds2)
resids <- augment(diamond_mod)
diamonds3 <- bind_rows(filter(resids, abs(.resid) > 0.5),
sample_n(filter(resids, abs(.resid) <= 0.5), 1000)) %>%
select(x, y)
ggplot(diamonds3, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
```
The model describes the relationship between x and y as
$$\hat{y} = 0.13 + 0.98 x$$
which is the equation of the blue model line in the graph above. Even if we did not have the graph, we could use the model coefficients in the equation above to determine that a positive relationship exists between $y$ and $x$ such that a one unit increase in $x$ is associated with an approximately one unit increase in $y$. We could use a model statistic, such as adjusted $r^{2}$ to determine that the relationship is very strong (here adjusted $r^{2} = 0.99$).
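If you would like to peek ahead, here is a minimal sketch of how a line like this can be fit and inspected with base R, re-creating the filtered data used for the plot above:
```{r}
# same filter as the plot above: drop the zero-width and outlying diamonds
diamonds2 <- filter(diamonds, x > 3, y > 3, y < 12)

diamond_mod <- lm(y ~ x, data = diamonds2)

coef(diamond_mod)                    # intercept and slope of the model line
summary(diamond_mod)$adj.r.squared   # strength of the relationship
```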
Finally, we could spot outliers in our data by examining the residuals of the model, which are the distances between the actual $y$ values of our data points and the $y$ values that the model would predict for the data points. Observations that are outliers in n-dimensional space will have a residual that is an outlier in one dimensional space. You can find these outliers by plotting a histogram of the residuals or by visualizing the residuals against any variable in a two dimensional plot.
```{r echo = FALSE, fig.width = 3, fig.show='hold'}
diamond_mod <- lm(y ~ x, data = diamonds3)
resids <- augment(diamond_mod)
ggplot(resids) +
geom_histogram(aes(x = .resid), binwidth = 0.1)
ggplot(resids) +
geom_point(aes(x = x, y = .resid))
```
You can examine coefficients, model statistics, and residuals of a model fit to an n-dimensional relationship in the same way, without visualizing the raw data in n-dimensional space.
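As a sketch of what that might look like with broom, here is a model that uses several predictors at once (the model and its name are only for illustration):
```{r}
library(broom)

# a model of price fit to several variables at once
mod_nd <- lm(price ~ carat + depth + table, data = diamonds)

coef(mod_nd)           # one coefficient per predictor
glance(mod_nd)         # one-row summary of model statistics
head(augment(mod_nd))  # data plus fitted values and residuals
```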
I'll postpone teaching you how to fit and interpret models with R until Part 4. Although models are something simple, a description of your data set, they are tied into the logic of statistical inference: if a model describes your data accurately _and_ your data is similar to the world at large, then your model should describe the world at large accurately. This chain of reasoning provides a basis for using models to make inferences and predictions. You'll be able to do more with models if you learn a few more skills before you begin to model data.
## Exploring further
Every data set contains more variables and observations than it displays. You can use the values in your data to compute new variables or to measure new, group-level observations on subgroups of your data. These new variables and observations provide a further source of insights that you can explore with visualizations, clustering algorithms, and models.
### Making new variables
Use dplyr's `mutate()` function to calculate new variables from your existing variables.
```{r}
diamonds %>%
  mutate(volume = x * y * z) %>%
  head()
```
The window functions from Chapter 3 are particularly useful for calculating new variables. To calculate a variable from two or more variables, use basic operators or the `map2()`, `map3()`, and `map_n()` functions from purrr. You will learn more about purrr in Chapter ?.
Statisticians can use R to extract potential variables with more sophisticated algorithms. R provides `prcomp()` for Principal Components Analysis and `factanal()` for factor analysis. The psych and SEM packages also provide further tools for working with latent variables.
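For example, a minimal sketch that extracts principal components from the numeric columns of `iris`; the component scores could then be joined back to the data as new variables:
```{r}
# principal components of the numeric iris measurements
iris_pca <- prcomp(select(iris, -Species), scale. = TRUE)

head(iris_pca$x)  # the derived variables, one column per component
```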
There are two ways to add three (or more) variables to a two dimensional plot: you can map the additional variables to aesthetics within the plot, or you can use a geom that is designed to visualize three variables. `ggplot2` provides three geoms that are designed to display three variables: `geom_raster()`, `geom_tile()` and `geom_contour()`. These geoms generalize `geom_bin2d()` and `geom_density2d()` to display a third variable instead of a count or a density.
### Making new observations
If your data set contains subgroups, you can derive from your data a new data set of observations that describe the subgroups. To do this, first use dplyr's `group_by()` function to group the data into subgroups. Then use dplyr's `summarise()` function to calculate group level values. The measures of location, rank and spread listed in Chapter 3 are particularly useful for describing subgroups.
```{r}
mpg %>%
group_by(class) %>%
summarise(n_obs = n(), avg_hwy = mean(hwy), sd_hwy = sd(hwy))
```
Models have one of the richest literatures on how to select and test them, so we've reserved them for their own section. Modelling brings together the various components of data science more so than any other data science task, so we'll postpone its coverage until you can program and wrangle data, two skills that will aid your ability to select models.
## A last word on variables, values, and observations
Variables, values, and observations provide a basis for Exploratory Data Analysis: if a relationship exists between two variables, then the relationship will exist between the values of those variables when those values are measured in the same observation. As a result, relationships between variables will appear as patterns in your data.
Within any particular observation, the exact form of the relationship between values may be obscured by mediating factors, measurement error, or random noise, which means the patterns in your data will appear as signals obscured by noise.
Due to a quirk of the human cognitive system, the easiest way to spot the signal amidst the noise is to visualize your data. The concepts of variables, values, and observations make this easy to do. To visualize your data, represent each observation with its own geometric object, such as a point. Then map each variable to an aesthetic property of the point, setting specific values of the variable to specific levels of the aesthetic. Or compute group-level statistics (i.e. observations) and map them to geoms, something that `geom_bar()`, `geom_boxplot()` and other geoms do for you automatically.
## Exploratory Data Analysis and Data Science
As a term, data science has been used in many ways by different people. This fluidity is necessary for a term that describes a wide breadth of activity, as data science does. Although different data science activities will take different forms, you can use the principles in this chapter to build a general model of data science. The model requires one limit to the definition of data science: data science must rely in some way on human judgement and expertise.
To judge or interpret the information in a data set, you must first comprehend that information. Data is difficult to comprehend, which means that you need to visualize, model, and transform it, a process that we have referred to as Exploratory Data Analysis.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-1.png")
```
Once you comprehend the information in your data, you can use it to make inferences from your data. Often this involves making deductions from a model. This is what you do when you conduct a hypothesis test, make a prediction (with or without a confidence interval), or score cases in a database.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-2.png")
```
But all of this will involve a computer; you can make little headway with pencil and paper calculations when you work with data. To work efficiently, you will need to know how to program in a computer language, such as R, import data to use with that language, and tidy the data into the format that works best for that language.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-3.png")
```
Finally, if your work is meaningful at all you will need to report it in a way that your audience can understand. Your audience might be fellow scientists who will want to ensure that the work is reproducible, non-scientists who will need to understand your findings in plain language, or future you who will be thankful if you make it easy to come back up to speed on your work and recreate it as necessary. To satisfy these audiences, you may choose to communicate your results in a report or to bundle your work into some type of useful format, like a package or a Shiny app.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-4.png")
```
This model forms a roadmap for the rest of the book.
* Part 1 of the book covered the central tasks of the model above, Exploratory Data Analysis.
* Part 2 will cover the logistical tasks of working with data in a computer language: importing and tidying the data, skills I call Data Wrangling.
* Part 3 will teach you some of the most efficient ways to program in R with data.
* Part 4 discusses models and how to apply them.
* Part 5 will teach you the most popular format for reporting and reproducing the results of an R analysis.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-data-science-5.png")
```
View File
@ -2,29 +2,11 @@
> "The simple graph has brought more information to the data analysts mind than any other device."---John Tukey
If you are like most humans, your brain is not designed to work with raw data. The working memory can only attend to a few values at a time, which makes it difficult to discover patterns in raw data. For example, can you spot the striking relationship between $X$ and $Y$ in the table below?
```{r data, echo=FALSE}
x <- rep(seq(0.2, 1.8, length = 5), 2) + runif(10, -0.15, 0.15)
X <- c(0.02, x, 1.94)
Y <- sqrt(1 - (X - 1)^2)
Y[1:6] <- -1 * Y[1:6]
Y <- Y - 1
order <- sample(1:10)
knitr::kable(round(data.frame(X = X[order], Y = Y[order]), 2))
```
While we may stumble over raw data, we can easily process visual information. Within your mind is a visual processing system that has been fine-tuned by thousands of years of evolution. As a result, the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values above fall on a circle.
```{r echo=FALSE, dependson=data}
ggplot2::qplot(X, Y) + ggplot2::coord_fixed(ylim = c(-2.5, 2.5), xlim = c(-2.5, 2.5))
```
This chapter will teach you how to visualize your data with R and the `ggplot2` package. R contains several systems for making graphs, but the `ggplot2` system is one of the most beautiful and most versatile. `ggplot2` implements the *grammar of graphics*, a coherent system for describing and building graphs. With `ggplot2`, you can do more faster by learning one system and applying it in many places.
### Prerequisites
To access the data sets and functions that we will use in this chapter, load the `ggplot2` package:
To access the data sets, help pages, and functions that we will use in this chapter, load the `ggplot2` package:
```{r echo = FALSE, message = FALSE, warning = FALSE}
library(ggplot2)
@ -37,7 +19,7 @@ library(ggplot2)
## A code template
Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear? Strong? Weak?
Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?
You can test your answer with the `mpg` data set in the `ggplot2` package. The data set contains observations collected by the EPA on 38 models of car. Among the variables in `mpg` are
@ -46,16 +28,6 @@ You can test your answer with the `mpg` data set in the `ggplot2` package. The d
To learn more about `mpg`, open its help page with the command `?mpg`.
***
*Tip*: If you have trouble loading `mpg`, its help page, or any of the functions in this chapter, you may need to reload the `ggplot2` package with the command below. You will need to reload the package each time you start a new R session.
```{r eval=FALSE}
library(ggplot2)
```
***
To plot `mpg`, open an R session and run the code below. The code plots the `displ` variable of `mpg` against the `hwy` variable.
```{r}
@ -65,7 +37,7 @@ ggplot(data = mpg) +
The plot shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines use more fuel. Does this confirm your hypothesis about fuel efficiency and engine size?
Our code is almost a template for making plots with `ggplot2`.
Pay close attention to this code because it is almost a template for making plots with `ggplot2`.
```{r eval=FALSE}
ggplot(data = mpg) +
@ -74,18 +46,18 @@ ggplot(data = mpg) +
With `ggplot2`, you begin a plot with the function `ggplot()`. `ggplot()` creates a coordinate system that you can add layers to. The first argument of `ggplot()` is the data set to use in the graph. So `ggplot(data = mpg)` creates an empty graph that will use the `mpg` data set.
You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to your plot, which creates a scatterplot. `ggplot2` comes with many geom functions that each add a different type of layer to a plot. Each geom function in `ggplot2` takes a mapping argument.
You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to your plot, which creates a scatterplot. `ggplot2` comes with many geom functions that each add a different type of layer to a plot.
The mapping argument of your geom function explains where your points should go. You must set `mapping` to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of your plot. `ggplot()` will look for those variables in your data set, `mpg`.
Each geom function in `ggplot2` takes a mapping argument. The mapping argument of your geom function explains where your points should go. You must set `mapping` to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of your plot. `ggplot()` will look for those variables in your data set, `mpg`.
This code suggests a reusable template for making graphs with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a data set, a geom function, or a set of mappings.
Let's turn this code into a reusable template for making graphs with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a data set, a geom function, or a set of mappings.
```{r eval = FALSE}
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
The sections below will show you how to complete and extend this template to make a graph. We will begin with `<MAPPINGS>`.
The rest of this chapter will show you how to complete and extend this template to make different types of graphs. We will begin with the `<MAPPINGS>` component.
## Aesthetic mappings
@ -120,7 +92,7 @@ The colors reveal that many of the unusual points are two seater cars. These car
In the above example, we mapped `class` to the color aesthetic, but we could have mapped `class` to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation.
```{r}
```{r warning=FALSE}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
```
@ -139,13 +111,9 @@ ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
***
What happened to the suv's? `ggplot2` will only use six shapes at a time. Additional groups will go unplotted when you use this aesthetic.
**Tip** - What happened to the suv's? `ggplot2` will only use six shapes at a time. Additional groups will go unplotted when you use this aesthetic.
***
In each case, you set the name of the aesthetic to the variable to display, and you do this within the `aes()` function. The `aes()` function gathers together each of the aesthetic mappings used by a layer and passes them to the layer's mapping argument. The syntax highlights a useful insight because you also set `x` and `y` to variables within `aes()`. The insight is that the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
For each aesthetic, you set the name of the aesthetic to the variable to display, and you do this within the `aes()` function. The `aes()` function gathers together each of the aesthetic mappings used by a layer and passes them to the layer's mapping argument. The syntax highlights a useful insight because you also set `x` and `y` to variables within `aes()`. The insight is that the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of levels to use for the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.
@ -156,7 +124,7 @@ ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
```
Here, the color doesn't convey information about a variable. It only changes the appearance of the plot. To set an aesthetic manually, do not place it in the `aes()` function. Call the aesthetic by name as an argument of your geom function. Then pass the aesthetic a value that R will recognize, such as
Here, the color doesn't convey information about a variable. It only changes the appearance of the plot. To set an aesthetic manually, do not place it in the `aes()` function. Call the aesthetic by name as an argument of your geom function. Then pass the aesthetic a level that R will recognize, such as
* the name of a color as a character string
* the size of a point as a cex expansion factor (see `?par`)
@ -193,10 +161,10 @@ pchShow <-
pchShow()
```
If you get an odd result, double check that you are calling the aesthetic as its own argument (and not calling it from inside of `mapping = aes()`. We like to think of aesthetics like this, if you set the aesthetic:
If you get an odd result, double check that you are calling the aesthetic as its own argument (and not calling it from inside of `mapping = aes()`). I like to think of aesthetics like this, if you set the aesthetic:
* _inside_ of the `aes()` function, `ggplot2` will map the aesthetic to data values and build a legend.
* _outside_ of the `aes()` function, `ggplot2` will directly set the aesthetic to your input.
* _inside_ of the `aes()` function, `ggplot2` will **map** the aesthetic to data values and build a legend.
* _outside_ of the `aes()` function, `ggplot2` will **set** the aesthetic to a level that you supply manually.
### Exercises
@ -270,17 +238,35 @@ Next to each geom is a visual representation of the geom. Beneath the geom is a
To learn more about any single geom, open its help page in R by running the command `?` followed by the name of the geom function, e.g. `?geom_smooth`.
***
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-geoms-1.png")
```
**Tip** - Many geoms use a single object to describe all of the data. For example, `geom_smooth()` uses a single line. For these geoms, you can set the group aesthetic to a discrete variable to draw multiple objects. `ggplot2` will draw a separate object for each unique value of the grouping variable.
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-geoms-2.png")
```
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-geoms-3.png")
```
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-geoms-4.png")
```
Many geoms use a single object to describe all of the data. For example, `geom_smooth()` uses a single line. For these geoms, you can set the group aesthetic to a discrete variable to draw multiple objects. `ggplot2` will draw a separate object for each unique value of the grouping variable.
In practice, `ggplot2` will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the `linetype` example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-geoms-1.png")
knitr::include_graphics("images/visualization-geoms-2.png")
knitr::include_graphics("images/visualization-geoms-3.png")
knitr::include_graphics("images/visualization-geoms-4.png")
```{r, fig.show='hold', fig.height = 2.5, fig.width = 2.5}
ggplot(diamonds) +
geom_smooth(aes(x = carat, y = price))
ggplot(diamonds) +
geom_smooth(aes(x = carat, y = price, group = cut))
ggplot(diamonds) +
geom_smooth(aes(x = carat, y = price, color = cut))
```
## Layers
@ -301,7 +287,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth()
```
If you place mappings in a geom function, `ggplot2` will treat them as local mappings. It will use these mappings to extend or overwrite the global mappings _for that geom only_. This provides an easy way to differentiate geoms.
If you place mappings in a geom function, `ggplot2` will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings _for that layer only_. This provides an easy way to differentiate layers.
```{r, message = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
@ -309,12 +295,12 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth()
```
You can use the same system to specify individual data sets for each layer. Here, our smooth line displays just a subset of the `mpg` data set, the cars with eight cylinder engines. The local data argument in `geom_smooth()` overrides the global data argument in `ggplot()` for the smooth layer only.
You can use the same system to specify individual data sets for each layer. Here, our smooth line displays just a subset of the `mpg` data set, the subcompact cars. The local data argument in `geom_smooth()` overrides the global data argument in `ggplot()` for the smooth layer only.
```{r, message = FALSE, warning = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(data = subset(mpg, cyl == 8))
geom_point(aes(color = class)) +
geom_smooth(data = subset(mpg, class == "subcompact"))
```
### Exercises
@ -423,7 +409,7 @@ ggplot(data = diamonds) +
### Position = "fill"
`position = "fill"` places overlapping objects above one another. However, it scales the objects to take up all of the available vertical space. As a result, `position = "fill"` makes it easy to compare relative frequencies across groups.
`position = "fill"` places overlapping objects above one another. However, it scales the objects to take up all of the available vertical space. As a result, `position = "fill"` makes it easy to compare relative proportions across groups.
```{r}
ggplot(data = diamonds) +
@ -460,19 +446,11 @@ ggplot(data = mpg) +
ggtitle('Position = "jitter"')
```
But isn't random noise, you know, bad? It *is* true that jittering your data will make it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. Occasionally, jittering will reveal a pattern that was hidden within the grid.
This may seem like a bad idea since jittering will make your graph less accurate at the local level, but jittering may make your graph _more_ revealing at the global level. Occasionally, jittering will reveal a pattern that was hidden within the grid.
***
`ggplot2` comes with a special geom `geom_jitter()` that is the exact equivalent of `geom_point(position = "jitter")`.
**Tip** - `ggplot2` comes with a special geom `geom_jitter()` that is the exact equivalent of `geom_point(position = "jitter")`.
***
***
**Tip** - To learn more about a position adjustment, look up the help page associated with each adjustment: `?position_dodge`, `?position_fill`, `?position_identity`, `?position_jitter`, and `?position_stack`.
***
To learn more about a position adjustment, look up the help page associated with each adjustment: `?position_dodge`, `?position_fill`, `?position_identity`, `?position_jitter`, and `?position_stack`.
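
As a rough sketch of the `geom_jitter()` shorthand mentioned above, the two calls below should draw equivalent jittered versions of the `mpg` scatterplot.

```{r, fig.show='hold', fig.height = 2.5, fig.width = 2.5}
# position = "jitter" adds a small amount of random noise to each point
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

# geom_jitter() is a convenience geom that applies the same adjustment
ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))
```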
## Stats
@ -491,14 +469,14 @@ head(diamonds)
Where does count come from?
Some graphs, like scatterplots, plot the raw values of your data set. Other graphs, like bar charts, do not plot raw values at all. These graphs apply an algorithm to your data and then plot the results of the algorithm. Consider how often graphs do this.
Some graphs, like scatterplots, plot the raw values of your data set. Other graphs, like bar charts, calculate new values to plot.
* **bar charts** and **histograms** bin your data and then plot bin counts, the number of points that fall in each bin.
* **smooth lines** fit a model to your data and then plot the model line.
* **boxplots** calculate the quartiles of your data and then plot the quartiles as a box.
* and so on.
`ggplot2` calls the algorithm that a graph uses to transform raw data a _stat_, which is short for statistical transformation. Each geom in `ggplot2` is associated with a default stat that it uses to plot your data. `geom_bar()` uses the "count" stat, which computes a data set of counts for each x value from your raw data. `geom_bar()` then uses this computed data to make the plot.
`ggplot2` calls the algorithm that a graph uses to calculate new values a _stat_, which is short for statistical transformation. Each geom in `ggplot2` is associated with a default stat that it uses to calculate values to plot. The figure below describes how this process works with `geom_bar()`.
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-stat-bar.png")
@ -511,11 +489,11 @@ A few geoms, like `geom_point()`, plot your raw data as it is. These geoms also
knitr::include_graphics("images/visualization-stat-point.png")
```
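
To see the pairing in action, consider a small sketch: because `geom_bar()` defaults to the count stat, calling `stat_count()` directly should draw the same bar chart.

```{r, fig.show='hold', fig.height = 2.5, fig.width = 2.5}
# geom_bar() uses the count stat by default...
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# ...so calling stat_count() directly should produce the same bar chart
ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))
```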
You can learn which stat a geom uses, as well as what variables it computes by visiting the geom's help page. For example, the help page of `geom_bar()` shows that it uses the count stat and that the count stat computes two new variables, `count` and `prop`. If you have an R session open---and you should!---you can verify this by running `?geom_bar` at the command line.
You can learn which stat a geom uses, as well as what variables it computes, by visiting the geom's help page. For example, the help page of `geom_bar()` shows that it uses the count stat and that the count stat computes two new variables, `count` and `prop`. If you have an R session open, you can verify this by running `?geom_bar` at the command line.
Stats are the most subtle part of plotting because you do not see them in action. `ggplot2` applies the transformation and stores the results behind the scenes. You only see the finished plot. Moreover, `ggplot2` applies stats automatically, with a very intuitive set of defaults. So why bother thinking about stats? Because you can use stats to do three very useful things.
Stats are the most subtle part of plotting because you do not see them in action. `ggplot2` applies the transformation and stores the results behind the scenes. You only see the finished plot. Moreover, `ggplot2` applies stats automatically, with a very intuitive set of defaults. As a result, you rarely need to adjust a geom's stat. However, you can do three things with a geom's stat if you wish to.
First, you can change the stat that a geom uses. To do this, set the geom's stat argument. For example, you can map the heights of your bars to raw values---not counts---if you change the stat of `geom_bar()` from "count" to "identity". This works best if your data contains one value per bar, as in the demo data set below. Add a $y$ aesthetic, and map it to the variable that contains the bar heights.
First, you can change the stat that the geom uses with the geom's stat argument. In the code below, I change the stat of `geom_bar()` from count (the default) to identity. This lets me map the height of the bars to the raw values of a $y$ variable.
```{r}
demo <- data.frame(
@ -523,44 +501,42 @@ demo <- data.frame(
b = c(20, 30, 40)
)
demo
ggplot(data = demo) +
geom_bar(mapping = aes(x = a, y = b), stat = "identity")
demo
```
Use consideration when you change a geom's stat. Many combinations of geoms and stats will create incompatible results. In practice, you will almost always use a geom's default stat.
I provide a list of the stats that are available to use in ggplot2 at the end of this section. Be careful when you change a geom's stat. Many combinations of geoms and stats will create incompatible results. In practice, you will almost always use a geom's default stat.
Second, you can customize how a stat does its job. For example, the count stat takes a width parameter that it uses to set the widths of the bars in a bar plot. To pass a width value to the stat, provide a width argument to your geom function. `width = 1` will make the bars wide enough to touch each other.
Second, you can give some stats arguments by passing the arguments to your geom function. In the code below, I pass a width argument to the count stat, which controls the widths of the bars. `width = 1` will make the bars wide enough to touch each other.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut), width = 1)
```
***
You can learn which arguments a stat takes at the stat's help page. To open the help page, place the prefix `?stat_` before the name of the stat, then run the command at the command line, e.g. `?stat_count`.
**Tip** - You can learn which arguments a stat takes and how it uses them at the stat's help page. To open the help page, place the prefix `?stat_` before the name of the stat, then run the command at the command line, e.g. `?stat_count`.
***
Finally, you can tell `ggplot2` how to use the stat. Many stats in `ggplot2` create multiple variables, some of which go unused. For example, `geom_count()` uses the "sum" stat to create bubble charts. Each bubble represents a group of data points, and the size of the bubble displays how many points are in the group (e.g. the count of the group).
Finally, you can use extra variables created by the stat. Many stats in `ggplot2` create multiple variables, some of which go unused. For example, `geom_count()` uses the "sum" stat to create bubble charts. Each bubble represents a group of data points, and the size of the bubble displays how many points are in the group (e.g. the count of the group).
```{r}
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = clarity))
```
The sum stat creates two variables, `n` (count) and `prop`. By default, `geom_count()` uses the `n` variable to create the size of each bubble. You can use the prop variable in combination with a _group_ aesthetic to display the proportion of observations in each row (or column) that appear in the bubbles. To tell `geom_count()` to use the prop variable, map $size$ to `..prop..`. The two dots that surround prop notify `ggplot2` that the prop variable appears in the transformed data set that is created by the stat, and not in the raw data set. Be sure to include these dots whenever you refer to a variable that is created by a stat.
The help page of `?stat_sum` reveals that the sum stat creates two variables, n (count) and prop. By default, `geom_count()` uses the n variable to create the size of each bubble. To tell `geom_count()` to use the prop variable, map $size$ to `..prop..`. The two dots that surround prop notify `ggplot2` that the prop variable appears in the transformed data set that is created by the stat, and not in the raw data set. Be sure to include these dots whenever you refer to a variable that is created by a stat.
```{r}
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = clarity, size = ..prop.., group = clarity))
```
If you set _group_ to the $x$ variable, `..prop..` will show proportions across columns. If you set it to the $y$ variable, `..prop..` will show proportions across rows, as in the plot above. Here, the proportions in each row sum to one.
For `geom_count()`, the `..prop..` variable does not do anything useful until you set a group aesthetic. If you set _group_ to the $x$ variable, `..prop..` will show proportions across columns. If you set it to the $y$ variable, `..prop..` will show proportions across rows, as in the plot above. Here, the proportions in each row sum to one.
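
For comparison, here is a sketch of the column-wise version. The only change from the plot above is that _group_ is set to `cut`, the $x$ variable, so the proportions within each column should sum to one.

```{r}
# group = cut: ..prop.. is now computed within each column of the chart
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = clarity, size = ..prop.., group = cut))
```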
The best way to discover which variables are created by a stat is to visit the stat's help page. `ggplot2` provides 22 stats for you to use. Each stat is saved as a function, which provides a convenient way to access a stat's help page, e.g. `?stat_identity`.
In most cases, you will not want to switch the default variable supplied by a stat. Many stats only return one useful variable. The best way to discover which variables are created by a stat is to visit the stat's help page.
`ggplot2` provides 22 stats for you to use. Each stat is saved as a function, which provides a convenient way to access a stat's help page, e.g. `?stat_identity`.
The table below describes each stat in `ggplot2` and lists the parameters that the stat takes, as well as the variables that the stat makes.
@ -580,7 +556,7 @@ ggplot(data = diamonds) +
coord_polar()
```
Answer: A coxcomb plot is a bar chart plotted in polar coordinates. If this seems surprising, consider how you would make a coxcomb plot with `ggplot2`.
A coxcomb plot is a bar chart plotted in polar coordinates. If this seems surprising, consider how you would make a coxcomb plot with `ggplot2`.
To make a coxcomb plot, first build a bar chart and then add `coord_polar()` to your plot call. Polar bar charts will look better if you also set the width parameter of `geom_bar()` to 1. This will ensure that no space appears between the bars.
@ -592,7 +568,7 @@ ggplot(data = diamonds) +
You can use `coord_polar()` to turn any plot in `ggplot2` into a polar chart. Whenever you add `coord_polar()` to a plot's call, `ggplot2` will draw the plot on a polar coordinate system. It will map the plot's $y$ variable to $r$ and the plot's $x$ variable to $\theta$. You can reverse this behavior by passing `coord_polar()` the argument `theta = "y"`.
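
Here is a small sketch of the reversal. Both calls draw the same bar chart of `cut` (with a fill mapping added only to make the wedges easier to tell apart); the second call adds `theta = "y"` to swap the roles of $r$ and $\theta$.

```{r, fig.show='hold', fig.height = 3, fig.width = 3}
# default: cut is mapped to theta, count to r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
  coord_polar()

# theta = "y": cut is mapped to r, count to theta
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
  coord_polar(theta = "y")
```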
Polar coordinates unlock another riddle as well. You may have noticed that `ggplot2` does not come with a pie chart geom. Why would that be? In practice, a pie chart is just a stacked bar chart plotted in polar coordinates. To make a pie chart in `ggplot2`, create a stacked bar chart and:
Polar coordinates unlock another riddle as well. You may have noticed that `ggplot2` does not come with a pie chart geom. In practice, a pie chart is a stacked bar chart plotted in polar coordinates. To make a pie chart in `ggplot2`, create a stacked bar chart and:
1. ensure that the x axis only has one value. An easy way to do this is to set `x = factor(1)`.
2. set the width of the bar to one, e.g. `width = 1`
@ -607,17 +583,12 @@ ggplot(data = diamonds) +
`ggplot2` comes with eight coordinate functions that you can use in the same way as `coord_polar()`. The table below describes each function and what it does. Add any of these functions to your plot's call to change the coordinate system that the plot uses.
***
You can learn more about each coordinate system by opening its help page in R, e.g. `?coord_cartesian`, `?coord_fixed`, `?coord_flip`, `?coord_map`, `?coord_polar`, and `?coord_trans`.
```{r, echo = FALSE}
knitr::include_graphics("images/visualization-coordinate-systems.png")
```
***
**Tip** - You can learn more about each coordinate system by opening its help page in R, e.g. `?coord_cartesian`, `?coord_fixed`, `?coord_flip`, `?coord_map`, `?coord_polar`, and `?coord_trans`.
***
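
As one small example of an alternative coordinate system, `coord_flip()` swaps the $x$ and $y$ positions, which turns a vertical bar chart into a horizontal one. This is just a sketch; the table above covers the full set of options.

```{r, fig.height = 3, fig.width = 5}
# coord_flip() exchanges the x and y positions, so the bars run sideways
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut)) +
  coord_flip()
```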
## Facets
@ -632,7 +603,7 @@ ggplot(data = diamonds) +
facet_wrap( ~ clarity)
```
To facet your plot on the combinations of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is also a formula. This time the formula should contain two variable names separated by a `~`.
To facet your plot on the combination of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is also a formula. This time the formula should contain two variable names separated by a `~`.
```{r fig.height = 7, fig.width = 7}
ggplot(data = diamonds) +
@ -645,13 +616,7 @@ Here the first subplot displays all of the points that have an `I1` code for `cl
If you prefer not to facet on the rows or columns dimension, place a `.` instead of a variable name before or after the `~`, e.g. `+ facet_grid(. ~ clarity)`.
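
For example, here is a minimal sketch that keeps the polar chart from above but facets it on `clarity` across the columns only; the rows dimension is left unfaceted.

```{r fig.height = 3, fig.width = 7}
# the dot before the ~ means "do not facet on the rows dimension"
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
  coord_polar() +
  facet_grid(. ~ clarity)
```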
Faceting works on more than just polar charts. You can add `facet_wrap()` or `facet_grid()` to any plot in `ggplot2`. For example, you could facet our original scatterplot.
```{r fig.height = 6, fig.width = 6}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class)
```
Faceting works on more than just polar charts. You can add `facet_wrap()` or `facet_grid()` to any plot in `ggplot2`.
### Exercises
@ -699,7 +664,6 @@ To see how this works, consider how you could build a basic plot from scratch: y
knitr::include_graphics("images/visualization-grammar-1.png")
```
***
Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic.
```{r, echo = FALSE}
@ -713,5 +677,3 @@ knitr::include_graphics("images/visualization-grammar-3.png")
```
You could use this method to build _any_ plot that you imagine. In other words, you can use the code template that you've learned in this chapter to build hundreds of thousands of unique plots.
You could use these plots to explore and understand your data. Or you could use them to share insights about your data with others. Chapter ? will provide some tips on how to use graphs to discover insights in your data. Chapter ? will provide some tips on how to prepare your graphs to share with others. Graphs make excellent communication tools, especially when they are self-explanatory.
View File
@ -4,7 +4,7 @@
With data, the relationships between values matter as much as the values themselves. Tidy data encodes those relationships.
Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames but they encode some patterns that make modern usage of R better. Unfortunately R is an old language, and things that made sense 10 or 20 years a go are no longer as valid. It's difficult to change base R without breaking existing code, so most innovation occurs in packages, providing new functions that you should use instead of the old ones.
Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames but they encode some patterns that make modern usage of R better. Unfortunately R is an old language, and things that made sense 10 or 20 years ago are no longer as valid. It's difficult to change base R without breaking existing code, so most innovation occurs in packages, providing new functions that you should use instead of the old ones.
```{r}
library(tibble)
@ -38,7 +38,7 @@ frame_data(
)
```
## Tibbles vs data frames
## Tibbles vs. data frames
There are two main differences in the usage of a data frame vs. a tibble: printing and subsetting.
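
As a rough sketch of the printing difference (this assumes the `as_data_frame()` coercion function exported by the tibble package, and the data below is made up purely for illustration): a plain data frame floods the console with all of its rows, while the tibble version prints only the first ten rows along with the column types.

```{r}
library(tibble)

# made-up data: 1000 rows, so the difference in printing is obvious
df <- data.frame(x = 1:1000, y = sample(letters, 1000, replace = TRUE))

# as a tibble, only the first ten rows and the column types are printed
as_data_frame(df)
```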