Adds model section to EDA chapter. Edits EDA chapter.

This commit is contained in:
Garrett 2016-05-20 17:15:21 -04:00
parent 725ae26de0
commit 671f8800b4
2 changed files with 250 additions and 76 deletions

BIN
images/EDA-boxplot.pdf Normal file

Binary file not shown.

View File

@ -1,5 +1,7 @@
# Exploratory Data Analysis (EDA)
# Exploratory Data Analysis (EDA)
```{r include = FALSE}
library(ggplot2)
library(dplyr)
@ -13,23 +15,25 @@ Visualization and transformation are the most useful tools for exploring your da
2. searching for answers by visualizing and transforming your data
3. using what you discover to refine your questions about the data, or to choose new questions to investigate
There is no formal way to do Exploratory Data Analysis because you must be free to investigate every insight that occurs to you. However, some tactics will lead more reliably to insights than others. This chapter will teach you a basic toolkit of the most useful EDA techniques.
There is no formal way to do Exploratory Data Analysis because you must be free to investigate every insight that occurs to you. However, some tactics will lead reliably lead to insights. This chapter will teach you a basic toolkit of these useful EDA techniques.
## Questions
> "There are no routine statistical questions, only questionable statistical routines."---Sir David Cox
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.."---John Tukey
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."---John Tukey
EDA begins with questions. The questions that you ask about your data will guide your attention as you search for insights. Good questions will lead you to discoveries that let you ask better questions.
Your goal during EDA is to develop a complete understanding of your data set and the information that it contains. The easiest way to do this is to use questions as tools that guide your investigation. When you ask a question, the question focuses your attention on a specific part of your data set and helps you decide which graphs or models to make.
There is no rule about which questions you should ask to guide your research. You will often begin with one set of questions and then replace them as your understanding of your data deepens. If you ever find yourself at a loss for questions, two types of questions will always be useful for making discoveries with your data. You can loosely word them as
During EDA, the _quantity_ of questions that you ask matters more than the quality of the questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your data set. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow each question with a new question based on what you find.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word them as
1. What type of **variation** occurs **within** my variables? and
2. What type of **covariation** occurs **between** my variables?
The rest of this chapter will look at these two questions. I'll show you the best ways to use visualization and summaries to explore variation and covariation. Our discussion will lead to a model of data science itself, the model that we've built this book around. To make the discussion easier, let's define some terms:
The rest of this chapter will look at these two questions. I'll explain what variation and covariation are, and I'll show you several ways to answer each question. Our discussion will lead to a model of data science itself, the model that I've built this book around. To make the discussion easier, let's define some terms:
* A _variable_ is a quantity, quality, or property that you can measure.
@ -41,9 +45,7 @@ The rest of this chapter will look at these two questions. I'll show you the bes
> "What type of variation occurs within my variables?"
**Variation** is the tendency of the values of a variable to change from measurement to measurement.
You can see variation easily in real life; if you measure any continuous variable twice---and precisely enough, you will get two different results. This is true even if you measure quantities that should be constant, like the speed of light (below). Each of your measurements will include a small amount of error that varies from measurement to measurement.
**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice---and precisely enough, you will get two different results. This is true even if you measure quantities that should be constant, like the speed of light (below). Each of your measurements will include a small amount of error that varies from measurement to measurement.
```{r, variation, echo = FALSE}
@ -58,16 +60,14 @@ Each variable contains its own pattern of variation, which can reveal interestin
### Visualizing distributions
How you visualize the distribution will depend on whether your variable is categorical or continuous.
A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors, integers, or character strings. To examine the distribution of a categorical variable, use a bar chart.
How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors, integers, or character strings. To examine the distribution of a categorical variable, use a bar chart.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
The height of the bars displays how many observations occurred with each x value. If you would like, these exact values, wyou can compute them with R's `table()` function.
The height of the bars displays how many observations occurred at each x value. You can compute these values manually with `table()`.
```{r}
table(diamonds$cut)
@ -82,7 +82,7 @@ ggplot(data = diamonds) +
A histogram divides the x axis into equally spaced intervals and then uses a bar to display how many observations fall into each interval. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.
In a histogram, the intervals are known as **bins** and the process of creating intervals is known as **binning**. You can set the binwidth of the intervals with the `binwidth` argument of `geom_histogram()`, which is measured in the units of the $x$ axis. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks with a binwidth of 0.01.
You can set the width of the intervals with the `binwidth` argument of `geom_histogram()`, which is measured in the units of the $x$ axis. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks with a binwidth of 0.01.
```{r message = FALSE}
ggplot(data = diamonds) +
@ -91,7 +91,7 @@ ggplot(data = diamonds) +
If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` or `geom_density2d()` instead of `geom_histogram()`. `geom_freqpoly()` makes a frequency polygon, a line that connects the tops of the bars that would appear in a histogram. Like `geom_histogram()`, `geom_freqpoly()` accepts a binwidth argument.
`geom_density()` plots a one dimensional kernel density estimate of a variable's distribution. The result is a smooth version of the information contained in a histogram or a frequency polygon. You can control the smoothness of the density with `adjust`. `geom_density()` displays $density$---not $count$---on the y axis; the area under each curve will be normalized to one, no matter how many total observations occur in the subgroup.
`geom_density()` plots a one dimensional kernel density estimate of a variable's distribution. The result is a smooth version of the information contained in a histogram or a frequency polygon. You can control the smoothness of the density with `adjust`. `geom_density()` displays $density$---not $count$---on the y axis; the area under each curve will be normalized to one, no matter how many total observations occur in the subgroup, which makes it easier to compare subgroups.
```{r message = FALSE, fig.show='hold', fig.width=3}
zoom <- coord_cartesian(xlim = c(55, 70))
@ -107,30 +107,30 @@ ggplot(data = diamonds) +
## Follow up questions
Now that you can visualize variation, what should you look for in your plots? Here are the most useful types of information in any graph:
Now that you can visualize variation, what should you look for in your plots? and what type of follow up questions should you ask? I've put together below a list of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
* *Typical Values*
* *Typical values*
In both bar charts and histograms, tall bars reveal common values of a variable. Shorter bars reveal less common or rare values. Places that do not have bars reveal seemingly impossible values. To turn this information into a useful question, look for anything unexpected:
+ Which values are the most common? Why?
+ Which values are the most common? Why might that be?
+ Which values are the most rare? Why?
+ Which values are the most rare? Why might that be?
+ Is there an unusual pattern to the frequencies? Why?
+ Is there an unusual pattern to the frequencies? Why might that be?
+ Do the typical values change if you look at individual subgroups of the data?
+ Do the typical values change if you look at subgroups of the data?
For example, the histogram below suggests several interesting questions: Why are there more diamonds at whole carats and common fractions of carats? Why are there slightly more diamonds above each of these peaks than there are slightly below each of these peaks?
As an example, the histogram below suggests several interesting questions: Why are there more diamonds at whole carats and common fractions of carats? Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
```{r echo = FALSE, message = FALSE, fig.height = 2}
```{r echo = FALSE, message = FALSE, warning = FALSE, fig.height = 2}
ggplot(data = diamonds) +
geom_histogram(aes(x = carat), binwidth = 0.01) + xlim(0, 3)
```
* *Range of values*
The range, or spread, of values in the distribution reveals how certain you can be when you make predictions about a variable. If the variable only takes a narrow set of values, like below, you are unlikely to be far off if you make a prediction about a future observation. Even if the observation takes a value at the distant extreme of the distribution, the value will not be far from your guess.
The range of values in the distribution reveals how certain you can be when you make predictions about a variable. If the variable only takes a narrow set of values, like below, you are unlikely to be far off if you make a prediction about a future observation. Even if the observation takes a value at the distant extreme of the distribution, the value will not be far from your guess.
```{r echo = FALSE, message = FALSE, fig.height = 2}
mpg$hwy2 <- mpg$hwy / 10 + 22
@ -143,11 +143,11 @@ Now that you can visualize variation, what should you look for in your plots? He
ggplot(mpg) + geom_histogram(aes(x = hwy), binwidth = 1) + xlim(10, 45)
```
As a quick rule, narrow distributions imply less uncertainty when making predictions about a variable; wide distributions imply more uncertainty. Ask yourself
As a quick rule, wide distributions imply less certainty when making predictions about a variable; narrow distributions imply more certainty. Ask yourself
+ Do your data show a surprising amount of certainty or uncertainty? Why?
+ Does the spread of the data change if you look at individual subgroups of the data?
+ Does the range of the distribution change if you look at individual subgroups of the data?
* *Outliers*
@ -172,41 +172,47 @@ Now that you can visualize variation, what should you look for in your plots? He
+ How are the observations in separate clusters different from each other?
+ How can you explain or describe the clusters?
+ Why might the appearance of clusters be misleading?
The histogram below displays the length in minutes of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. You can spot two distinct clusters; Old Faithful appears to oscillate between short and long eruptions.
The histogram below displays two distinct clusters. It shows the length in minutes of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park; Old Faithful appears to oscillate between short and long eruptions.
```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(faithful) + geom_histogram(aes(x = eruptions))
```
To answer many of the follow up questions above, you will need to make a new graph that includes two or more variables and then look for:
To answer many of the follow up questions above, you will need to make a new graph that shows the relationship between two or more variables and then look for:
* *Patterns*
Patterns in your data provide clues about covariation. If a relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
+ Could this pattern be due to coincidence (i.e. random chance)?
+ How can you describe the relationship described by the pattern?
+ How can you describe the relationship implied by the pattern?
+ How strong is the relationship implied by the pattern?
+ What other variable might be involved in the relationship?
+ What other variables might affect the relationship?
+ Does the relationship change if you look at individual subgroups of the data?
Each of these questions is an example of the second general question that I proposed for EDA. Let's look at that question now.
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also reveals the two clusters that we noticed above.
```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting))
```
Questions about patterns are examples of the second general question that I proposed for EDA. Let's look at that question now.
## Covariation
> "What type of covariation occurs between my variables?"
If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a systematic way.
If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a correlated way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variable.
### Visualizaing covariation
The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variable.
#### Two categorical variables
Visualize covariation between categorical variables with `geom_count()`.
@ -216,7 +222,7 @@ ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
```
The size of each circle in the plot will display how many observations occurred at each combination of values. As with bar charts, you can calculate the specific values with `table()`. Covariation will appear as a strong correlation between specifc x values and specific y values.
The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specifc x values and specific y values. As with bar charts, you can calculate the specific values with `table()`.
```{r}
table(diamonds$color, diamonds$cut)
@ -224,70 +230,65 @@ table(diamonds$color, diamonds$cut)
#### One categorical variable and one continuous variable
Visualize covariation between continuous and categorical variables with boxplots. A **boxplot** is a type of visual shorthand for a distribution that is popular among statisticians. You make a boxplot with `geom_boxplot()`. The chart below shows several boxplots, one for each level of the cut variable. Each boxplot represents the distribution of depth values for points with the given level of cut.
```{r}
ggplot(data = diamonds) +
geom_boxplot(aes(x = cut, y = depth))
```
How should you interpret a boxplot? Each boxplot consists of:
Visualize covariation between continuous and categorical variables with boxplots. A **boxplot** is a type of visual shorthand for a distribution that is popular among statisticians. Each boxplot consists of:
* A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the Inter-Quartile Range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not it is symmetric about the median or skewed to one side.
* Points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points have a strong chance of being outliers, so they are included in the boxplot for inspection. Since diamonds is a large data set, quite a few points fall in this range.
* Points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points have a strong chance of being outliers, so they are included in the boxplot for inspection.
* A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
Boxplots make it especially easy to see if the locations or spreads of the distributions change across values. Simply compare the median lines of the distributions, or the width of the boxes. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median depth value of each group.
```{r}
ggplot(data = diamonds) +
geom_boxplot(aes(x = reorder(cut, depth, FUN = median), y = depth))
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-boxplot.pdf")
```
`geom_boxplot()` works best when the x variable is categorical, but if you wish to invert the axes, you can easily do so with `coord_flip()`.
The chart below shows several boxplots, one for each level of the class variable in the mpg data set. Each boxplot represents the distribution of hwy values for points with the given level of class. To make boxplots, use `geom_boxplot()`.
```{r}
ggplot(data = diamonds) +
geom_boxplot(aes(x = cut, y = depth)) +
ggplot(data = mpg) +
geom_boxplot(aes(x = class, y = hwy))
```
Covariation will appear as a systematic change in the medians or IQR's of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median hwy value of each group.
```{r}
ggplot(data = mpg) +
geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))
```
`geom_boxplot()` works best when the categorical variable is on the x axis. You can invert the axes with `coord_flip()`.
```{r}
ggplot(data = mpg) +
geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
```
`geom_violin()` provides an alternate version of a boxplot. In a violin plot, the width of the "box" displays a kernel density estimate of the shape of the distribution.
If you wish to add more information to your boxplots, use `geom_violin()`. In a violin plot, the width of the "box" displays a kernel density estimate of the shape of the distribution.
```{r}
ggplot(data = diamonds) +
geom_violin(aes(x = cut, y = depth)) +
ggplot(data = mpg) +
geom_violin(aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
```
#### Two continuous variables
Visualize covariation between two continuous variables with a scatterplot. Covariation will appear as a structure or pattern in the data points. For example, we saw in Chapter 1 that a positive relationship exists between the carat size and price of a diamond.
Visualize covariation between two continuous variables with a scatterplot, i.e. `geom_point()`. Covariation will appear as a structure or pattern in the data points. For example, a positive trend exists between the carat size and price of a diamond.
```{r}
ggplot(data = diamonds) +
geom_point(aes(x = carat, y = price))
```
The easiest relationship to spot between two variables is a straight line. Often you can get a sense of the exact non-linear relationship between two variables by "bending" the data into a straight line with `coord_trans()`. For example, the gamut of charts below suggests that price and carat size have a relationship of the form $price = carat^{\beta}$ (This type of relationship is straightened by a log-log transform).
```{r fig.show='hold', fig.width=3, fig.height=3}
p <- ggplot(data = diamonds) +
geom_point(aes(x = carat, y = price))
p + coord_trans(y = "sqrt")
p + coord_trans(y = "log10")
p + coord_trans(x = "log10", y = "log10")
```
Scatterplots become less useful as the size of your data set grows, because points begin to pile up into areas of uniform black. You can make patterns clear again with `geom_bin2d()`, `geom_hex()`, or `geom_density2d()`.
`geom_bin2d()` and `geom_hex()` both divide the coordinate plane into two dimensional bins and then use fill to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.
`geom_bin2d()` and `geom_hex()` divide the coordinate plane into two dimensional bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.
```{r fig.show='hold', fig.width=3, fig.height=3}
```{r fig.show='hold', fig.width=3}
ggplot(data = diamonds) +
geom_bin2d(aes(x = carat, y = price))
@ -295,7 +296,7 @@ ggplot(data = diamonds) +
geom_hex(aes(x = carat, y = price))
```
`geom_density2d()` fits a 2D kernel density estimation to the data and then uses contour lines to highlight areas of high density. It is very useful for overlaying on raw data.
`geom_density2d()` fits a 2D kernel density estimation to the data and then uses contour lines to highlight areas of high density. It is very useful for overlaying on raw data when your data set is not big.
```{r}
@ -304,9 +305,9 @@ ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
geom_density2d()
```
As you search for covariation, also keep an eye out for outliers and clusters. Two dimensional plots can reveal outliers and clusters that are not visible in one dimensional plots.
#### Follow up questions
For example, some points in the plot on the left have an unusual combination of $x$ and $y$ values, which makes them an outlier even though their $x$ and $y$ values appear common when examined separately.
Use the same follow up questions for plots of covariation that you sue for plots of variation. Outliers, clusters, and patterns all appear in two dimensional plots. In fact, two dimensional plots can reveal outliers and clusters that are not visible in one dimensional plots. For example, some points in the plot on the left have an unusual combination of $x$ and $y$ values, which makes them an outlier even though their $x$ and $y$ values appear normal when examined separately.
The two dimensional pattern in the plot on the right reveals two clusters, a separation that is not visible in the distribution of either variable by itself, as verified with a rug geom.
@ -317,23 +318,196 @@ ggplot(data = diamonds) +
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_jitter() +
geom_density2d() +
geom_rug(position = "jitter")
```
In general, outliers and clusters become easier to spot as you look at the interaction of more and more variables. However, as you include more variables in your plot, data becomes harder to visualize.
#### Three or more variables
You can extend scatterplots into three dimensions with the plotly, rgl, rglwidget, and threejs packages (among others). Each creates a "three dimensional," graph that you can rotate with your mouse. Below is an example from plotly.
You can extend scatterplots into three dimensions with the plotly, rgl, rglwidget, and threejs packages (among others). Each creates a "three dimensional," graph that you can rotate with your mouse. Below is an example from plotly, displayed as a static image.
```{r eval = FALSE}
library(plotly)
plot_ly(data = iris, x = Sepal.Length, y = Sepal.Width, z = Petal.Width, color = Species, type = "scatter3d", mode = "markers")
```
![](images/EDA-plotly.png)
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-plotly.png")
```
You can extend this approach into n-dimensional hyperspace with the ggobi package, but you will soon notice a weakness of multidimensional graphs. You can only visualize multidimensional space by projecting it onto your two dimensional retina. In the case of 3D graphics, you can combine 2D projections with rotation to create an intuitive illusion of space, but the illusion ceases to be intuitive as soon as you add a fourth dimension.
You can extend this approach into n-dimensional hyperspace with the ggobi package, but you will soon notice a weakness of multidimensional graphs. You can only visualize multidimensional space by projecting it onto your two dimensional retinas. In the case of 3D graphics, you can combine 2D projections with rotation to create an intuitive illusion of 3D space, but the illusion ceases to be intuitive as soon as you add a fourth dimension.
You can use aesthetics such as color and size to add a third or fourth variable to a two dimensional graph. You can also explore complex relationships two variables at a time. If you need to investigate the relationship between mutliple variables at the saem time, I recommend that you use modelling techniques, which we will discuss soon.
This doesn't mean that you should ignore complex interactions in your data. You can explore multivariate relationships in several ways. For example, you can visualize each combination of variables in the relationship, two at a time. Or you could use aesthetics and facetting to add additional variables to a 2D plot. Or you could calculate summaries that help you understand the relationship. Two types of summaries are particularly useful for EDA: summary statistics and models.
## Summary Statistics
You can use summary statistics to answer questions about the variation and covariation in your data. To do this, use dplyr's `summarise()` with a function that calculates summary statistics.
```{r}
diamonds %>%
summarise(avg_price = mean(price), sd_price = sd(price))
```
The measures of location and measures of spread listed in Chapter 3 are particularly useful for summaries; as is `cor()`, a function that calculates the correlation coefficient between two numerical variables.
The correlation coefficient measures the strength of the _linear_ relationship between two variables. Correlation coefficients can range from -1 to 1. A value near -1 or 1 describes a strong relationship between two variables. A value near zero describes a weak relationship, or no relationship. A negative correlation coefficient indicates that the variables are inversely related.
```{r}
cor(diamonds$price, diamonds$carat)
```
Summary statistics provide useful information. They are easy to compare against each other and easy to use in conversation. However, I do not recommend that you use summary statistics without also using visualizations. A visualization can prompt you to notice something that you did not expect to see. A summary statistic can only provide information that you intend to see.
## Models
> "Essentially, all models are wrong, but some are useful."---George Box
Models are a type of summary that describes the relationships in your data. You can use models to reveal unexpected covariation in your data or to describe relationships that you cannot easily visualize.
It is important to keep in mind that a model is just a description of your data. In my experience, you can easily overthink models because models are tied into the logic of statistical inference: if a model describes your data accurately _and_ your data is similar to the world at large, then your model should describe the world at large accurately. This chain of reasoning provides a basis for using models to make predictions, and we will examine it in Part 4. For now, we are only going to use models as descriptions that can use to help you discover insights in your data.
The most useful type of model for explorating your data is a linear model. A **linear model** describes the value of one variable as a linear combination of the values of other variables. This means that a linear model would describe the relationship between two variables as a line (below left), the relationship between three variables as a plane (below right), and the relationship between $n$ variables as a $n - 1$ dimensional hyperplane.
```{r, echo = FALSE}
knitr::include_graphics("images/EDA-plotly.png")
```
Linear models are useful because they are easy to interpret and because linear functions are a reasonable approximation to non-linear relationships over short distances (this is the basis of the taylor series approximation). You can also turn many non-linear relationships into linear relationships by transfroming the variables involved.
You can represent a linear model with the formula
$$Y = \alpha + \beta_{1} X_{1} + ... + \beta_{n} X_{n} + \epsilon$$
The first portion of the formula is the equation of the line (or plane or hyperplane) that "best" fits the data. The last variable in the formula, $\epsilon$, is an error term that has an unknown, random value. As a result, it is easy to interpret a linear model. A linear model says that when an observation has the $X$ values $x_{1}, ..., x_{n}$, it will have the $Y$ value $y$, where $y$ equals the point on the line (or plane or hyperplane) at $(x_{1}, ... x_{n})$ plus or minus some random error, $\epsilon$.
If all of the points in your data set fall close to the model line (or plane or hyperplane) then you can assume that the random error represented by $\epsilon$ is small. If they fall far from the line (or plane or hyperplane), then you can assume that the random error is large. The algorithm that you use to fit your linear model automatically chooses the line (or plane or hyperplane) that minimizes the distance between your data points and the model line (or plane or hyperplane). Technically, it minimizes the sum squared distances between each point and the model surface.
### Fitting a linear model
To fit a linear model to your data, use the `lm()` function that comes with R. Then save the output to an object that you can access later.
```{r}
faithful_mod <- lm(eruptions ~ waiting, data = faithful)
diamonds_mod <- lm(price ~ carat + cut, data = diamonds)
```
The first argument of `lm()` should be a formula, two or more variables separated by a `~`. You've seen formulas before, we used them in Chapter 2 to facet graphs.
The left hand side of the formula should contain a single variable name. This will be the response variable in your model. If you think of your model as a function, the explanatory variable would be the variable on the $y$ axis. The right hand side of the formula should contain the names of the variables related to the response variable. If you think of your model as a function, these would be the variables on the $x$ axes. If you include more than one variable on the right hand side of the formula, separate their names with a `+`, e.g. `price ~ carat + cut + color + clarity`.
The second argument of `lm` should be the name of the data set that contains the variables in the formula.
Once you've made your model, you can see the model equation by typing the name of the model object at the command line.
```{r}
faithful_mod
```
The output above shows that the linear model that best fits the Old Faithful data is
$$\text{eruption} = -1.87 + 0.07 \time \text{waiting} + \epsilon$$
You can interpret the coefficient of each $X_{i}$ variable in your model as the amount of change in $Y$ that is associated with a one unit change in $X_{i}$ _when the values of any other $X$ variables in the model are held constant.
If you like, you can return the coefficients in a data frame with the `tidy()` function in the broom package. For now, ignore the last three columns.
```{r}
library(broom)
tidy(faithful_mod)
```
### Visualizing a linear model
If your model only contains one variable, like our faithful mod, you can visualize its results with `geom_smooth()`. To do this, plot your model's $Y$ variable on the y axis, plot your model's $X$ variable on the $x$ axis, and set the method argument of `geom_smooth()` to `lm`, the name of the function that R should use to fit the model. I also like to set `se = FALSE` to remove the standard error ribbon that `geom_smooth()` displays by default. We'll look at standard errors in Part 4.
```{r}
ggplot(data = faithful, aes(x = waiting, y = eruptions)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
```
If your model contains multiple $X$ variables, the easiest way to visualize it will be to plot the fitted (i.e predicted) values of the model against various combinations of the explanatory variables. The fitted values are the $Y$ values along the model line (or plane or hyperplane) that correspond to each observation in your data set. You can access the fitted values with broom's `augment()` function.
```{r}
diamonds_fit <- augment(diamonds_mod)
head(diamonds_fit)
```
`augment()` returns a data frame that makes it easy to plot the fitted values in relation to the explanatory and response variables.
```{r}
ggplot(data = diamonds_fit, aes(x = carat, y = price, color = cut)) +
geom_point(alpha = 0.1) +
geom_line(aes(y = .fitted)) +
theme_bw()
```
### Modeling categorical variables
If you add a categorical variable to your model, like `diamonds$cut`, R will fit a separate line for each level of the variable, as you see above. It will also add each level of the variable, except one, to the model formula, e.g.
```{r}
diamonds_mod
```
If your categorical variable happens to be saved in R as an ordered factor, the model results will be displayed in a format that is unnecessarily hard to interpret (R turns ordered variables into a set of contrasts and then fits a model to the contrasts). This is what has happened above.
```{r}
class(diamonds$cut)
```
In all other cases, the results of modeling categorical variables are easy to interpret. If you have an ordered factor, I recommend that you change it to a regular factor before fitting your model. This will not noticeably change the output of your model, and it will free your model from a set of questionable assumptions (that the levels of the ordered factor are equidistant from each other in magnitude).
```{r}
diamonds$cut2 <- factor(diamonds$cut, ordered = FALSE)
diamonds_mod <- lm(price ~ carat + cut2, data = diamonds)
diamonds_mod
```
Now the model contains a coefficient and a dummy variable for each level of the categorical variable except one. To interpret the model, treat the dummy variables as indicator variables that take the value of 1 if an observation has the associated level of cut, and 0 if the observation does not. The effect of each dummy variable is to shift the line up or down to best fit observations with the given level.
So the model predicts that Fair cut diamonds have a price of
$$\text{Price} = -3875 + 7871 \times \text{carat} + \epsilon$$
It predicts that Good cut diamonds have a price of
$$\text{Price} = -3875 + 7871 \times \text{carat} + 1120 + \epsilon$$
It predicts that Very Good cut diamonds have a price of
$$\text{Price} = -3875 + 7871 \times \text{carat} + 1510 + \epsilon$$
and so on.
### Follow up questions
Now that you can fit a linear model to your data, how can you use them to make discoveries and reveal insights? The most useful information in your model will come from the model's coefficients and the model's residuals. The residuals of the model are the distances between each observed $Y$ value and each fitted $Y$ value. You can access the residuals of your model with the `augment()` function. The are saved in the `.resid` column.
```{r}
diamonds_fit <- augment(diamonds_mod)
head(diamonds_fit)
```
* *Size of coefficient*
* *Sign of coefficient*
* *Changes between models*
* *Outliers in residuals*
* *Clusters in residuals*
* *Patterns in residuals*
## Summary
What can go wrong? Bias and variance.
The structure of data, scientific laws, and the data analysis process. Segue to the rest of the book.
Using group_by and summarise to create new levels of observations