More writing on variation.

Garrett 2016-05-21 22:14:30 -04:00
parent 671f8800b4
commit 5265fd767e
1 changed file with 131 additions and 203 deletions

# Exploratory Data Analysis (EDA)
```{r include = FALSE}
library(ggplot2)
library(dplyr)
library(broom)
knitr::opts_chunk$set(fig.height = 2)
```
This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call Exploratory Data Analysis, or EDA for short. EDA involves iteratively
1. forming questions about your data
2. searching for answers by visualizing and transforming your data
3. using what you discover to refine your questions about the data, or to choose new questions to investigate
There is no formal way to do Exploratory Data Analysis because you must be free to investigate every idea that occurs to you. However, some tactics will reliably lead to insights. This chapter will teach you a basic toolkit of these useful EDA techniques. Our discussion will lead to a model of data science itself, the model that I've built this book around.
## Questions
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise."---John Tukey
Your goal during EDA is to develop a complete understanding of your data set and the information that it contains. The easiest way to do this is to ask questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your data set and helps you decide which graphs or models to make.
During EDA, the _quantity_ of questions that you ask matters more than the quality of the questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your data set. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought provoking questions---if you follow up each question with a new question based on what you find.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as
1. What type of **variation** occurs **within** my variables? and
2. What type of **covariation** occurs **between** my variables?
The rest of this chapter will look at these two questions. I'll explain what variation and covariation are, and I'll show you several ways to answer each question. To make the discussion easier, let's define some terms:
* A _variable_ is a quantity, quality, or property that you can measure.
* A _value_ is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
* An _observation_ is a set of measurements that you make under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I'll sometimes refer to an observation as a data point.
* _Tabular data_ is a set of values, each associated with a variable and an observation. Tabular data is _tidy_ if each value is placed in its own "cell", each variable in its own column, and each observation in its own row. Throughout the rest of this chapter I will use the word data to mean tidy tabular data.
## Variation
> "What type of variation occurs within my variables?"
**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, and precisely enough, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light (below).

```{r echo = FALSE}
mat <- as.data.frame(matrix(morley$Speed + 299000, ncol = 10))
knitr::kable(mat, caption = "*The speed of light is a universal constant, but variation due to measurement error obscures its value. In 1879, Albert Michelson measured the speed of light 100 times and observed 30 different values (in km/sec).*", col.names = rep("", ncol(mat)))
```
Discrete and categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the values that you have observed for the variable.
### Visualizing distributions
To examine the distribution of a continuous variable, use a histogram:

```{r}
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.5)
```
A histogram divides the x axis into equally spaced intervals and then uses a bar to display how many observations fall into each interval. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the $x$ axis. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks with a binwidth of 0.01.
```{r message = FALSE}
ggplot(data = diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.01)
```
If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` or `geom_density()` instead of `geom_histogram()`. `geom_freqpoly()` makes a frequency polygon, a line that connects the tops of the bars that would appear in a histogram. Like `geom_histogram()`, `geom_freqpoly()` accepts a binwidth argument.
`geom_density()` plots a one dimensional kernel density estimate of a variable's distribution. The result is a smooth version of the information contained in a histogram or a frequency polygon. You can control the smoothness of the density with `adjust`. `geom_density()` displays _density_---not count---on the y axis; the area under each curve will be normalized to one, no matter how many total observations occur in the subgroup, which makes it easier to compare subgroups.
```{r message = FALSE, fig.show='hold', fig.width=3}
zoom <- coord_cartesian(xlim = c(55, 70))

ggplot(data = diamonds) +
  geom_freqpoly(aes(x = depth, color = cut), binwidth = 0.2) +
  zoom

ggplot(data = diamonds) +
  geom_density(aes(x = depth, color = cut), adjust = 3) +
  zoom
```
### Asking questions about variation
Now that you can visualize variation, what should you look for in your plots? And what type of follow up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow up questions for each type of information. The key to asking good follow up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
* *Typical values*
Tall bars reveal the common values of a variable. Ask yourself:

+ Which values are the most common? Why might that be?
+ Which values are the most rare? Why might that be?
+ Is there an unusual pattern in the distribution? Why might that be?
+ Do the typical values change if you look at subgroups of the data?
* *Range of values*
The range of values, or spread, of the distribution reveals how certain you can be when you make predictions about a variable. If the variable only takes a narrow set of values, like below, you are unlikely to be far off if you make a prediction about a future observation. Even if the observation takes a value at the distant extreme of the distribution, the value will not be far from your guess.
```{r echo = FALSE, message = FALSE, fig.height = 2}
mpg$hwy2 <- mpg$hwy / 10 + 22
ggplot(mpg) + geom_histogram(aes(x = hwy2), binwidth = 1) + xlim(10, 45)
```

But if the variable takes a wide range of values, like below, a prediction about a future observation could be far from its true value.

```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(mpg) + geom_histogram(aes(x = hwy), binwidth = 1) + xlim(10, 45)
```
As a quick rule, wide distributions imply less certainty when making predictions about a variable; narrow distributions imply more certainty. A distribution with only a single repeated value implies complete certainty: your variable is a constant. Ask yourself
+ Do your data show a surprising amount of certainty or uncertainty? Why?
* *Clusters*

Clusters of similar values suggest that subgroups exist in your data. For example, the eruption lengths of the Old Faithful geyser seem to cluster into two groups: short eruptions and long eruptions.

```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(faithful) + geom_histogram(aes(x = eruptions))
```
Many of the questions above will prompt you to explore a relationship *between* variables, for example, to see if the values of one variable can explain the behavior of another variable. Questions about relationships are examples of the second general question that I proposed for EDA. Let's look at that question now.
## Covariation
> "What type of covariation occurs between my variables?"
If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a correlated way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.
### Visualizing two categorical variables
Visualize covariation between categorical variables with `geom_count()`.
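For example, here is a minimal sketch with the diamonds data (this particular pairing of variables is my illustration, not a prescription):

```{r}
ggplot(data = diamonds) +
  geom_count(aes(x = cut, y = color))
```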
The size of each circle in the plot displays how many observations occurred at each combination of values. You can also compute the counts directly with `table()`:

```{r}
table(diamonds$color, diamonds$cut)
```
### Visualizing one categorical variable and one continuous variable
Visualize covariation between continuous and categorical variables with boxplots. A **boxplot** is a type of visual shorthand for a distribution that is popular among statisticians. Each boxplot consists of:

* a box that stretches from the 25th percentile of the distribution to the 75th percentile, with a line in the middle of the box that displays the median; the width of the box is known as the Interquartile Range (IQR)
* points that display observations falling more than 1.5 times the IQR from either edge of the box; these are the values most likely to be outliers
* a line (or whisker) that extends from each end of the box to the farthest non-outlier point in the distribution

```{r}
ggplot(data = mpg) +
  geom_boxplot(aes(x = class, y = hwy))
```
Covariation will appear as a systematic change in the medians or IQR's of the boxplots. To make the trend easier to see, wrap the $x$ variable with `reorder()`. The code below reorders the x axis based on the median hwy value of each group.
```{r}
ggplot(data = mpg) +
  geom_boxplot(aes(x = reorder(class, hwy, FUN = median), y = hwy))
```
### Visualizing two continuous variables
Visualize covariation between two continuous variables with a scatterplot, i.e. `geom_point()`. Covariation will appear as a structure or pattern in the data points. For example, an exponential relationship seems to exist between the carat size and price of a diamond.
```{r}
ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price))
```
Scatterplots become less useful as the size of your data set grows, because points begin to pile up into areas of uniform black (as above). You can make patterns clear again with `geom_bin2d()`, `geom_hex()`, or `geom_density2d()`.
`geom_bin2d()` and `geom_hex()` divide the coordinate plane into two dimensional bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.
```{r}
ggplot(data = diamonds) +
  geom_bin2d(aes(x = carat, y = price))

# install.packages("hexbin")
ggplot(data = diamonds) +
  geom_hex(aes(x = carat, y = price))
```
`geom_density2d()` fits a two dimensional kernel density estimate to the data and then displays the result with contour lines, like a topographic map.

```{r}
ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
  geom_density2d()
```
### Asking questions about covariation
When you explore plots of covariation, look for the following sources of insight:
* *Outliers*
Two dimensional plots can reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of $x$ and $y$ values, which makes the points outliers even though their $x$ and $y$ values appear normal when examined separately.
```{r echo = FALSE}
ggplot(data = diamonds) +
  geom_point(aes(x = x, y = y)) +
  coord_cartesian(xlim = c(3, 12), ylim = c(3, 12))
```
* *Clusters*
Plots of covariation can also reveal clusters that may not be visible in plots of variation. For example, the two dimensional pattern in the plot below reveals two clusters, a separation that is not visible in the distribution of either variable by itself, as verified with a rug geom.
```{r echo = FALSE}
ggplot(data = iris, aes(y = Sepal.Length, x = Sepal.Width)) +
  geom_jitter() +
  geom_density2d(h = c(1, 1)) +
  geom_rug(position = "jitter")
```
* *Patterns*
Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
+ Could this pattern be due to coincidence (i.e. random chance)?
+ How can you describe the relationship implied by the pattern?
+ How strong is the relationship implied by the pattern?
+ What other variables might affect the relationship?
+ Does the relationship change if you look at individual subgroups of the data?
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also reveals the two clusters that we noticed above.
```{r echo = FALSE, message = FALSE, fig.height = 2}
ggplot(faithful) + geom_point(aes(x = eruptions, y = waiting))
```
Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.
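As a quick numeric companion to this idea, a correlation coefficient summarizes the strength of a linear association. This aside is my sketch, not part of the original discussion:

```{r}
# Values near 1 or -1 indicate strong linear covariation; values near 0
# indicate weak covariation or none
cor(faithful$waiting, faithful$eruptions)
```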
### Visualizing three or more variables
In general, outliers, clusters, and patterns become easier to spot as you look at the interaction of more and more variables. However, as you include more variables in your plot, data becomes harder to visualize.
You can extend scatterplots into three dimensions with the plotly, rgl, rglwidget, and threejs packages (among others). Each creates a "three dimensional" graph that you can rotate with your mouse. Below is an example from plotly, displayed as a static image.
```{r echo = FALSE}
knitr::include_graphics("images/EDA-plotly.png")
```
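For reference, a call along these lines could produce such a figure. This is a hedged sketch: `plot_ly()`'s argument style varies across plotly versions, and the variables here are placeholders of my choosing.

```{r eval = FALSE}
library(plotly)
# A hypothetical 3D scatterplot; rotate the result with your mouse
plot_ly(iris, x = ~Sepal.Length, y = ~Sepal.Width, z = ~Petal.Length,
        type = "scatter3d", mode = "markers")
```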
You can extend this approach into n-dimensional hyperspace with the ggobi package, but you will soon notice a weakness of multidimensional graphs. You can only visualize multidimensional space by projecting it onto your two dimensional retinas. In the case of 3D graphics, you can combine 2D projections with rotation to create an intuitive illusion of 3D space, but the illusion ceases to be intuitive as soon as you add a fourth dimension.
This doesn't mean that you should ignore complex interactions in your data. You can explore multivariate relationships in several ways. You can
* visualize each combination of variables in a multivariate relationship, two at a time
* use aesthetics and facetting to add additional variables to a 2D plot (see the sketch after this list)
* use a clustering algorithm to spot clusters in multivariate space
* use a modeling algorithm to spot patterns and outliers in multivariate space
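Here is a minimal sketch of the aesthetics-and-facetting strategy from the list above; the variable choices are mine, for illustration only:

```{r}
# color adds a third variable to the 2D plot; facets add a fourth
ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price, color = cut)) +
  facet_wrap(~ clarity)
```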
## Clusters
A clustering algorithm computes the distances between data points in n-dimensional space, and then groups points into clusters based on how near or far they are from each other.
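As a sketch of the idea, base R's `kmeans()` groups observations by distance. The variables, the scaling, and the choice of three centers below are illustrative assumptions, not the chapter's own example:

```{r}
# Group diamonds into three clusters based on (scaled) carat and price
km <- kmeans(scale(diamonds[, c("carat", "price")]), centers = 3)
ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price, color = factor(km$cluster)))
```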
## Models
> "Essentially, all models are wrong, but some are useful."---George Box
A model is a type of summary that describes the relationships in your data. You can use a model to reveal patterns and outliers that only appear in n-dimensional space. To see how this works, consider the simple linear model below. I've fit it to a two dimensional pattern so we can visualize the results.
```{r echo = FALSE}
diamonds2 <- filter(diamonds, x > 3, y > 3, y < 12)
diamond_mod <- lm(y ~ x, data = diamonds2)
resids <- augment(diamond_mod)
diamonds3 <- bind_rows(filter(resids, abs(.resid) > 0.5),
  sample_n(filter(resids, abs(.resid) <= 0.5), 1000)) %>%
  select(x, y)

ggplot(diamonds3, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
```
The model describes the relationship between $x$ and $y$ as
$$\hat{y} = 0.13 + 0.98 x$$
which is the equation of the blue model line in the graph above. Even if we did not have the graph, we could use the model coefficients in the equation above to determine that a positive relationship exists between $y$ and $x$, such that a one unit increase in $x$ is associated with an approximately one unit increase in $y$. We could use a model statistic, such as adjusted $r^{2}$, to determine that the relationship is very strong (here adjusted $r^{2} = 0.99$).
Finally, we could spot outliers in our data by examining the residuals of the model, which are the distances between the actual $y$ values of our data points and the $y$ values that the model would predict for the data points. Observations that are outliers in n-dimensional space will have a residual that is an outlier in one dimensional space. You can find these outliers by plotting a histogram of the residuals or by visualizing the residuals against any variable in a two dimensional plot.
```{r echo = FALSE, fig.width = 3, fig.show='hold'}
diamond_mod <- lm(y ~ x, data = diamonds3)
resids <- augment(diamond_mod)
ggplot(resids) +
  geom_histogram(aes(x = .resid), binwidth = 0.1)

ggplot(resids) +
  geom_point(aes(x = x, y = .resid))
```
You can examine coefficients, model statistics, and residuals of a model fit to an n-dimensional relationship in the same way, without visualizing the raw data in n-dimensional space.
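For instance, broom's helpers can extract those summaries from a fitted model without any plotting. This is a sketch; `diamond_mod` is the model fit in the hidden chunk above:

```{r}
tidy(diamond_mod)   # coefficients, one row per term
glance(diamond_mod) # model statistics, including adjusted r-squared
```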
I'll postpone teaching you how to fit and interpret models with R until Part 4. Although models are something simple, a description of your data set, they are tied into the logic of statistical inference: if a model describes your data accurately _and_ your data is similar to the world at large, then your model should describe the world at large accurately. This chain of reasoning provides a basis for using models to make inferences and predictions. You'll be able to do more with models if you learn a few more skills before you begin to model data.
## A last word on variables and observations
Every data set contains more information than it displays. You can use the values in your data to calculate new variables, as well as new group-level observations. This section will show you how to calculate new variables and observations, which you can use in visualizations, clustering algorithms, and modeling algorithms.
### Making new variables
Use dplyr's `mutate()` function to calculate new variables from your existing variables.
```{r}
diamonds %>%
  mutate(volume = x * y * z) %>%
  head()
```
The window functions from Chapter 3 are particularly useful for calculating new variables. To calculate a variable from two or more variables, use basic operators or the `map2()`, `map3()`, and `map_n()` functions from purrr. You will learn more about purrr in Chapter ?.
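As a small sketch of the window-function idea (the variable and the ranking choice are mine):

```{r}
# min_rank() is a dplyr window function: rank each diamond by its price
diamonds %>%
  mutate(price_rank = min_rank(price)) %>%
  head()
```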
### Making new observations
If your data set contains subgroups, you can derive a new data set from it of observations that describe the subgroups. To do this, first use dplyr's `group_by()` function to group the data into subgroups. Then use dplyr's `summarise()` function to calculate group level values. The measures of location, rank and spread listed in Chapter 3 are particularly useful for describing subgroups.
```{r}
mpg %>%
  group_by(class) %>%
  summarise(n_obs = n(), avg_hwy = mean(hwy), sd_hwy = sd(hwy))
```
## Summary
Data is difficult to comprehend, which means that you need to visualize, model, and transform it.
Once you comprehend the information in your data, you can make inferences from your data.
But all of this will involve a computer. To make headway, you will need to know how to program in a computer language (R), import data to use with that language, and tidy the data into the format that works best for that language.
When you are finished you will want to report and reproduce your results.