# Model basics
## Introduction
The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
SEGUE
This chapter of the book is unique because it only uses simulated datasets. These datasets are very simple, and not at all interesting, but they will help you understand the essence of modelling before you apply the same techniques to real data in the next chapter.
SEGUE
A fitted model is a function: you give it a set of inputs (often called the predictors), and it returns a value (typically called the prediction). Fitting a model takes place in two steps:
1. You define a __family of models__ that you think will capture an
interesting pattern in your data. This class will look like a mathematical
formula that relates your predictors to the response. For a simple 1d
model, this might look like `y = a * x + b` or `y = a * x ^ b`.
1. You generate a __fitted model__ by finding the one model from the family
that is the closest to your data. This takes the generic model family
and makes it specific, like `y = 3 * x + 7` or `y = 9 * x ^ 2`.
It's important to understand that a fitted model is just the closest model from a family of models. That implies that you have the "best" model (according to some criteria); it doesn't imply that you have a good model and it certainly doesn't imply that the model is "true". George Box puts this well in his famous aphorism that you've probably heard before:
> All models are wrong, but some are useful.
It's worth reading the fuller context of the quote:
> Now it would be very remarkable if any system existing in the real world
> could be exactly represented by any simple model. However, cunningly chosen
> parsimonious models often do provide remarkably useful approximations. For
> example, the law PV = RT relating pressure P, volume V and temperature T of
> an "ideal" gas via a constant R is not exactly true for any real gas, but it
> frequently provides a useful approximation and furthermore its structure is
> informative since it springs from a physical view of the behavior of gas
> molecules.
>
> For such a model there is no need to ask the question "Is the model true?".
> If "truth" is to be the "whole truth" the answer must be "No". The only
> question of interest is "Is the model illuminating and useful?".
The goal of a model is not to uncover Truth, but to discover a simple approximation that is still useful. This tension is at the heart of modelling: you want your model to be simpler than reality (so you can understand it more easily) while still generating good predictions. There is almost always a tradeoff between understandability and quality of predictions.
### Prerequisites
We need a couple of packages specifically designed for modelling, and all the packages you've used before for EDA. I also recommend setting some options that tweak the default linear model behaviour to be a little less surprising.
```{r setup, message = FALSE}
# Modelling functions
library(modelr)
library(broom)
# Modelling requires plenty of visualisation and data manipulation
library(ggplot2)
library(dplyr)
library(tidyr)
# Options that make your life easier
options(
  contrasts = c("contr.treatment", "contr.treatment"),
  na.action = "na.exclude"
)
```
### Simulated datasets
(to be moved into modelr)
```{r}
# 1 con
sim1 <- tibble(
  x = rep(1:10, each = 3),
  y = 5 + x * 2 + rnorm(length(x), sd = 2)
)
ggplot(sim1) + geom_point(aes(x, y))
# 1 cat
means <- c(a = 1, b = 3, c = 6, d = 2)
sim2 <- tibble(
  x = rep(names(means), each = 10),
  y = means[x] + rnorm(length(x))
)
ggplot(sim2) + geom_point(aes(x, y))
# 1 cat + 1 con
ints <- c(a = 1, b = 3, c = 6, d = 2)
sim3 <-
  tidyr::crossing(x1 = 1:10, x2 = names(ints), rep = 1:3) %>%
  mutate(y = ints[x2] + 0.5 * x1 + rnorm(length(x1), sd = 2))
ggplot(sim3) + geom_point(aes(x1, y, colour = x2))
# 1 cat + 1 con with interaction
slopes <- c(a = -0.2, b = -0.4, c = -0.1, d = 0.2)
sim4 <-
  tidyr::crossing(x1 = 1:10, x2 = names(ints), rep = 1:3) %>%
  mutate(y = ints[x2] + slopes[x2] * x1 + rnorm(length(x1), sd = 2))
ggplot(sim4) + geom_point(aes(x1, y, colour = x2))
# 2 con
sim5 <-
  tidyr::crossing(x1 = 1:10, x2 = 1:10) %>%
  mutate(y = 2 * x1 - 3 * x2 + rnorm(length(x1), sd = 3))
ggplot(sim5) + geom_raster(aes(x1, x2, fill = y))
# splines
```
## A simple model
Let's take a look at the simulated dataset `sim1`. It contains two continuous variables, `x` and `y`, which are related in a fairly straightforward way.
```{r}
ggplot(sim1, aes(x, y)) +
  geom_point()
```
You can see a strong pattern in the data, and we can use a model to capture it. This data looks like it comes from a linear family, i.e. `y = a * x + b`. Let's start by randomly generating a few models from that family and overlaying them on the data:
```{r}
models <- tibble(
  a = runif(250, -20, 80),
  b = runif(250, -5, 5)
)

ggplot(sim1, aes(x, y)) +
  geom_abline(aes(intercept = a, slope = b), data = models, alpha = 1/4) +
  geom_point()
```
There are an infinite number of possible models, but some are obviously better than others. Intuitively, it seems reasonable that a good model should generate a line that's close to the data. How can we make that intuition concrete?
First, we need a way to generate the predicted y values given a model and some data:
```{r}
make_prediction <- function(mod, data) {
  data$x * mod$b + mod$a
}
make_prediction(list(a = 1, b = 3), sim1)
```
Next, we need some way to measure the distance between the predicted and the actual values. A reasonable place to start is the root-mean-squared deviation, a close relative of Euclidean distance.
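In symbols, for $n$ observations and a candidate model with intercept $a$ and slope $b$, that distance is

$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - (a + b \, x_i)\right)^2}$$

which is exactly what the following function computes: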
```{r}
measure_distance <- function(mod, data) {
  diff <- data$y - make_prediction(mod, data)
  sqrt(mean(diff ^ 2))
}
measure_distance(list(a = 1, b = 3), sim1)
```
Now we can use purrr to compute the distance for all the models defined above.
```{r}
sim1_dist <- function(a, b) {
  mod <- list(a = a, b = b)
  measure_distance(mod, sim1)
}
models <- models %>% mutate(dist = purrr::map2_dbl(a, b, sim1_dist))
models
```
And then overlay the 10 best models on the data. I've coloured the models by `-dist`: this is an easy way to make sure that the best models (i.e. the ones with the smallest distance) are coloured brightly.
```{r}
ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(
    aes(intercept = a, slope = b, colour = -dist),
    data = filter(models, rank(dist) <= 10)
  )
```
Another way to think about these models is to look at the 2d grid formed by the values of `a` and `b` that we sampled:
```{r}
ggplot(models, aes(a, b)) +
  geom_point(aes(colour = -dist))
```
Instead of trying lots of random models, we could be more systematic and generate an evenly spaced grid of points, a so-called grid search:
```{r}
grid <- expand.grid(
  a = 5 + seq(-5, 5, length = 25),
  b = 2 + seq(-5, 5, length = 25) * 0.15
) %>%
  mutate(dist = purrr::map2_dbl(a, b, sim1_dist))

grid %>%
  ggplot(aes(a, b)) +
  geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))
```
All those models look pretty good:
```{r}
ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(
    aes(intercept = a, slope = b, colour = -dist),
    data = filter(grid, rank(dist) <= 10)
  )
```
You could imagine iteratively making the grid finer and finer until you narrowed in on the best model. A more efficient approach is to use a numerical minimisation tool called a Newton-Raphson search. The intuition of Newton-Raphson is pretty simple: you pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and then repeat.
```{r}
best <- optim(c(0, 0), function(x) sim1_dist(x[1], x[2]))
best$par
ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(intercept = best$par[1], slope = best$par[2])
```
You can use that basic approach for any family of models that you can write an equation for. However, the specific model we fit here is part of a broader class called linear models, and we can use the `lm()` function to solve this specific problem. `lm()` has a special way to specify the model family: a formula like `y ~ x`, which `lm()` translates to `y = a * x + b`. We can fit the model and look at the output:
```{r}
sim1_mod <- lm(y ~ x, data = sim1)
summary(sim1_mod)
```
Generally, model-fitting functions do something more sophisticated than a numerical search: for linear models they rely on linear algebra. Using the connection between geometry, calculus, and linear algebra, the linear model closest to the data can always be found by constructing and then solving a system of matrix equations.
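If you're curious, here's a minimal sketch of that idea: build the model matrix and solve the normal equations directly. (`lm()` itself uses a more numerically stable QR decomposition, but the underlying linear algebra is the same.)

```{r}
# A column of 1s (intercept) plus x, then solve (X'X) b = X'y
X <- model.matrix(y ~ x, data = sim1)
solve(t(X) %*% X, t(X) %*% sim1$y)

# The coefficients from lm() match
coef(sim1_mod)
```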
## Linear models
For simple models, like this one, you can figure out what the model says about the data by carefully studying the coefficients. If you ever take a formal statistics course on modelling, you'll spend a lot of time doing that. Here, however, we're going to take a different tack: we're going to focus on understanding a model by looking at its predictions. This has a big advantage: every type of model makes predictions (otherwise what use would it be?), so we can use the same set of techniques to understand simple linear models or complex random forests. We'll see that advantage later on when we explore some other families of models. To understand a model, we will:
1. Make predictions across a uniform grid to understand the "shape" of
the model.
1. Subtract predictions from the actual values to look at the residuals
and to learn what the model misses.
1. Create succinct numerical summaries when you need to compare
many models.
### Predictions
To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies. The easiest way to do that is to use `tidyr::expand()`. Its first argument is a data frame, and for each subsequent argument it finds the unique values and then generates all combinations:
```{r}
grid <- sim1 %>% expand(x)
grid
```
(This will get more interesting when we start to add more variables to our model.)
Next we add predictions. We'll use `modelr::add_predictions()`, which works in much the same way as `add_residuals()` (which we'll meet shortly), but computes predictions, so it doesn't need a data frame that contains the response variable:
```{r}
grid <- grid %>% add_predictions(sim1_mod, "y")
grid
```
And then we plot the predictions. Here the choice of plot is pretty obvious because we only have two variables. In general, however, figuring out how to visualise the predictions can be quite challenging, and you'll often need to try a few alternatives before you get the most useful plot. For more complex models, you're likely to need multiple plots, not just one.
```{r}
ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_line(data = grid, colour = "red", size = 1)
```
This is exactly what `geom_smooth()` does for you behind the scenes!
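For comparison, here's the one-liner version: `geom_smooth()` with `method = "lm"` fits the same model and draws its predictions for you.

```{r}
ggplot(sim1, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "red")
```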
### Residuals
The flip-side of predictions is residuals. The predictions tell you what the model is doing; the residuals tell you what the model is missing. The residuals are just the distances between the observed and predicted values that we computed above!
We can compute residuals with `add_residuals()`. Note that when computing residuals you use the original dataset, not a manufactured grid. Otherwise, where would you get the values of the response?
```{r}
sim1 <- sim1 %>% add_residuals(sim1_mod)
```
There are a few different ways to understand what the residuals tell us about the model. One way is to simply draw a frequency polygon to help us understand the spread of the residuals:
```{r}
ggplot(sim1, aes(resid)) +
  geom_freqpoly(binwidth = 0.5)
```
(I prefer the frequency polygon over the histogram here because it makes it easier to display other categorical variables on the same plot.)
For many problems, the sign of the residual (i.e. whether the prediction is too high or too low) isn't important, and you might just want to focus on the magnitude of the residuals. You can do that by plotting the absolute value:
```{r}
ggplot(sim1, aes(abs(resid))) +
  geom_freqpoly(binwidth = 0.25, boundary = 0)
```
For real data, you'll also explore how the residuals vary with other variables in the data. You'll see that in the next chapter.
### Exercises
1. In the plot of the absolute residuals, there's a bin smaller than
zero. What does this bin represent, and why is it necessary?
1. P-values are an important part of model interpretation, particularly in
science, but we're not going to talk much about them here. <https://xkcd.com/882/>
## Other predictors
In most cases a single variable is not enough to generate a useful model. Instead we need multiple variables in the model, which you can include with either `+` or `*` in the modelling formula.
The distinction between `+` and `*` is important:
* `x + y` adds the variables independently: the model will estimate the
effect of `x` and the effect of `y` completely separately.
* `x * y` estimates the effects of `x` and `y` together, including their interaction: the effect of `x` is allowed to vary with the value of `y`, and vice versa.
We'll explore what this means in the sections below, as we add more categorical and continuous variables to the dataset.
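As a quick preview, here's a sketch using the simulated datasets from the start of the chapter (`sim3` was generated without an interaction, `sim4` with one); `mod_add` and `mod_int` are just illustrative names:

```{r}
# Independent effects: one slope for x1, plus a separate intercept shift
# for each level of x2
mod_add <- lm(y ~ x1 + x2, data = sim3)
coef(mod_add)

# Interaction: the slope for x1 is allowed to differ between levels of x2
mod_int <- lm(y ~ x1 * x2, data = sim4)
coef(mod_int)
```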
### Missing values
### Categorical data
#### Factors
R stores categorical data as factors. If you add a string to a model, R will convert it to a factor for the purposes of the model.
A factor is an integer vector with a levels attribute. You can make a factor with `factor()`.
```{r}
fac <- factor(c("c", "a", "b"),
levels = c("a", "b", "c"),
labels = c("blond", "brunette", "red"))
fac
unclass(fac)
```
Each level of the factor (i.e. unique value) is encoded as an integer and displayed with the label that is associated with that integer.
If you use factors outside of a model, you will notice some limiting behaviour:
* You cannot add values to a factor that do not appear in its levels.
* Factors retain all of their levels when you subset them. To avoid this, use `drop = TRUE`:
```{r}
fac[1]
fac[1, drop = TRUE]
```
* If you coerce a factor to a number with `as.numeric()`, R will convert the integer vector that underlies the factor to a number, not the level labels that you see when you print the factor.
```{r}
num_fac <- factor(1:3, levels = 1:3, labels = c("100", "200", "300"))
num_fac
as.numeric(num_fac)
```
To coerce the labels that you see to a new data type, first coerce the factor to a character vector with `as.character()`:
```{r}
as.numeric(as.character(num_fac))
```
#### Interpretation
Add categorical variables to a model in the same way that you would add continuous variables.
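For example, here's a sketch using the `sim2` dataset from earlier, where `x` is categorical. Behind the scenes R expands the factor into indicator (dummy) variables, which you can inspect with `model.matrix()`:

```{r}
sim2_mod <- lm(y ~ x, data = sim2)
coef(sim2_mod)

# The model matrix shows the dummy-variable encoding: one column per
# non-reference level of x
head(model.matrix(y ~ x, data = sim2))
```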
### Nested variables
Another case that occasionally crops up is nested variables: you have an identifier that is locally unique, not globally unique. For example, you might have this data about students in schools:
```{r}
students <- tibble::tribble(
  ~student_id, ~school_id,
  1, 1,
  2, 1,
  1, 2,
  1, 3,
  2, 3,
  3, 3
)
```
The student id only makes sense in the context of the school: it doesn't make sense to generate every combination of student and school. You can use `nesting()` for this case:
```{r}
students %>% expand(nesting(school_id, student_id))
```
### Interpolation vs extrapolation
One danger with prediction plots is that it's easy to make predictions that are far away from the original data. This is dangerous because it's quite possible that the model (which is a simplification of reality) will no longer apply far away from observed values.
As the number of variables in your model grows, you run into the "curse of dimensionality": as the number of variables increases, the average distance between data points increases, so most of the space is very sparse and you have to rely on strong assumptions.
To help avoid this problem, it's good practice to include "nearby" observed data points in any prediction plot. These help you see whether you're interpolating, making predictions "in between" existing data points, or extrapolating, making predictions about previously unobserved slices of the data.
One way to do this is to use `condvis::visualweight()`.
<https://cran.rstudio.com/web/packages/condvis/>
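As a simple illustration of the danger (a sketch using the `sim1` model fitted earlier), nothing stops you from asking for predictions far outside the observed range of `x`, and nothing in the output warns you that they are extrapolations:

```{r}
range(sim1$x)

# Predictions at x values far outside the observed range of 1 to 10
tibble(x = c(-100, 0, 5, 100)) %>%
  add_predictions(sim1_mod)
```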
## Response variations
### Transformations
### Generalised linear models
So far the response variable in our models has been continuous. You can use linear regression to model a categorical response by transforming it onto a continuous scale with a _link function_. You then fit the model on that scale and use the link function to back-transform and interpret the results.
The most common link function is the logit, which maps a binary response (via its probability) onto an unbounded continuous scale.
Use `glm()` to perform logistic regression in R.
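Here's a minimal sketch with a small simulated binary response (`sim_logit` is made up for illustration; it isn't one of the datasets defined above):

```{r}
# Simulate a binary response whose probability depends on x through the
# logit link, then fit a logistic regression with glm()
sim_logit <- tibble(
  x = runif(100, -3, 3),
  y = rbinom(100, size = 1, prob = plogis(-1 + 2 * x))
)
sim_logit_mod <- glm(y ~ x, data = sim_logit, family = binomial)
coef(sim_logit_mod)
```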
## Other model families
### Robust models
Iteratively re-fit the model, down-weighting outlying points (points with large residuals).
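One readily available implementation is `MASS::rlm()` (one option among several robust-regression tools). A minimal sketch, adding an artificial outlier to a copy of `sim1`:

```{r}
# Copy sim1 and corrupt one observation
sim1_out <- sim1
sim1_out$y[1] <- 50

# Ordinary least squares is pulled towards the outlier...
coef(lm(y ~ x, data = sim1_out))
# ...while the robust fit down-weights it
coef(MASS::rlm(y ~ x, data = sim1_out))
```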
### Additive models
```{r eval = FALSE}
library(mgcv)  # gam() lives in the mgcv package

# Schematic formulas: here `df` stands for a data frame with columns
# y, x, and z

# Smooth x, linear z
gam(y ~ s(x) + z, data = df)
# Smooth x and smooth z
gam(y ~ s(x) + s(z), data = df)
# Smooth surface of x and z
# (a smooth function that takes both x and z)
gam(y ~ s(x, z), data = df)
```
### Random forests
## Summary