r4ds/model-basics.Rmd

# Model basics

## Introduction

The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).

SEGUE

This chapter of the book is unique because it only uses simulated datasets. These datasets are very simple, and not at all interesting, but they will help you understand the essence of modelling before you apply the same techniques to real data in the next chapter.

SEGUE

A fitted model is a function: you give it a set of inputs (often called the predictors), and it returns a value (typically called the prediction). Fitting a model takes place in two steps:

1.  You define a __family of models__ that you think will capture an
    interesting pattern in your data. This class will look like a mathematical
    formula that relates your predictors to the response. For a simple 1d
    model, this might look like `y = a * x + b` or `y = a * x ^ b`.

1.  You generate a __fitted model__ by finding the one model from the family
    that is the closest to your data. This takes the generic model family
    and makes it specific, like `y = 3 * x + 7` or `y = 9 * x ^ 2`.

It's important to understand that a fitted model is just the closest model from a family of models. That implies that you have the "best" model (according to some criteria); it doesn't imply that you have a good model and it certainly doesn't imply that the model is "true". George Box puts this well in his famous aphorism that you've probably heard before:

> All models are wrong, but some are useful.

It's worth reading the fuller context of the quote:

> Now it would be very remarkable if any system existing in the real world
> could be exactly represented by any simple model. However, cunningly chosen
> parsimonious models often do provide remarkably useful approximations. For
> example, the law PV = RT relating pressure P, volume V and temperature T of
> an "ideal" gas via a constant R is not exactly true for any real gas, but it
> frequently provides a useful approximation and furthermore its structure is
> informative since it springs from a physical view of the behavior of gas
> molecules.
>
> For such a model there is no need to ask the question "Is the model true?".
> If "truth" is to be the "whole truth" the answer must be "No". The only
> question of interest is "Is the model illuminating and useful?".

The goal of a model is not to uncover Truth, but to discover a simple approximation that is still useful. This tension is at the heart of modelling: you want your model to be simpler than reality (so you can understand it more easily) while still generating good predictions. There is almost always a tradeoff between understandability and quality of predictions.

### Prerequisites

We need a couple of packages specifically designed for modelling, and all the packages you've used before for EDA. I also recommend setting some options that tweak the default linear model behaviour to be a little less surprising.

```{r setup, message = FALSE}
# Modelling functions
library(modelr)
library(broom)

# Modelling requires plently of visualisation and data manipulation
library(ggplot2)
library(dplyr)
library(tidyr)

# Options that make your life easier
options(
  contrasts = c("contr.treatment", "contr.treatment"),
  na.option = na.exclude
)
```

```{r, include = FALSE}
# (to be moved into modelr)

# 1 con
sim1 <- tibble(
  x = rep(1:10, each = 3),
  y = x * 2 + 5 + rnorm(length(x), sd = 2)
)

# 1 cat
means <- c(a = 1, b = 8, c = 6, d = 2)
sim2 <- tibble(
  x = rep(names(means), each = 10),
  y = means[x] + rnorm(length(x))
)

# 1 cat + 1 con with interaction
slopes <- c(a = -0.1, b = -0.8, c = -0.1, d = 0.2)
sim4 <-
  tidyr::crossing(x1 = 1:10, x2 = factor(names(means)), rep = 1:3) %>%
  mutate(y = means[x2] + slopes[x2] * x1 + rnorm(length(x1)), sd = 2)

# 2 con
sim5 <-
  tidyr::crossing(x1 = 1:10, x2 = 1:10) %>%
  mutate(
    y = 2 * x1 - 3 * x2 + rnorm(length(x1), sd = 3)
  )

# splines
```

## A simple model

Lets take a look at the simulated dataset `sim1`. It contains two continuous variables, `x` and `y`, which are related in a fairly straightforward way:

```{r}
ggplot(sim1, aes(x, y)) +
  geom_point()
```

You can see a strong pattern in the data. We can use a model to capture that pattern and make it explicit. It's our job to supply the basic form of the model. In this case, the relationship looks linear, i.e. `y = a * x + b`. To fit the model, we need to find the single model from the family that does the best job for this data.

Let's start by getting a feel for that family of models by randomly generating a few and overlaying them on the data. For this simple case, we can use `geom_abline()` which takes a slope and intercept as paramaters. Later on we'll learn more general techniques that work with any model.

```{r}
models <- tibble(
  a = runif(250, -5, 5),
  b = runif(250, -20, 80)
)

ggplot(sim1, aes(x, y)) +
  geom_abline(aes(slope = a, intercept = b), data = models, alpha = 1/4) +
  geom_point()
```

Of the 250 random models some seem much better than others. Intuitively, a good model will be close to the data. So to fit the model, finding the best model from this family, we need to make our intution precise. We need some way to measure the distance between the data and a model.

First, we need a way to generate the predicted y values given a model and some data. Here we represent the model as a numeric vector of length two.

```{r}
make_prediction <- function(mod, data) {
  data$x * mod[1] + mod[2]
}
make_prediction(c(2, 5), sim1)
```

Next, we need some way to compute an overall distance between the predicted and actual y for each observation. One common way to do this in statistics is the to use the "root-mean-squared deviation". We compute the difference between actual and predicted, square them, average them, and the take the square root. This distance has lots of appealing mathematical properties, which we're not going to talk about here. You'll just have to take my word for it!

```{r}
measure_distance <- function(mod, data) {
  diff <- data$y - make_prediction(mod, data)
  sqrt(mean(diff ^ 2))
}
measure_distance(c(2, 5), sim1)
```

Now we can use purrrr to compure the distance for all the models defined above. We need a helper function because our distance function expects the model as a numeric vector of length 2.

```{r}
sim1_dist <- function(a, b) {
  measure_distance(c(a, b), sim1)
}

models <- models %>% mutate(dist = purrr::map2_dbl(a, b, sim1_dist))
models
```

Next, let's overlay the 10 best models on to the data. I've coloured the models by `-dist`: this is an easy way to make sure that the best models (i.e. the ones with the smallest distance) get the brighest colours.

```{r}
ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(
    aes(slope = a, intercept = b, colour = -dist),
    data = filter(models, rank(dist) <= 10)
  )
```

Another way of looking to think about these models is to draw a scatterplot of `a` a `b`, again coloured colour by `-dist`. We can no longer see how the model compares to the data, but we can see many models at once. Again, I've highlight the 10 best models, this time by drawing red circles underneath them.

```{r}
ggplot(models, aes(a, b)) +
  geom_point(data = filter(models, rank(dist) < 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))
```

Instead of trying lots of random models, we could be more systematic and generate an evenly spaced grid of points, a so called grid search. I picked the parameters of the grid rougly by looking at where the best models where in the plot above.

```{r}
grid <- expand.grid(
  a = seq(1, 3, length = 25),
  b = seq(-5, 20, length = 25)
  ) %>%
  mutate(dist = purrr::map2_dbl(a, b, sim1_dist))

grid %>%
  ggplot(aes(a, b)) +
  geom_point(data = filter(grid, rank(dist) < 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))
```

When you overlay the best 10 models back on the original data, they all look pretty good:

```{r}
ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(
    aes(slope = a, intercept = b, colour = -dist),
    data = filter(grid, rank(dist) <= 10)
  )
```

Now you could imagine iteratively making the grid finer and finer until you narrowed in on the best model. But there's a better way to tackle that problem: a numerical minimisation tool called Newton-Rhapson search. The intuition of Newton-Rhapson is pretty simple: you pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and then repeat again and again, until you can't go any lower. In R, we can do that with `optim()`:

```{r}
best <- optim(c(0, 0), measure_distance, data = sim1)
best$par

ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(slope = best$par[1], intercept = best$par[2])
```

Don't worry too much about the details - it's the intutition that's important here. If you have a function that defines the distance between a model and a dataset you can use existing mathematical tools to find the best model. The neat thing about this approach is that it will work for family of models that you can write an equation for.

However, this particular model is a special case of a broader family: linear models. A linear model has the general form `y = a_1 * x_1 + a_2 * x_2 + ... + a_n * x_n`. So this simple model is equivalent to a general linear model where n is n, `a_1` is `a`, `x_1` is `x`, `a_2` is `b` and `x_2` is a constant, 1. R has a tool specifically designed for linear models called `lm()`. `lm()` has a special way to specify the model family: a formula like `y ~ x` which `lm()` translates to `y = a * x + b`. We can fit the model and look at the output:

```{r}
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```

These are exactly the same values we got with `optim()`! However, behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connection between geometry, calculus, and  linear algebra, it actually finds the closest model by (effectively) inverting a matrix.

### Exercises

1.  One downside of the linear model is that it is sensitive to unusual values
    because the distance incorporates a squared term. Fit a linear model to
    the simulated data below, and visualise the results. Rerun a few times to
    generate different simulated datasets. What do you notice about the model?

    ```{r}
    sim1a <- tibble(
      x = rep(1:10, each = 3),
      y = x * 1.5 + 6 + rt(length(x), df = 2)
    )
    ```

1.  One way to make linear models more robust is to use a different distance
    measure. For example, instead of root-mean-squared distance, you could use
    mean-absolute distance:

    ```{r}
    measure_distance <- function(mod, data) {
      diff <- data$y - make_prediction(mod, data)
      mean(abs(diff))
    }
    ```

    Use `optim()` to fit this model to the simulated data above and compare it
    to the linear model.

## Linear models

For simple models, like the one above, you can figure out what the model says about the data by carefully studying the coefficients. And if you ever take a statistics course on modelling, you're likely to spend a lot of time doing just that. Here, however, we're going to take a different tack. In this book, we're going to focus on understanding a model by looking at its predictions. This has a big advantage: every type of model makes predictions (otherwise what use would it be?) so we can use the same set of techniques to understand simple linear models or complex random forrests. We'll see that advantage later on when we explore some other families of models.

We are also going to take advantage of a powerful feature of linear models: they are additive. That means you can partition the data into pattern and residuals. This allows us to see what subtler patterns remain after we have removed the biggest trned.

### Predictions

To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies. The easiest way to do that is to use `tidyr::expand()`. Its first argument is a data frame, and for each subsequent argument it finds the unique variables and then generates all combinations:

```{r}
grid <- sim1 %>% expand(x)
grid
```

(This will get more interesting when we start to add more variables to our model.)

Next we add predictions. We'll use `modelr::add_predictions()` which takes a data frame and a model. It adds the predictions from the model to the data frame:

```{r}
grid <- grid %>% add_predictions(sim1_mod)
grid
```

Next, we plot the predictions. Here the choice of plot is easy because we only have two variables. In general, however, figuring out how to the visualise the predictions can be quite challenging, and you'll often need to try a few alternatives before you get the most useful plot. For more complex models, you're likely to need multiple plots, not just one.

```{r}
ggplot(sim1, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred), data = grid, colour = "red", size = 1)
```

Note that this plot takes advantage of ggplot2's ability to plot different datasets with different aesthetics on different layers.

### Residuals

The flip-side of predictions are __residuals__. The predictions tells you the pattern that the model has captured, and the residuals tell you what the model has missed. The residuals are just the distances between the observed and predicted values that we computed above!

We add residuals to the data with `add_residuals()`, which works much like `add_predictions()`. Note, however, that we use the original dataset, not a manufactured grid. Otherwise where would you get the value of the response?

```{r}
sim1 <- sim1 %>% add_residuals(sim1_mod)
```

There are a few different ways to understand what the residuals tell us about the model. One way is to simply draw a frequency polygon to help us understand the spread of the residuals:

```{r}
ggplot(sim1, aes(resid)) +
  geom_freqpoly(binwidth = 0.5)
```

(I prefer the frequency polygon over the histogram here because it makes it easier to display other categorical variables on the same plot.)

Another way is to plot the residuals against the other variables in the dataset:

```{r}
ggplot(sim1, aes(x, resid)) +
  geom_point()
```

There doesn't seem to be any pattern remaining here.

### Exercises

1.  Instead of using `lm()` to fit a straight line, you can use `loess()`
    to fit a smooth curve. Repeat the process of model fitting,
    grid generation, predictions, and visualisation on `sim1` using
    `loess()` instead of `lm()`. How does the result compare to
    `geom_smooth()`?

## Formulas and model families

The linear model formula gives a short-hand for writing out model specifications.

* `y ~ x` -> `y = a_0 + a_1 * x`
* `y ~ x - 1` -> `y = a_1 * x`

If you want to

* `y ~ x1 + x2` -> `y = a_0 + a_1 * x1 + a_2 * x2`

These each estimate the each of `x1` and `x2`

* `y ~ x1 * x2` -> `y = a_0 + a_1 * a1 + a_2 * a2 + a_12 * a1 * a2`

* `y ~ x1 + x2 + x1:x2` is the same thing.

### Categorical variables

This is straightforward when the predictor is continuous, but things get a bit more complicated when the predictor is categorical. Imagine you have a formula like `y ~ sex`, where sex could either be male or female. It doesn't make sense to convert that to a formula like `y = x_0 + x_1 * sex` because `sex` isn't a number - you can't multiply it! Instead what R does is convert it to `y = x_0 + x_1 * sex_male` where `sex_male` is one if `sex` is male and 0 otherwise.

You might wonder why R doesn't convert it to `y = x_0 + x_1 * sex_male + x_2 * sex_female`. A little thought should reveal that this would be difficult because ...

The process of turning a categorical variable into a 0-1 matrix has different names. Sometimes the individual 0-1 columns are called dummy variables. In machine learning, it's called one-hot encoding. In statistics, the process is called creating a contrast matrix.  General example of "feature generation": taking things that aren't continuous variables and figuring out how to represent them.

Let's look at some data and models to make that concrete.

```{r}
ggplot(sim2) +
  geom_point(aes(x, y))
```

```{r}
mod2 <- lm(y ~ x, data = sim2)

grid <- sim2 %>%
  expand(x) %>%
  add_predictions(mod2)

ggplot(sim2, aes(x)) +
  geom_point(aes(y = y)) +
  geom_point(data = grid, aes(y = pred), colour = "red", size = 4)

```

Effectively, a model with a categorical `x` will predict the mean value for each category. (Why? Because the mean minimise the root-mean-squared distance.)

Note that (obviously) you can't make predictions about levels that you didn't observe. Sometimes you'll do this by accident so it's good to recognise this error message:

```{r, error = TRUE}
tibble(x = "e") %>% add_predictions(mod2)
```

### A continuous and categorical

What happens when you combine a continuous an a categorical variable?

```{r}
ggplot(sim4, aes(x1, y)) +
  geom_point(aes(colour = x2))
```

Let's explore two models:

```{r}
mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)
```

There are two tricks to our grid. We have two predictors, so we need to give `expand()` two variables. It finds all the unique values of `x1` and `x2` and then generates all combinations. We are also going to add the predictions from both models simulteanously using `spread_predictions()` which adds each prediction to a new column. The complement of `spread_predictions()` is `gather_predictions()` which uses rows instead of columns

```{r}
grid <- sim4 %>%
  expand(x1, x2) %>%
  gather_predictions(mod1, mod2)
grid
```

Let's visualise the results using facetting to display both models on one plot.

```{r}
ggplot(sim4, aes(x1, y, colour = x2)) +
  geom_point() +
  geom_line(data = grid, aes(y = pred)) +
  facet_wrap(~ model)
```

The model that uses `+` has the same slope for each line. The model that uses `*` has a different slope for each line.

Which model is better for this data? We can take look at the residuals, using the same idea as above. Here I've used facetting because it makes it easier to see the pattern within each group.

```{r}
sim4 <- sim4 %>%
  gather_residuals(mod1, mod2)

ggplot(sim4, aes(x1, resid, colour = x2)) +
  geom_point() +
  facet_grid(model ~ x2)
```

There is little obvious pattern in the residuals for `mod2`. The residuals for `mod1` show that the model has clearly missed some pattern in `b`, and less so, but still present is pattern in `c`, and `d`.

### Two continuous variables

As you move into three dimensions and higher, creating good visualisations gets harder and harder. But since we're using models to support vis of your data, it's not so bad. As you discover more patterns, you'll grow your model over time.

## Missing values


## Other model families

### Generalised linear models

Different distance metric (likelihood)

Use `glm()` to perform logistic regression in R.

### Additive models

`mgcv::gam()`.

### Penalised models

Add penalty to distance.

### Robust models

Distance that is less sensitive to outliers.

### Trees and forests

### Gradient boosting