r4ds/model-assess.Rmd

# Model assessment

```{r setup-model, include=FALSE}
library(purrr)
set.seed(1014)
options(digits = 3)
```

In this chapter, you'll turn the tools of multiple models towards model assessment: generating a succinct numerical summary of a model. This must always be done with care, as a single number will never tell you as much as a good visualisation, but when you're considering large numbers of models, you need some way to quickly weed out the duds.

We're going to use two main techniques in this chapter:

* Cross-validation to assess model quality. In cross-validation, you randomly
  split the data into test and training sets. You fit the data to the training
  set, and evaluate it on the test set. This avoids intrinsic bias of using the
  same data twice. 
  
* Boostrapping to assess model stability/variability. If you sample data from
  the same population multiple times, how much does your model vary? It's hard
  to go back and collect more data, so bootstrap uses a simple trick to 
  approximate data re-collection.

If you're competing in competitions, like Kaggle, that are predominantly about creating good predicitons, developing a good strategy for avoiding overfitting is very important. Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data.

There is a closely related family that uses a similar idea: model ensembles. However, instead of trying to find the best models, ensembles make use of all the models, acknowledging that even models that don't fit all the data particularly well can still model some subsets well. In general, you can think of model ensemble techniques as functions that take a list of models, and a return a single model that attempts to take the best part of each.

### Prerequisites

```{r setup, message = FALSE}
# Standard data manipulation and visulisation
library(dplyr)
library(ggplot2)

# Tools for working with models
library(broom)
library(modelr)
library(splines)

# Tools for working with lots of models
library(purrr)
library(tidyr)
```

## Overfitting

Both bootstrapping and cross-validation help us to spot and remedy the problem of __over fitting__, where the model fits the data we've seen so far extremely well, but does a bad job of generalising to new data.

A classic example of over-fitting is to use a spline with too many degrees of freedom.

Bias - variance tradeoff.  Simpler = more biased. Complex = more variable.  Occam's razor.

```{r}
true_model <- function(x) {
  1 + 2 * x + rnorm(length(x), sd = 0.25)
}

df <- data_frame(
  x = seq(0, 1, length = 20),
  y = true_model(x)
)

df %>% 
  ggplot(aes(x, y)) +
  geom_point()
```

We can create a model that fits this data incredibly well:

```{r}
library(splines)
my_model <- function(df) {
  lm(y ~ ns(x, 5), data = df)
}

mod <- my_model(df)
rmse(mod, df)

grid <- df %>% expand(x = seq_range(x, 50))
preds <- grid %>% add_predictions(y = mod)

df %>% 
  ggplot(aes(x, y)) +
  geom_line(data = preds) + 
  geom_point()
```

But do you think this model will do well if we apply it to new data from the same population? 

This case is a simulation, so we could just resimulate data from the same process and see how well it does:

```{r}
df2 <- df %>% mutate(y = true_model(x))
rmse(mod, df2)
```

Obviously it does much worse.  But in real-life you can't easily go out and recollect your data. There are two approach to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.

```{r}
boots <- rerun(100, df %>% mutate(y = true_model(x)))
mods <- map(boots, my_model)
preds <- map2_df(list(grid), mods, ~ add_predictions(.x, y = .y), .id = "id")

preds %>% 
  ggplot(aes(x, y, group = id)) +
  geom_line(alpha = 1/3)
```

```{r}
boot <- bootstrap(df, 100)
mods <- boot$strap %>% map(safely(my_model)) %>% transpose()

ok <- mods$error %>% map_lgl(is.null)
```

```{r}
preds <- map2_df(list(grid), mods$result[ok], ~ add_predictions(.x, y = .y), .id = "id")

preds %>% 
  ggplot(aes(x, y, group = id)) +
  geom_line(alpha = 1/3)
```

(You might notice that while each individual model varies a lot, the average of all the models seems like it's pretty good. That gives rise to a model ensemble technique called model averaging.)

We could instead use cross-validation to focus on a summary of model quality. It basically works like this:

```{r}
cv <- crossv_mcmc(df, 100, test = 0.3)

mods <- map(cv$train, my_model)
rmses <- map2_dbl(mods, cv$test, rmse)

data_frame(x = rmses) %>% 
  ggplot(aes(x)) +
  geom_vline(xintercept = rmse(mod, df), colour = "white", size = 2) +
  geom_freqpoly(binwidth = 0.05) +
  geom_rug()
```

### Exercises

1.  Why can we only fit a model with spline containing 8 degrees of freedom?
    Why not 9 or 10?
Make sure first element is heading 2015-12-12 02:34:20 +08:00			`# Model assessment`

Local bookdown working 2015-12-12 03:28:10 +08:00			```{r setup-model, include=FALSE}
Add needed packages to model-assess 2015-12-07 23:06:19 +08:00			`library(purrr)`
			`set.seed(1014)`
			`options(digits = 3)`
			```

Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`In this chapter, you'll turn the tools of multiple models towards model assessment: generating a succinct numerical summary of a model. This must always be done with care, as a single number will never tell you as much as a good visualisation, but when you're considering large numbers of models, you need some way to quickly weed out the duds.`
Add needed packages to model-assess 2015-12-07 23:06:19 +08:00
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`We're going to use two main techniques in this chapter:`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`* Cross-validation to assess model quality. In cross-validation, you randomly`
			`split the data into test and training sets. You fit the data to the training`
			`set, and evaluate it on the test set. This avoids intrinsic bias of using the`
			`same data twice.`

			`* Boostrapping to assess model stability/variability. If you sample data from`
			`the same population multiple times, how much does your model vary? It's hard`
			`to go back and collect more data, so bootstrap uses a simple trick to`
			`approximate data re-collection.`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`If you're competing in competitions, like Kaggle, that are predominantly about creating good predicitons, developing a good strategy for avoiding overfitting is very important. Otherwise you risk tricking yourself into thinking that you have a good model, when in reality you just have a model that does a good job of fitting your data.`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`There is a closely related family that uses a similar idea: model ensembles. However, instead of trying to find the best models, ensembles make use of all the models, acknowledging that even models that don't fit all the data particularly well can still model some subsets well. In general, you can think of model ensemble techniques as functions that take a list of models, and a return a single model that attempts to take the best part of each.`

			`### Prerequisites`

			```{r setup, message = FALSE}
			`# Standard data manipulation and visulisation`
			`library(dplyr)`
			`library(ggplot2)`

			`# Tools for working with models`
			`library(broom)`
			`library(modelr)`
Update for modelr API changes 2016-06-15 21:27:35 +08:00			`library(splines)`
Brain dump on model assessment 2016-06-14 22:02:17 +08:00
			`# Tools for working with lots of models`
			`library(purrr)`
			`library(tidyr)`
			```

			`## Overfitting`

			`Both bootstrapping and cross-validation help us to spot and remedy the problem of __over fitting__, where the model fits the data we've seen so far extremely well, but does a bad job of generalising to new data.`

			`A classic example of over-fitting is to use a spline with too many degrees of freedom.`

			`Bias - variance tradeoff. Simpler = more biased. Complex = more variable. Occam's razor.`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
			```{r}
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`true_model <- function(x) {`
			`1 + 2 * x + rnorm(length(x), sd = 0.25)`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00			`}`
Brain dump on model assessment 2016-06-14 22:02:17 +08:00
			`df <- data_frame(`
			`x = seq(0, 1, length = 20),`
			`y = true_model(x)`
			`)`

			`df %>%`
			`ggplot(aes(x, y)) +`
			`geom_point()`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00			```

Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`We can create a model that fits this data incredibly well:`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
			```{r}
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`library(splines)`
			`my_model <- function(df) {`
			`lm(y ~ ns(x, 5), data = df)`
			`}`

			`mod <- my_model(df)`
			`rmse(mod, df)`

			`grid <- df %>% expand(x = seq_range(x, 50))`
			`preds <- grid %>% add_predictions(y = mod)`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`df %>%`
			`ggplot(aes(x, y)) +`
			`geom_line(data = preds) +`
			`geom_point()`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00			```

Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`But do you think this model will do well if we apply it to new data from the same population?`

			`This case is a simulation, so we could just resimulate data from the same process and see how well it does:`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
			```{r}
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`df2 <- df %>% mutate(y = true_model(x))`
			`rmse(mod, df2)`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00			```

Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`Obviously it does much worse. But in real-life you can't easily go out and recollect your data. There are two approach to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
			```{r}
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`boots <- rerun(100, df %>% mutate(y = true_model(x)))`
			`mods <- map(boots, my_model)`
			`preds <- map2_df(list(grid), mods, ~ add_predictions(.x, y = .y), .id = "id")`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`preds %>%`
			`ggplot(aes(x, y, group = id)) +`
			`geom_line(alpha = 1/3)`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00			```

Brain dump on model assessment 2016-06-14 22:02:17 +08:00			```{r}
Update for modelr API changes 2016-06-15 21:27:35 +08:00			`boot <- bootstrap(df, 100)`
			`mods <- boot$strap %>% map(safely(my_model)) %>% transpose()`
Brain dump on model assessment 2016-06-14 22:02:17 +08:00
			`ok <- mods$error %>% map_lgl(is.null)`
			```
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
			```{r}
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`preds <- map2_df(list(grid), mods$result[ok], ~ add_predictions(.x, y = .y), .id = "id")`

			`preds %>%`
			`ggplot(aes(x, y, group = id)) +`
			`geom_line(alpha = 1/3)`
			```
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`(You might notice that while each individual model varies a lot, the average of all the models seems like it's pretty good. That gives rise to a model ensemble technique called model averaging.)`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			`We could instead use cross-validation to focus on a summary of model quality. It basically works like this:`
Extract out multiple models to modelling chapter 2015-12-06 17:11:52 +08:00
Brain dump on model assessment 2016-06-14 22:02:17 +08:00			```{r}
Update for modelr API changes 2016-06-15 21:27:35 +08:00			`cv <- crossv_mcmc(df, 100, test = 0.3)`
Brain dump on model assessment 2016-06-14 22:02:17 +08:00
Update for modelr API changes 2016-06-15 21:27:35 +08:00			`mods <- map(cv$train, my_model)`
			`rmses <- map2_dbl(mods, cv$test, rmse)`
Brain dump on model assessment 2016-06-14 22:02:17 +08:00
			`data_frame(x = rmses) %>%`
			`ggplot(aes(x)) +`
			`geom_vline(xintercept = rmse(mod, df), colour = "white", size = 2) +`
			`geom_freqpoly(binwidth = 0.05) +`
			`geom_rug()`
			```

			`### Exercises`

			`1. Why can we only fit a model with spline containing 8 degrees of freedom?`
			`Why not 9 or 10?`