# Model assessment

```{r setup-model, include=FALSE}
library(purrr)
set.seed(1014)
options(digits = 3)
```

* Some discussion of p-values.
* Bootstrapping to understand uncertainty in parameters.
* Cross-validation to understand predictive quality.
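
As a placeholder for the bootstrapping item, here is a minimal sketch, assuming the same `mpg ~ wt` model on `mtcars` that is used later in this chapter. The helper `boot_slope()` and the choice of 500 resamples are illustrative, not a recommendation.

```{r}
library(purrr)

# Hypothetical helper: refit the model on one bootstrap resample
# (rows drawn with replacement) and return the slope for wt.
boot_slope <- function(df) {
  rows <- sample(nrow(df), replace = TRUE)
  fit <- lm(mpg ~ wt, data = df[rows, , drop = FALSE])
  coef(fit)[["wt"]]
}

# 500 resamples is an arbitrary choice for illustration
boot_slopes <- map_dbl(1:500, ~ boot_slope(mtcars))

# Percentile interval: the middle 95% of the resampled slopes
quantile(boot_slopes, c(0.025, 0.975))
```

The spread of `boot_slopes` gives a rough sense of how much the slope would vary across datasets like this one.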


## Purrr + dplyr

That is, how dplyr and purrr work together.

* Why use a data frame?
* List columns in a data frame
* Mutate & filter.
* Creating list columns with `group_by()` and `do()`.
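
The items above can be sketched in a few lines. This is only an illustration, assuming one `mpg ~ wt` model per `cyl` group of `mtcars`; the column names `mod` and `slope` and the cut-off of -3 are made up for the example.

```{r}
library(dplyr)
library(purrr)

# group_by() + do() with a named argument yields one row per group,
# with the fitted models stored in a list column called `mod`.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ wt, data = .))
by_cyl

# mutate() pulls a number out of each model with purrr;
# filter() then works on that ordinary column.
by_cyl %>%
  ungroup() %>%
  mutate(slope = map_dbl(mod, ~ coef(.x)[["wt"]])) %>%
  filter(slope < -3)
```

Because the models sit alongside `cyl` in the same data frame, filtering the rows keeps each model paired with the group it came from.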

## Multiple models

A natural application of `map2()` is handling test-training pairs during model evaluation. This is an important modelling technique: you should never evaluate a model on the same data it was fit to, because the results will make you overconfident. Instead, divide the data, using one piece to fit the model and the other to evaluate it. A popular family of techniques for this is cross-validation. In the variant used here, you randomly hold out a fixed proportion of the data and fit the model to the rest; because any single split is subject to random variation, you repeat this many times. (Strictly speaking, k-fold cross-validation instead divides the data into k roughly equal parts and holds out each part exactly once.)
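
For comparison, k-fold cross-validation as usually defined assigns every row to exactly one of k folds, so each observation is held out once. A minimal sketch, assuming `k = 5` and the `mpg ~ wt` model on `mtcars`; the names `fold_id` and `fold_rmse` are illustrative.

```{r}
library(purrr)

k <- 5
n <- nrow(mtcars)

# Shuffle fold labels 1..k across the rows; fold sizes differ by at most one.
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, then score the held-out fold.
fold_rmse <- map_dbl(seq_len(k), function(i) {
  held_out <- fold_id == i
  fit <- lm(mpg ~ wt, data = mtcars[!held_out, , drop = FALSE])
  predicted <- predict(fit, newdata = mtcars[held_out, , drop = FALSE])
  sqrt(mean((mtcars$mpg[held_out] - predicted) ^ 2))
})

fold_rmse
mean(fold_rmse)
```

With only 32 rows the per-fold errors are noisy, which is one reason to prefer many repeated random splits on a dataset this small.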

Why you should store related vectors (even if they're lists!) in a
data frame. Need example that has some covariates so you can (e.g.)
select all models for females, or under 30s, ...
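
One possible shape for such an example, sketched with `cyl` standing in for a covariate like sex or age group (`mtcars` has no such variables). The tibble layout and the threshold of 10 rows are assumptions made for illustration.

```{r}
library(purrr)
library(tibble)

# One data frame per cylinder group
groups <- unname(split(mtcars, mtcars$cyl))

# Covariates and fitted models live side by side; `mod` is a list column.
models <- tibble(
  cyl = map_dbl(groups, ~ .x$cyl[[1]]),
  n   = map_int(groups, nrow),
  mod = map(groups, ~ lm(mpg ~ wt, data = .x))
)
models

# Subsetting rows keeps each model attached to its covariates,
# which is hard to guarantee with separate parallel lists.
big <- models[models$n >= 10, ]
map_dbl(big$mod, ~ summary(.x)$r.squared)
```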

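Let's start by writing a function that randomly assigns each row of a dataset to the test set (`TRUE`) or the training set (`FALSE`), where `p` is the proportion of rows held out for testing: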

```{r}
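partition <- function(df, p) {
  n <- nrow(df)
  # Round to a whole number of test rows so the result always has length n
  # (rep() silently truncates fractional counts, e.g. 3.2 and 28.8 give 31)
  n_test <- round(n * p)
  groups <- rep(c(TRUE, FALSE), c(n_test, n - n_test))
  sample(groups)
}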
partition(mtcars, 0.1)
```

We'll generate 200 random test-training splits, and then create lists of test and training datasets:

```{r}
partitions <- rerun(200, partition(mtcars, 0.25))

tst <- partitions %>% map(~mtcars[.x, , drop = FALSE])
trn <- partitions %>% map(~mtcars[!.x, , drop = FALSE])
```

Then fit the models to each training dataset:

```{r}
mod <- trn %>% map(~lm(mpg ~ wt, data = .))
```

If we wanted, we could extract the coefficients using broom, and make a single data frame with `map_df()` and then visualise the distributions with ggplot2:

```{r}
coef <- mod %>%
  map_df(broom::tidy, .id = "i")
coef

library(ggplot2)
ggplot(coef, aes(estimate)) +
  geom_histogram(bins = 10) +
  facet_wrap(~term, scales = "free_x")
```

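But we're most interested in the quality of the models, so we make predictions for each test dataset and compute the root-mean-squared error (RMSE) between predicted and actual values. For reference, we also compute the RMSE of a single model that is fit to, and evaluated on, the full dataset:

```{r}
pred <- map2(mod, tst, predict)
actl <- map(tst, "mpg")

# Root-mean-squared error between two numeric vectors
rmse <- function(x, y) sqrt(mean((x - y) ^ 2))
cv_rmse <- map2_dbl(pred, actl, rmse)
mean(cv_rmse)

# Baseline: fit and evaluate on all the data (in-sample, so optimistic)
base_mod <- lm(mpg ~ wt, data = mtcars)
base_rmse <- rmse(mtcars$mpg, predict(base_mod))
base_rmse

ggplot(data.frame(rmse = cv_rmse), aes(rmse)) +
  geom_histogram(binwidth = 0.25) +
  geom_vline(xintercept = base_rmse, colour = "red")
```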