More model basic tweaking

hadley 2016-07-25 15:06:57 -05:00
parent 1b9b326b73
commit ae4a383453
2 changed files with 61 additions and 48 deletions

View File

@ -43,23 +43,17 @@ The goal of a model is not to uncover truth, but to discover a simple approximat
### Prerequisites
We need a couple of packages specifically designed for modelling, and all the packages you've used before for EDA. I also recommend setting some options that tweak the default linear model behaviour to be a little less surprising.
We need a couple of packages specifically designed for modelling, and all the packages you've used before for EDA.
```{r setup, message = FALSE}
# Modelling functions
library(modelr)
library(broom)
# Modelling requires plenty of visualisation and data manipulation
# EDA tools
library(ggplot2)
library(dplyr)
library(tidyr)
# Options that make your life easier
options(
contrasts = c("contr.treatment", "contr.treatment"),
na.option = na.exclude
)
```
## A simple model
@ -327,15 +321,36 @@ This looks like random noise, suggesting that our model has done a good job of c
You've seen formulas before when using `facet_wrap()` and `facet_grid()`. In R, formulas provide a general way of getting "special behaviour". Rather than evaluating the values of the variables right away, they capture them so they can be interpreted by the function.
The majority of modelling functions in R use a standard conversion from formulas to functions. You've seen one simple conversion already: `y ~ x` is translated to `y = a_1 + a_2 * x`. You can make this arbitrarily more complicated by adding more variables: `y ~ x1 + x2 + x3` is translated to `y = a_1 + a_2 * x1 + a_3 * x2 + a_4 * x3`. R always adds an intercept term unless you specifically remove it: `y ~ x - 1` is translated to `y = a_1 * x`.
The majority of modelling functions in R use a standard conversion from formulas to functions. You've seen one simple conversion already: `y ~ x` is translated to `y = a_1 + a_2 * x`. If you want to see what R actually does, you can use the `model_matrix()` function. It takes a data frame and a formula and returns a tibble that defines the model equation: each column in the output is associated with one coefficient in the model, so the function is always of the form `y = a_1 * out_1 + a_2 * out_2 + ...`. For the simplest case of `y ~ x1` this shows us something interesting:
The following sections explore how this plays out in more detail.
```{r}
df <- frame_data(
~y, ~x1, ~x2,
4, 2, 5,
5, 1, 6
)
model_matrix(df, y ~ x1)
```
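As a quick check of that claim (a sketch of my own, not from the original text), the columns of the model matrix above line up one-to-one with the coefficients that `lm()` estimates from the same formula and data:
```{r}
# Hypothetical check: the coefficient names from lm() match the columns of
# model_matrix(df, y ~ x1) shown above.
coef(lm(y ~ x1, data = df))
```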
R adds the intercept to the model simply by including a column that is full of ones. By default, R will always add this column. If you don't want that, you need to explicitly drop it with `-1`:
```{r}
model_matrix(df, y ~ x1 - 1)
```
The model matrix grows in an unsurprising way when you add more variables to the model:
```{r}
model_matrix(df, y ~ x1 + x2)
```
This formula notation is sometimes called "Wilkinson-Rogers notation", and was initially described in _Symbolic Description of Factorial Models for Analysis of Variance_, by G. N. Wilkinson and C. E. Rogers <https://www.jstor.org/stable/2346786>. It's worth digging up and reading the original paper if you'd like to understand the full details of the modelling algebra.
The following sections expand on how this formula notation works for categorical variables, interactions, and transformations.
### Categorical variables
Generating a function from a formula is straightforward when the predictor is continuous, but things get a bit more complicated when the predictor is categorical. Imagine you have a formula like `y ~ sex`, where sex could either be male or female. It doesn't make sense to convert that to a formula like `y = x_0 + x_1 * sex` because `sex` isn't a number - you can't multiply it! Instead what R does is convert it to `y = x_0 + x_1 * sex_male` where `sex_male` is one if `sex` is male and zero otherwise.
If you want to see what R actually does, you can use the `model_matrix()` function. It takes a data frame and a formula and returns a tibble that defines the model equation: each column in the output is associated with one coefficient in the model. This is useful if you ever want to understand exactly which equation is generated by your formula.
Generating a function from a formula is straightforward when the predictor is continuous, but things get a bit more complicated when the predictor is categorical. Imagine you have a formula like `y ~ sex`, where sex could either be male or female. It doesn't make sense to convert that to a formula like `y = x_0 + x_1 * sex` because `sex` isn't a number - you can't multiply it! Instead what R does is convert it to `y = x_0 + x_1 * sex_male` where `sex_male` is one if `sex` is male and zero otherwise:
```{r, echo = FALSE}
df <- frame_data(
@ -347,11 +362,9 @@ df <- frame_data(
model_matrix(df, response ~ sex)
```
The process of turning a categorical variable into a 0-1 matrix has different names. Sometimes the individual 0-1 columns are called dummy variables. In machine learning, it's called one-hot encoding. In statistics, the process is called creating a contrast matrix. It's a general example of "feature generation": taking things that aren't continuous variables and figuring out how to represent them.
You might wonder why R also doesn't create a `sexfemale` column. The problem is that would create a column that is perfectly predictable based on the other columns (i.e. `sexfemale = 1 - sexmale`). Unfortunately the exact details of why this is a problem are beyond the scope of this book, but basically it creates a model family that is too flexible, and will have infinitely many models that are equally close to the data.
You might wonder why R doesn't convert it to `y = x_0 + x_1 * sex_male + x_2 * sex_female`. This unfortunately can't work because we don't have enough information to figure out what the "intercept" is in this case. Understanding why this doesn't work in general requires some linear algebra ideas (like full rank), but in this simple case you can understand it with a small thought experiment. The estimate `x_0` corresponds to the global mean, `x_1` to the effect for males, and `x_2` to the effect for females. But all you can measure is the average response for males and the average response for females: that gives you two numbers from which to estimate three parameters, so there's no unique answer for the global mean.
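To make the problem concrete, here's a small sketch of my own (the data frame `df_sex` is hypothetical, not from the text): if you code both dummy variables by hand, the intercept column is exactly `sex_male + sex_female`, so the model is over-parameterised and `lm()` reports `NA` for the redundant coefficient.
```{r}
# Hypothetical data with both dummy columns coded by hand
df_sex <- tibble::frame_data(
  ~response, ~sex_male, ~sex_female,
  2, 1, 0,
  5, 0, 1,
  1, 1, 0,
  3, 0, 1
)
# The intercept column equals sex_male + sex_female, so lm() can't estimate
# all three coefficients and returns NA for the redundant one
lm(response ~ sex_male + sex_female, data = df_sex)
```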
With the simpler function, `y = x_0 + x_1 * sex_male`, `x_0` corresponds to the average for females, and `x_1` to the difference between females and males. You'll always see this if you look at the coefficients of a model with a categorical variable: there is always one less coefficient than you expect. Fortunately, however, if you focus on visualising predictions you don't need to worry about these details. Let's look at some data and models to make that concrete. Here's the `sim2` dataset from modelr:
Fortunately, however, if you focus on visualising predictions you don't need to worry about the exact parameterisation. Let's look at some data and models to make that concrete. Here's the `sim2` dataset from modelr:
```{r}
ggplot(sim2) +
@ -383,9 +396,9 @@ You can't make predictions about levels that you didn't observe. Sometimes you'l
tibble(x = "e") %>% add_predictions(mod2)
```
### Interactions
### Interactions (continuous and categorical)
What happens when you combine a continuous an a categorical variable? `sim3` contains a categorical predictor and a continuous predictor. We can visualise it with a simple plot:
What happens when you combine a continuous and a categorical variable? `sim3` contains a categorical predictor and a continuous predictor. We can visualise it with a simple plot:
```{r}
ggplot(sim3, aes(x1, y)) +
@ -445,6 +458,8 @@ ggplot(sim3, aes(x1, resid, colour = x2)) +
There is little obvious pattern in the residuals for `mod2`. The residuals for `mod1` show that the model has clearly missed some pattern in `b`, and, to a lesser extent, in `c` and `d`. You might wonder if there's a precise way to tell which of `mod1` or `mod2` is better. There is, but it requires a lot of mathematical background, and we don't really care. Here, we're interested in a qualitative assessment of whether or not the model has captured the pattern that we're interested in.
### Interactions (two continuous)
Let's take a look at the equivalent model for two continuous variables. Initially things proceed almost identically to the previous example:
```{r}
@ -470,7 +485,7 @@ ggplot(grid, aes(x1, x2)) +
facet_wrap(~ model)
```
That doesn't suggest that the models are very different! But that's partly an illusion our eyes + brains are not very good at accurately comparing shades of colour. Instead of looking at the surface from the top, we could look at it from either side, showing multiple slices:
That doesn't suggest that the models are very different! But that's partly an illusion: our eyes and brains are not very good at accurately comparing shades of colour. Instead of looking at the surface from the top, we could look at it from either side, showing multiple slices:
```{r, asp = 1/2}
ggplot(grid, aes(x1, pred, colour = x2, group = x2)) +
@ -485,17 +500,29 @@ This shows you that interaction between two continuous variables works basically
You can see that even with just two continuous variables, coming up with good visualisations is hard. But that's reasonable: you shouldn't expect it will be easy to understand how three or more variables simultaneously interact! But again, we're saved a little because we're using models for exploration, and you can gradually build up your model over time. The model doesn't have to be perfect, it just has to help you reveal a little more about your data.
I spent some time looking at the residuals to see if I could figure out whether `mod2` did better than `mod1`. I think it does but it's pretty subtle. You'll have a chance to work on it in the exercises.
I spent some time looking at the residuals to see if I could figure out whether `mod2` did better than `mod1`. I think it does, but it's pretty subtle. You'll have a chance to work on it in the exercises.
### Transformations
You can also perform transformations inside the model formula. For example, `log(y) ~ sqrt(x1) + x2` is transformed to `log(y) = a_1 + a_2 * sqrt(x1) + a_3 * x2`. If your transformation involves `+`, `*`, `^`, or `-`, you'll need to wrap it in `I()` so R doesn't treat it like part of the model specification. For example, `y ~ x + I(x ^ 2)` is translated to `y = a_1 + a_2 * x + a_3 * x^2`. If you forget the `I()` and specify `y ~ x ^ 2 + x`, R will compute `y ~ x * x + x`. `x * x` means the interaction of `x` with itself, which is the same as `x`. R automatically drops redundant variables so `x + x` becomes `x`, meaning that `y ~ x ^ 2 + x` specifies the function `y = a_1 + a_2 * x`. That's probably not what you intended!
This formula notation is sometimes called "Wilkinson-Rogers notation", and was initially described in _Symbolic Description of Factorial Models for Analysis of Variance_, by G. N. Wilkinson and C. E. Rogers <https://www.jstor.org/stable/2346786>. It's worth digging up and reading the original paper if you'd like to understand the full details of the modelling algebra.
Again, if you get confused about what your model is doing, you can always use `model_matrix()` to see exactly what equation `lm()` is fitting:
```{r}
df <- tibble(y = 1:3, x = 1:3)
model_matrix(df, y ~ x^2 + x)
model_matrix(df, y ~ I(x^2) + x)
```
### Exercises
1. What happens if you repeat the analysis of `sim2` using a model without
an intercept? What happens to the model equation? What happens to the
predictions?
1. Use `model_matrix()` to explore the equations generated for the models
I fit to `sim3` and `sim4`. Why is `*` a good shorthand for interaction?
1. Using the basic principles, convert the formulas in the following two
models into functions. (Hint: start by converting the categorical variable
into 0-1 variables.)
@ -511,7 +538,7 @@ This formula notation is sometimes called "Wilkinson-Rogers notation", and was i
## Missing values
Missing values obviously can not convey any information:
Missing values obviously cannot convey any information about the relationship between the variables, so modelling functions will silently drop any rows that contain missing values:
```{r}
df <- tibble::frame_data(
@ -526,53 +553,44 @@ df <- tibble::frame_data(
mod <- lm(y ~ x, data = df)
```
Unfortunately this is one of the rare cases in R where missing values will go silently missing.
Unfortunately this is one of the rare cases in R where missing values will silently go missing, without any warning. You can spot their absence by comparing the number of rows in the data frame with the number of observations used by the model:
```{r}
nobs(mod)
nrow(df)
```
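As an aside (and assuming your version of modelr exports it), modelr provides an `na.warn` handler that turns this silent behaviour into a warning:
```{r}
# Assumes modelr::na.warn is available; with this option set, lm() warns
# about the rows it drops instead of discarding them silently.
options(na.action = na.warn)
mod_warn <- lm(y ~ x, data = df)
```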
Note that this is not the default behaviour of `predict()`, which normally just drops rows with missing values when generating predictions. That's what this setting does:
```{r}
options(na.option = na.exclude)
```
I don't really understand why it's called `na.exclude` when it causes missing values to be included in the predictions, but there you have it.
## Other model families
Here we've focused on linear models, which is a fairly limited space (but it does include a first-order linear approximation of any more complicated model).
This chapter has focussed exclusively on the class of linear models, which assume a relationship of the form `y = a_1 * x1 + a_2 * x2 + ... + a_n * xn`. Linear models additionally assume that the residuals have a normal distribution, which we haven't talked about. There is a large set of model classes that extend the linear model in various interesting ways. Some of them are:
Some extensions of linear models are:
* Generalised linear models, e.g. `stats::glm()`. Linear models assume that
* __Generalised linear models__, e.g. `stats::glm()`. Linear models assume that
the response is continuous and the error has a normal distribution.
Generalised linear models extend linear models to include non-continuous
responses (e.g. binary data or counts). They work by defining a distance
metric based on the statistical idea of likelihood.
* Generalised additive models, e.g. `mgcv::gam()`, extend generalised
* __Generalised additive models__, e.g. `mgcv::gam()`, extend generalised
linear models to incorporate arbitrary smooth functions. That means you can
write a formula like `y ~ s(x)` which becomes an equation like
`y = f(x)` and let `gam()` estimate what that function is (subject to some
smoothness constraints to make the problem tractable).
* Penalised linear models, e.g. `glmnet::glmnet()`, add a penalty term to
* __Penalised linear models__, e.g. `glmnet::glmnet()`, add a penalty term to
the distance which penalises complex models (as defined by the distance
between the parameter vector and the origin). This tends to make
models that generalise better to new datasets from the same population.
* Robust linear models, e.g. `MASS:rlm()`, tweak the distance to downweight
* __Robust linear models__, e.g. `MASS::rlm()`, tweak the distance to downweight
points that are very far away. This makes them less sensitive to the presence
of outliers, at the cost of being not quite as good when there are no
outliers.
* Trees, e.g. `rpart::rpart()`, attack the problem in a complete different
* __Trees__, e.g. `rpart::rpart()`, attack the problem in a completely different
way to linear models. They fit a piece-wise constant model, splitting the
data into progressively smaller and smaller pieces. Trees aren't terribly
effective by themselves, but they are very powerful when used in aggregated
by models like random forests (e.g. `randomForest::randomForest()`) or
in gradient boosting machines (e.g. `xgboost::xgboost`.)
effective by themselves, but they are very powerful when used in aggregate
by models like __random forests__ (e.g. `randomForest::randomForest()`) or
__gradient boosting machines__ (e.g. `xgboost::xgboost()`).
These models all work similarly from a programming perspective. Once you've mastered linear models, you should find it easy to master the mechanics of these other model classes.
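For example, here's a minimal sketch (my own illustration, not from the text) of a generalised linear model: the call looks just like `lm()`, a formula plus a data frame, with an extra `family` argument describing the response. The built-in `mtcars` data is used purely for illustration.
```{r}
# Logistic regression with the same formula + data interface as lm()
mod_glm <- glm(am ~ mpg, data = mtcars, family = binomial)
coef(mod_glm)
```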

View File

@ -42,11 +42,6 @@ library(modelr)
```{r}
library(modelr)
options(
contrasts = c("contr.treatment", "contr.treatment"),
na.option = na.exclude
)
library(nycflights13)
library(lubridate)
library(dplyr)