First (mostly) complete pass through model basics

hadley 2016-07-20 16:47:34 -05:00
parent e56032efe3
commit 87068f9a27
1 changed file with 149 additions and 76 deletions


@@ -63,38 +63,6 @@ options(
)
```
```{r, include = FALSE}
# (to be moved into modelr)

# 1 con
sim1 <- tibble(
  x = rep(1:10, each = 3),
  y = x * 2 + 5 + rnorm(length(x), sd = 2)
)

# 1 cat
means <- c(a = 1, b = 8, c = 6, d = 2)
sim2 <- tibble(
  x = rep(names(means), each = 10),
  y = means[x] + rnorm(length(x))
)

# 1 cat + 1 con with interaction
slopes <- c(a = -0.1, b = -0.8, c = -0.1, d = 0.2)
sim4 <-
  tidyr::crossing(x1 = 1:10, x2 = factor(names(means)), rep = 1:3) %>%
  mutate(y = means[x2] + slopes[x2] * x1 + rnorm(length(x1), sd = 2))

# 2 con
sim5 <-
  tidyr::crossing(x1 = 1:10, x2 = 1:10) %>%
  mutate(
    y = 2 * x1 - 3 * x2 + rnorm(length(x1), sd = 3)
  )

# splines
```
## A simple model
Let's take a look at the simulated dataset `sim1`. It contains two continuous variables, `x` and `y`, which are related in a fairly straightforward way:
@@ -318,26 +286,23 @@ There doesn't seem to be any pattern remaining here.
## Formulas and model families
The linear model formula provides a short-hand for writing out model specifications. You've seen formulas before when using `facet_wrap()` and `facet_grid()`: formulas provide a general way of getting "special behaviour" in R. Rather than evaluating the values of the variables right away, R captures them so they can be processed later by the function.
The majority of modelling functions in R use a standard conversion from formulas to functions. You've seen one simple conversion already: `y ~ x` is translated to `y = a_0 + a_1 * x`. You can make this arbitrarily more complicated by adding more variables: `y ~ x1 + x2 + x3` is translated to `y = a_0 + a_1 * x1 + a_2 * x2 + a_3 * x3`. Note that R always adds an intercept term. If you don't want it, you have to deliberately remove it: `y ~ x - 1` is translated to `y = a_1 * x`.
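One way to see exactly what R does with a formula is to look at the design matrix it builds. Here's a quick sketch using base `model.matrix()`, assuming the `sim1` data from above:

```{r}
# With the default formula R adds a column of 1s for the intercept;
# `- 1` removes it
head(model.matrix(y ~ x, data = sim1))
head(model.matrix(y ~ x - 1, data = sim1))
```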
When you add variables with `+`, the model will estimate each effect independently of all the others. It's possible to fit a so-called interaction by using `*`. For example, `y ~ x1 * x2` is translated to `y = a_0 + a_1 * x1 + a_2 * x2 + a_12 * x1 * x2`. Note that whenever you use `*`, both the interaction and the individual components are included in the model, so `y ~ x1 * x2` is equivalent to `y ~ x1 + x2 + x1:x2`.
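As a rough sketch of that equivalence (using `sim3`, which we'll meet properly below), you can fit both forms and compare the coefficients:

```{r}
# `*` expands to the main effects plus the interaction, so these two
# specifications estimate exactly the same coefficients
mod_star  <- lm(y ~ x1 * x2, data = sim3)
mod_colon <- lm(y ~ x1 + x2 + x1:x2, data = sim3)
coef(mod_star)
coef(mod_colon)
```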
You can also perform transformations inside the model formula. For example, `log(y) ~ sqrt(x1) + x2` is translated to `log(y) = a_0 + a_1 * sqrt(x1) + a_2 * x2`. If your transformation involves `+` or `*` you'll need to wrap it in `I()` so R doesn't treat it as part of the model specification. For example, `y ~ x + I(x ^ 2)` is translated to `y = a_0 + a_1 * x + a_2 * x^2`. If you forget the `I()` and specify `y ~ x^2 + x`, R considers this to be the same as `y ~ x * x + x`. `x * x` means the interaction of `x` with itself, which is the same as `x`. R automatically drops redundant variables, so `x + x` becomes `x`, meaning that `y ~ x ^ 2 + x` specifies the function `y = a_0 + a_1 * x`.
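Again, a quick sketch with `model.matrix()` (assuming `sim1`) shows the difference:

```{r}
# Without I(), `x ^ 2` is interpreted as formula algebra rather than
# arithmetic, so the squared term silently disappears
head(model.matrix(y ~ x + I(x ^ 2), data = sim1))
head(model.matrix(y ~ x ^ 2 + x, data = sim1))
```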
This formula notation is known as "Wilkinson-Rogers notation", and was initially described in _Symbolic Description of Factorial Models for Analysis of Variance_, G. N. Wilkinson and C. E. Rogers. Journal of the Royal Statistical Society. Series C (Applied Statistics). Vol. 22, No. 3 (1973), pp. 392-399. <https://www.jstor.org/stable/2346786>. It's worth digging up and reading the original paper if you'd like to understand the full details of the underlying algebra.
### Categorical variables
Generating a function from a formula is straightforward when the predictor is continuous, but things get a bit more complicated when the predictor is categorical. Imagine you have a formula like `y ~ sex`, where sex could either be male or female. It doesn't make sense to convert that to a formula like `y = x_0 + x_1 * sex` because `sex` isn't a number - you can't multiply it! Instead what R does is convert it to `y = x_0 + x_1 * sex_male`, where `sex_male` is 1 if `sex` is male and 0 otherwise.
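You can see this encoding directly with a quick sketch, assuming the `sim2` data from earlier in the chapter:

```{r}
# Every level of x except the reference level gets its own 0/1 column
head(model.matrix(y ~ x, data = sim2))
```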
You might wonder why R doesn't convert it to `y = x_0 + x_1 * sex_male + x_2 * sex_female`. This unfortunately can't work because we don't have enough information to figure out what the "intercept" is in this case. Understanding why this doesn't work in general requires some linear algebra (the idea of full rank), but in this simple case you can understand it with a small thought experiment. The estimate `x_0` corresponds to the global mean, `x_1` to the effect for males, and `x_2` to the effect for females. All you can measure is the average response for males and the average response for females: that's two numbers, which isn't enough to pin down three parameters.
With the simpler function, `y = x_0 + x_1 * sex_male`, `x_0` corresponds to the average for females, and `x_1` to the difference between males and females. You'll always see this if you look at the coefficients of a model with a categorical variable: there is always one fewer coefficient than you might expect. Fortunately, however, if you stick with visualising predictions from the model you don't need to worry about this.
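For example, here's a rough sketch with `sim2`, which has four levels but yields an intercept plus only three coefficients:

```{r}
# The intercept is the mean for the reference level "a"; the remaining
# coefficients are differences from it
coef(lm(y ~ x, data = sim2))
```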
The process of turning a categorical variable into a 0-1 matrix goes by different names. Sometimes the individual 0-1 columns are called dummy variables. In machine learning, it's called one-hot encoding. In statistics, the process is called creating a contrast matrix. These are all examples of "feature generation": taking things that aren't continuous variables and figuring out how to represent them in a model.
@@ -371,80 +336,188 @@ tibble(x = "e") %>% add_predictions(mod2)
### A continuous and a categorical variable
What happens when you combine a continuous and a categorical variable? `sim3` contains a categorical predictor and a continuous predictor. We can visualise it with a simple plot:
```{r}
ggplot(sim3, aes(x1, y)) +
geom_point(aes(colour = x2))
```
There are two possible models you could fit to this data:
```{r}
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
```
To visualise these models we need two new tricks:

1.  We have two predictors, so we need to give `expand()` two variables.
    It finds all the unique values of `x1` and `x2` and then generates all
    combinations.

1.  To generate predictions from both models simultaneously, we can use
    `gather_predictions()`, which adds each prediction as a row. The
    complement of `gather_predictions()` is `spread_predictions()`, which adds
    each prediction to a new column.
Together this gives us:
```{r}
grid <- sim3 %>%
expand(x1, x2) %>%
gather_predictions(mod1, mod2)
grid
```
We can visualise the results for both models on one plot using facetting:
```{r}
ggplot(sim3, aes(x1, y, colour = x2)) +
geom_point() +
geom_line(data = grid, aes(y = pred)) +
facet_wrap(~ model)
```
Note that the model that uses `+` has the same slope for each line, but different intercepts. The model that uses `*` has a different slope and intercept for each line.
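You can also see this in the coefficients; here's a quick sketch comparing the two fits:

```{r}
# mod1 estimates a single slope for x1; mod2 adds x1:x2 terms, so each
# level of x2 gets its own slope adjustment
coef(mod1)
coef(mod2)
```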
Which model is better for this data? We can take a look at the residuals. Here I've facetted by both model and `x2` because it makes it easier to see the pattern within each group.
```{r}
sim3 <- sim3 %>%
gather_residuals(mod1, mod2)
ggplot(sim3, aes(x1, resid, colour = x2)) +
geom_point() +
facet_grid(model ~ x2)
```
There is little obvious pattern in the residuals for `mod2`. The residuals for `mod1` show that the model has clearly missed some pattern in `b`, and, to a lesser extent, in `c` and `d`. You might wonder if there's a precise way to tell which of `mod1` or `mod2` is better. There is, but it requires a lot of mathematical background, and we don't really care. Here, we're interested in a qualitative assessment of whether or not the model has captured the pattern that we're interested in.
### Two continuous variables
As you move into three dimensions and higher, creating good visualisations gets harder and harder. But since we're using models to support visualisation of your data, it's not so bad: as you discover more patterns, you'll grow your model over time.
Let's take a look at the equivalent model for two continuous variables. Initially things proceed almost identically to the previous example:
```{r}
mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)
grid <- sim4 %>%
expand(
x1 = seq_range(x1, 5),
x2 = seq_range(x2, 5)
) %>%
gather_predictions(mod1, mod2)
grid
```
Note my use of `seq_range()` inside `expand()`. Instead of using every unique value of `x1` and `x2`, I'm going to use a regularly spaced grid of five values between the minimum and maximum values. It's probably not super important here, but it's a useful technique in general.
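Here's a small sketch of what `seq_range()` returns on its own:

```{r}
# Five evenly spaced values spanning the range of the input
seq_range(c(1.2, 9.8), n = 5)
```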
Next let's try and visualise that model. We have two continuous predictors, so you can imagine the model like a 3d surface. We could display that using `geom_tile()`:
```{r}
ggplot(grid, aes(x1, x2)) +
geom_tile(aes(fill = pred)) +
facet_wrap(~ model)
```
That doesn't suggest that the models are very different! But that's partly an illusion: our eyes and brains are not very good at accurately comparing shades of colour. Instead of looking at the surface from the top, we could look at it from either side, showing multiple slices:
```{r, asp = 1/2}
ggplot(grid, aes(x1, pred, colour = x2, group = x2)) +
geom_line() +
facet_wrap(~ model)
ggplot(grid, aes(x2, pred, colour = x1, group = x1)) +
geom_line() +
facet_wrap(~ model)
```
This shows you that interaction between two continuous variables works basically the same way as for a categorical and continuous variable. An interaction says that there's not a fixed offset: you need to consider both values of `x1` and `x2` simultaneously in order to predict `y`.
You can see that even with just two continuous variables, coming up with good visualisations is hard. But that's reasonable: you shouldn't expect it to be easy to understand how three or more variables simultaneously interact! But again, we're saved a little because we're using models for exploration, and you can gradually build up your model over time. The model doesn't have to be perfect, it just has to help you reveal a little more about your data.
I spent some time looking at the residuals to see if I could figure out whether `mod2` did better than `mod1`. I think it does, but it's pretty subtle. You'll have a chance to work on it in the exercises.
### Exercises
1. Using the basic principles, convert the formulas in the following two
models into functions. (Hint: start by converting the categorical variable
into 0-1 variables.)
```{r, eval = FALSE}
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
```
1. For `sim4`, which of `mod1` and `mod2` is better? I think `mod2` does a
    slightly better job at removing patterns, but it's pretty subtle. Can you
come up with a plot to support my claim?
## Missing values
Missing values obviously cannot convey any information about the relationship between the variables, so modelling functions drop any rows that contain them:
```{r}
df <- tibble::frame_data(
~x, ~y,
1, 2.2,
2, NA,
3, 3.5,
4, 8.3,
NA, 10
)
mod <- lm(y ~ x, data = df)
```
Unfortunately this is one of the rare cases in R where missing values go missing silently: compare the number of observations the model actually used (`nobs()`) with the number of rows in the data.
```{r}
nobs(mod)
nrow(df)
```
Note that this is not the default behaviour of `predict()`, which normally just drops rows with missing values when generating predictions. That's what this setting does:
```{r}
options(na.action = na.exclude)
```
I don't really understand why it's called `na.exclude` when it causes missing values to be included in the predictions, but there you have it.
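Here's a rough sketch of the effect, refitting the model from above with `na.action = na.exclude` supplied directly (the global option should behave the same way):

```{r}
# With na.exclude, predictions are padded with NA so that they line up
# with the rows of the original data
mod_ex <- lm(y ~ x, data = df, na.action = na.exclude)
predict(mod_ex)
```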
## Other model families
Here we've focussed on linear models, which are a fairly limited class of models (although they do include a first-order linear approximation of any more complicated model).
Some extensions of linear models are:
* Generalised linear models, e.g. `stats::glm()`. Linear models assume that
  the response is continuous and the errors have a normal distribution.
  Generalised linear models extend linear models to include non-continuous
  responses (e.g. binary data or counts); for example, you can use `glm()` to
  perform logistic regression (there's a sketch after this list). They work by
  defining a distance metric based on the statistical idea of likelihood.

* Generalised additive models, e.g. `mgcv::gam()`, extend generalised
  linear models to incorporate arbitrary smooth functions. That means you can
  write a formula like `y ~ s(x)` which becomes an equation like
  `y = f(x)`, and `gam()` will estimate what that function is (subject to some
  smoothness constraints to make the problem tractable).

* Penalised linear models, e.g. `glmnet::glmnet()`, add a penalty term to
  the distance which penalises complex models (as defined by the distance
  between the parameter vector and the origin). This tends to make
  models that generalise better to new datasets from the same population.
* Robust linear models, e.g. `MASS::rlm()`, tweak the distance to downweight
  points that are very far away. This makes them less sensitive to the presence
  of outliers, at the cost of being not quite as good when there are no
  outliers.

* Trees, e.g. `rpart::rpart()`, attack the problem in a completely different
  way to linear models. They fit a piece-wise constant model, splitting the
  data into progressively smaller and smaller pieces. Trees aren't terribly
  effective by themselves, but they are very powerful when used in aggregate
  by models like random forests (e.g. `randomForest::randomForest()`) or
  gradient boosting machines (e.g. `xgboost::xgboost`).
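As a rough sketch of the first bullet, here's logistic regression with `glm()` on some made-up binary data:

```{r}
# Simulate a binary response whose probability depends on x, then fit a
# logistic regression using the binomial family
sim_bin <- tibble(
  x = runif(100),
  y = rbinom(100, size = 1, prob = plogis(-1 + 2 * x))
)
mod_logit <- glm(y ~ x, family = binomial, data = sim_bin)
coef(mod_logit)
```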