More on model building

Including some needed material in model-basics
This commit is contained in:
hadley 2016-07-26 11:13:43 -05:00
parent 62824d9085
commit 88650992ea
4 changed files with 200 additions and 128 deletions


@ -10,6 +10,7 @@ URL: https://github.com/hadley/r4ds
Imports:
bookdown,
broom,
condvis,
dplyr,
gapminder,
ggplot2,


@ -299,30 +299,6 @@ tibble(education = seq(5, 25)) %>%
```
Other useful arguments to `seq_range()`:
* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
nice to the human eye. This is useful if you want to produce tables of
output:
```{r}
seq_range(c(0.0123, 0.923423), n = 5)
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variable has a long-tailed distribution and you want to focus on generating
values near the center:
```{r}
x <- rcauchy(100)
seq_range(x, n = 5)
seq_range(x, n = 5, trim = 0.10)
seq_range(x, n = 5, trim = 0.25)
seq_range(x, n = 5, trim = 0.50)
```
### Additive models


@ -475,7 +475,39 @@ grid <- sim4 %>%
grid
```
Note my use of `seq_range()` inside `expand()`. Instead of using every unique value of `x`, I'm going to use a regularly spaced grid of five values between the minimum and maximum numbers. It's probably not super important here, but it's a useful technique in general.
Note my use of `seq_range()` inside `expand()`. Instead of using every unique value of `x`, I'm going to use a regularly spaced grid of five values between the minimum and maximum numbers. It's probably not super important here, but it's a useful technique in general. There are three other useful arguments to `seq_range()`:
* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
nice to the human eye. This is useful if you want to produce tables of
output:
```{r}
seq_range(c(0.0123, 0.923423), n = 5)
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variable has a long-tailed distribution and you want to focus on generating
values near the center:
```{r}
x1 <- rcauchy(100)
seq_range(x1, n = 5)
seq_range(x1, n = 5, trim = 0.10)
seq_range(x1, n = 5, trim = 0.25)
seq_range(x1, n = 5, trim = 0.50)
```
* `expand = 0.1` is in some sense the opposite of `trim()`: it expands the
range by 10%.
```{r}
x2 <- c(0, 1)
seq_range(x2, n = 5)
seq_range(x2, n = 5, expand = 0.10)
seq_range(x2, n = 5, expand = 0.25)
seq_range(x2, n = 5, expand = 0.50)
```
Next let's try and visualise that model. We have two continuous predictors, so you can imagine the model like a 3d surface. We could display that using `geom_tile()`:
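Here's a rough sketch of what that might look like, assuming `grid` has already been augmented with a `pred` column via `add_predictions()`:
```{r}
# Sketch: view the prediction surface from above, with fill mapped to the
# predicted value (assumes `grid` carries a `pred` column from add_predictions())
ggplot(grid, aes(x1, x2)) +
  geom_tile(aes(fill = pred))
```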
@ -514,6 +546,97 @@ model_matrix(df, y ~ x^2 + x)
model_matrix(df, y ~ I(x^2) + x)
```
Transformations are useful because you can use them to approximate non-linear functions. If you've taken a calculus class, you may have heard of Taylor's theorem, which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a linear model to get arbitrarily close to a smooth function by fitting an equation like `y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x^3`. Typing that sequence by hand is tedious, so R provides a helper function: `poly()`:
```{r}
model_matrix(df, y ~ poly(x, 2))
```
However there's one major problem with using `poly()`: outside the range of the data, polynomials rapidly shoot off to positive or negative infinity. One safer alternative is to use the natural spline, `splines::ns()`.
```{r}
library(splines)
model_matrix(df, y ~ ns(x, 2))
```
Let's see what that looks like when we try and approximate a non-linear function:
```{r}
sim5 <- tibble(
  x = seq(0, 3.5 * pi, length = 50),
  y = 4 * sin(x) + rnorm(length(x))
)

ggplot(sim5, aes(x, y)) +
  geom_point()
```
I'm going to fit five models to this data.
```{r}
mod1 <- lm(y ~ ns(x, 1), data = sim5)
mod2 <- lm(y ~ ns(x, 2), data = sim5)
mod3 <- lm(y ~ ns(x, 3), data = sim5)
mod4 <- lm(y ~ ns(x, 4), data = sim5)
mod5 <- lm(y ~ ns(x, 5), data = sim5)
grid <- sim5 %>%
  expand(x = seq_range(x, n = 50, expand = 0.1)) %>%
  gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")

ggplot(sim5, aes(x, y)) +
  geom_point() +
  geom_line(data = grid, colour = "red") +
  facet_wrap(~ model)
```
Notice that the extrapolation outside the range of the data is clearly bad. This is the downside to approximating a function with a polynomial. But this is a very real problem with every model: the model can never tell you if the behaviour is true when you start extrapolating outside the range of the data that you have seen. You must rely on theory or science.
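To make that concrete, we can ask a few of the models for predictions well outside the observed range of `x` (roughly 0 to 11); the true function, `4 * sin(x)`, never leaves the range -4 to 4, but nothing constrains the extrapolated predictions to stay there:
```{r}
# A rough sketch: predictions far from the observed data
tibble(x = c(-5, 15, 20)) %>%
  gather_predictions(mod1, mod3, mod5)
```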
### Interpolation vs. extrapolation
So far, when we've visualised the predictions from a model, we've been careful to overlay them on the data. This is important because it helps make sure we're not extrapolating the model to data that is far away from what we've observed.
However, as you start working with large datasets, overlaying all the data becomes increasingly challenging, particularly as you'll often use a model to simplify away some of the complexities of the raw data! To make your life easier, you might be tempted to just display the predictions. This is dangerous because you might have accidentally generated predictions that are very far away from your data.
As a compromise, you can use the convenient `similarityweight()` function from the __condvis__ package by Mark O'Connell. You give it your prediction grid and your dataset, and it computes the similarity between each observation in your original dataset and the prediction grid. You can then display only the points that are close to your predictions. If no points are close, you know that your prediction grid is dangerous!
```{r}
sim6 <- tibble(
  x1 = rbeta(1000, 2, 5),
  x2 = rbeta(1000, 2, 5),
  y = 2 * x1 * x2 + x1 + 2 + rnorm(length(x1))
)
mod <- lm(y ~ x1 * x2, data = sim6)

grid <- sim6 %>%
  expand(
    x1 = seq_range(x1, 10),
    x2 = c(0, 0.5, 1, 1.5)
  ) %>%
  add_predictions(mod)

add_similarity <- function(data, grid, ..., .similarity = "sim") {
  # Only compare the variables that appear in both the data and the grid
  common <- intersect(names(data), names(grid))
  message("Using ", paste(common, collapse = ", "))

  # similarityweight() gives a similarity for each observation against each
  # grid point; keep each observation's highest similarity to any grid point
  # and store it in a new column
  sim_m <- condvis::similarityweight(grid, data[common], ...)
  sim <- apply(sim_m, 2, max)
  data[[.similarity]] <- sim
  data
}

sim6 %>%
  add_similarity(grid) %>%
  filter(sim > 0) %>%
  ggplot(aes(x1, y, group = x2)) +
  geom_point(aes(alpha = sim)) +
  geom_line(data = grid, aes(y = pred)) +
  scale_alpha(limit = c(0, NA))
```
### Exercises
1. What happens if you repeat the analysis of `sim2` using a model without
@ -562,6 +685,8 @@ nrow(df)
## Other model families
Pattern in variance, not just mean. Or other characteristics of distribution of residuals.
This chapter has focussed exclusively on the class of linear models, which assume a relationship of the form `y = a_1 * x1 + a_2 * x2 + ... + a_n * xn`. Linear models additionally assume that the residuals have a normal distribution, which we haven't talked about. There is a large set of model classes that extend the linear model in various interesting ways. Some of them are:
* __Generalised linear models__, e.g. `stats::glm()`. Linear models assume that


@ -146,9 +146,7 @@ diamonds2 %>%
## What affects the number of daily flights?
We're going to start by building a model to help us understand the number of flights per day that leave NYC. We're not going to end up with a fully realised model, but as you'll see, the steps along the way will help us better understand the data.
We'll start by using dplyr to generate the data of interest:
Let's explore the number of flights that leave NYC per day. We're not going to end up with a fully realised model, but as you'll see, the steps along the way will help us better understand the data. Let's get started by counting the number of flights per day and visualising it with ggplot2.
```{r}
daily <- flights %>%
@ -156,18 +154,16 @@ daily <- flights %>%
  group_by(date) %>%
  summarise(n = n())
daily
```
And then we'll plot it with ggplot2:
```{r}
ggplot(daily, aes(date, n)) +
  geom_line()
```
This is a really small dataset --- only 365 rows and 2 columns --- but as you'll see, there's a rich set of interesting variables buried in the date.
### Day of week
This is a small dataset, but there's a lot of pattern to explore. Understanding the long-term trend is challenging because there's a very strong day-of-week effect that dominates the subtler patterns:
Understanding the long-term trend is challenging because there's a very strong day-of-week effect that dominates the subtler patterns. Let's summarise the number of flights per day-of-week:
```{r}
daily <- daily %>%
@ -176,27 +172,33 @@ ggplot(daily, aes(wday, n)) +
  geom_boxplot()
```
There are fewer flights on weekends because a very large proportion of travel is for business. You might sometimes have to leave on Sunday for an early flight, but it's very rare that you'd leave on Saturday: you'd much rather be home with your family.
There are fewer flights on weekends because most travel is for business. The effect is particularly pronounced on Saturday: you might sometimes have to leave on Sunday for a Monday morning meeting, but it's very rare that you'd leave on Saturday as you'd much rather be at home with your family.
One way to remove this strong pattern is to fit a model that "explains" (i.e. attempts to predict) the day of week effect, and then look at the residuals. Another way of thinking about this is that we're capturing the day-of-week effect, moving it from the data into a model.
One way to remove this strong pattern is to use a model. First, we fit the model, and display its predictions overlaid on the original data:
```{r}
mod <- lm(n ~ wday, data = daily)
daily <- daily %>% add_residuals(mod)
daily %>%
  ggplot(aes(date, resid)) +
  geom_hline(yintercept = 0, size = 2, colour = "white") +
  geom_line()
daily %>%
grid <- daily %>%
  expand(wday) %>%
  add_predictions(mod) %>%
  ggplot(aes(wday, pred)) +
  geom_point()
  add_predictions(mod, "n")
ggplot(daily, aes(wday, n)) +
  geom_boxplot() +
  geom_point(data = grid, colour = "red", size = 4)
```
Note the change in the y-axis: now we are seeing the deviation from the expected number of flights, given the day of week. This plot is interesting because now that we've removed much of the large day-of-week effect, we can see some of the subtler patterns that remain:
Next we compute and visualise the residuals:
```{r}
daily <- daily %>% add_residuals(mod)
daily %>%
  ggplot(aes(date, resid)) +
  geom_ref_line(h = 0) +
  geom_line()
```
Note the change in the y-axis: now we are seeing the deviation from the expected number of flights, given the day of week. This plot is useful because now that we've removed much of the large day-of-week effect, we can see some of the subtler patterns that remain:
1. Our day of week adjustment seems to fail starting around June: you can
still see a strong regular pattern that our model hasn't removed. Drawing
@ -205,16 +207,13 @@ Note the change in the y-axis: now we are seeing the deviation from the expected
```{r}
ggplot(daily, aes(date, resid, colour = wday)) +
  geom_hline(yintercept = 0, size = 2, colour = "white") +
  geom_ref_line(h = 0) +
  geom_line()
```
The problem appears to be Saturdays: it seems like during summer there are
more flights on Saturdays than we expect, and during Fall there are fewer.
I suspect this is because of summer holidays: many people go on holiday
in the summer, and people don't mind travelling on Saturdays for vacation.
(This doesn't, however, explain why there are more Saturday flights in
spring than fall).
Our model fails to accurately predict the number of flights on Saturday:
during summer there are more flights than we expect, and during Fall there
are fewer. We'll see how we can do better in the next section.
1. There are some days with far fewer flights than expected:
@ -233,29 +232,36 @@ Note the change in the y-axis: now we are seeing the deviation from the expected
```{r}
daily %>%
  ggplot(aes(date, resid)) +
  geom_hline(yintercept = 0, size = 2, colour = "white") +
  geom_ref_line(h = 0) +
  geom_line(colour = "grey50") +
  geom_smooth(se = FALSE, span = 0.20)
```
There are fewer flights in January (and December), and more in summer
(May-Sep). We can't do much more with this trend than brainstorm possible
explanations because we only have a single year's worth of data.
(May-Sep). We can't do much with this pattern numerically, because we only
have a single year of data. But we can use our domain knowledge to
brainstorm potential explanations.
### Seasonal Saturday effect
We'll tackle the day of week effect first. Let's zoom in on Saturdays, going back to raw numbers:
Let's first tackle our failure to accurately predict the number of flights on Saturday. A good place to start is to go back to the raw numbers, focussing on Saturdays:
```{r}
daily %>%
  filter(wday == "Sat") %>%
  ggplot(aes(date, n)) +
  geom_point() +
  geom_line() +
  geom_point(alpha = 1/3) +
  scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b")
```
So it looks like summer holidays are from early June to late August. That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break is Jun 26 - Sep 9. Few families travel in the fall because of the big Thanksgiving and Christmas holidays. So let's add a "term" variable to attempt to control for that.
(I've used both points and lines to make it more clear what is data and what is interpolation.)
I suspect this pattern is caused by summer holidays: many people go on holiday in the summer, and people don't mind travelling on Saturdays for vacation. Looking at this plot, we might guess that summer holidays are from early June to late August. That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break in 2013 was Jun 26--Sep 9.
Why are there fewer Saturday flights in the Fall than in the Spring? I asked some American friends and they suggested that it's less common to plan family vacations during the Fall because of the big Thanksgiving and Christmas holidays. We can't tell if that's exactly the reason, but it seems like a plausible working hypothesis.
Let's create a "term" variable that roughly captures the three school terms, and check our work with a plot:
```{r}
term <- function(date) {
@ -292,12 +298,25 @@ mod1 <- lm(n ~ wday, data = daily)
mod2 <- lm(n ~ wday * term, data = daily)
daily %>%
  gather_residuals(mod1, mod2) %>%
  gather_residuals(without_term = mod1, with_term = mod2) %>%
  ggplot(aes(date, resid, colour = model)) +
  geom_line(alpha = 0.75)
```
That's because this model is basically calculating an average for each combination of wday and school term. (How many observations do we have for each day of week in each term?) We have a lot of big outliers, so they tend to drag the mean far away from the typical value. We can alleviate this problem by using a model that is robust to the effect of outliers: `rlm`. This greatly reduces the impact of the outliers on our estimates, and gives a result that does a good job of removing the day of week pattern:
We can see the problem by overlaying the predictions from the model on to the raw data:
```{r}
grid <- daily %>%
  expand(wday, term) %>%
  add_predictions(mod2, "n")

ggplot(daily, aes(wday, n)) +
  geom_boxplot() +
  geom_point(data = grid, colour = "red") +
  facet_wrap(~ term)
```
Our model is finding the _mean_ effect, but we have a lot of big outliers, so they tend to drag the mean far away from the typical value. We can alleviate this problem by using a model that is robust to the effect of outliers: `MASS::rlm()`. This greatly reduces the impact of the outliers on our estimates, and gives a model that does a good job of removing the day of week pattern:
```{r, warn = FALSE}
mod3 <- MASS::rlm(n ~ wday * term, data = daily)
@ -311,29 +330,9 @@ daily %>%
It's now much easier to see the long-term trend, and the positive and negative outliers.
It's very common to use residual plots when figuring out whether a model is OK, but it's easy to get the impression that there's just one type of residual plot you should do, when in fact there are infinitely many.
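For example, here are a couple of other ways we might look at the residuals from `mod3`: their absolute size by day of week, and their overall distribution.
```{r}
# A rough sketch: two alternative views of the residuals from mod3
daily %>%
  add_residuals(mod3) %>%
  ggplot(aes(wday, abs(resid))) +
  geom_boxplot()

daily %>%
  add_residuals(mod3) %>%
  ggplot(aes(resid)) +
  geom_freqpoly(binwidth = 10)
```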
### Time of year: an alternative approach
In the previous section we used our knowledge of the phenomenon to improve the model. An alternative to making our knowledge explicit in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in.
When you have a continuous variable in the model, rather than using the unique values that you've seen, it's often more useful to generate an evenly spaced grid. One convenient way to do this is with `modelr::seq_range()`, which takes a continuous variable, calculates its range, and then generates evenly spaced points between the minimum and maximum.
```{r, warn = FALSE}
mod <- MASS::rlm(n ~ wday * yday(date), data = daily)
grid <- daily %>%
  tidyr::expand(wday, date = seq_range(date, n = 13)) %>%
  add_predictions(mod)

ggplot(grid, aes(date, pred, colour = wday)) +
  geom_line() +
  geom_point()
```
(Why use `yday(date)` instead of `date`? That's saying we think that the pattern depends only on the day of the year, so we might expect it to be the same in other years. Again, since we only have a single year of data we can't test that hypothesis.)
We know that this pattern doesn't do a good job of capturing the variation in the data. There isn't a simple linear trend across the entire year, so instead we could use a natural spline to allow a smoothly varying trend across the year.
In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model. An alternative to making our knowledge explicit in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in. We know that a simple linear trend isn't adequate, so instead we could use a natural spline to allow a smoothly varying trend across the year:
```{r}
library(splines)
@ -347,16 +346,13 @@ daily %>%
geom_point()
```
In particular, we see the strong pattern in Saturdays that we identified when coming in the opposite direction. It's always a good sign when you see the same signal from multiple approaches. (But note our previous model was explanatory; this one is just predictive.)
How many degrees of freedom to use? Either pick manually to extract the shape of the data, or you can use one of the model assessment techniques in the following chapter to pick algorithmically. Here we're most interested in explanation, so picking by hand (with a little thought and plenty of scepticism) is typically fine.
### Public holidays
We see a strong pattern in the numbers of Saturday flights. This is reassuring, because we also saw that pattern in the raw data. It's a good sign when you see the same signal from multiple approaches.
How do you decide how many parameters to use for the spline? You can either pick by eye, or you could use automated techniques which you'll learn about in [model assessment]. For exploration, picking by eye to capture the most important patterns is fine.
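For example, we might eyeball a few different choices (the model names below are just for illustration):
```{r}
# A rough sketch: refit the model with splines of increasing flexibility
# and compare the fitted trends by eye
mod_ns2  <- MASS::rlm(n ~ wday * ns(date, 2), data = daily)
mod_ns5  <- MASS::rlm(n ~ wday * ns(date, 5), data = daily)
mod_ns10 <- MASS::rlm(n ~ wday * ns(date, 10), data = daily)

daily %>%
  gather_predictions(mod_ns2, mod_ns5, mod_ns10) %>%
  ggplot(aes(date, pred, colour = wday)) +
  geom_line() +
  facet_wrap(~ model, ncol = 1)
```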
### Computed variables
If you're experimenting with many models and many visualisations, it's a good idea to bundle the creation of variables up into a function so there's no chance of accidentally applying a different transformation in different places.
If you're experimenting with many models and many visualisations, it's a good idea to bundle the creation of variables up into a function so there's no chance of accidentally applying a different transformation in different places. For example, we could write:
```{r}
compute_vars <- function(data) {
@ -367,25 +363,14 @@ compute_vars <- function(data) {
}
```
Another option is to wrap it into the model formula:
Another option is to put the transformations directly in the model formula:
```{r}
wday2 <- function(x) wday(x, label = TRUE)
mod3 <- lm(n ~ wday2(date) * term(date), data = daily)
daily %>%
  expand(date) %>%
  add_predictions(mod3)
```
I think this is fine to do provided that you've carefully checked that the functions do what you think they do (i.e. with a visualisation). There are two disadvantages:
1. You may need to add the variables back in anyway if you want to use
them in a visualisation.
1. When looking at the coefficients, their values are longer and harder to
read. (But this is a general problem with the way that linear models report
categorical coefficients in R, not a specific problem with this case.)
Either approach is reasonable. Making the transformed variable explicit is useful if you want to check your work, or use them in a visualisation. But you can't easily use transformations (like splines) that return multiple columns. Including the transformations in the model function makes life a little easier when you're working with many different datasets because the model is self contained.
### Exercises
@ -409,34 +394,19 @@ I think this is fine to do provided that you've carefully checked that the funct
(for Saturdays), and public holidays. What do the residuals of
that model look like?
1. What happens if you fit a day of week effect that varies by month?
Why is this not very helpful?
1. What happens if you fit a day of week effect that varies by month
(i.e. `n ~ wday * month`)? Why is this not very helpful?
1. Above we made the hypothesis that people leaving on Sundays are more
likely to be business travellers who need to be somewhere on Monday.
Explore that hypothesis by seeing how it breaks down based on distance:
if it's true, you'd expect to see more Sunday flights to places that
1. What would you expect the model `n ~ wday + ns(date, 5)` to look like?
Knowing what you know about the data, why would you expect it to be
not particularly effective?
1. We hypothesised that people leaving on Sundays are more likely to be
business travellers who need to be somewhere on Monday. Explore that
hypothesis by seeing how it breaks down based on distance and time: if
it's true, you'd expect to see more Sunday evening flights to places that
are far away.
1. It's a little frustrating that Sunday and Saturday are on separate ends
of the plot. Write a small function to set the levels of the factor so
that the week starts on Monday.
1. Compare the predictions for each `wday` combined with `term` for the
`lm` and `rlm` models.
### Interpolation vs extrapolation
One danger with prediction plots is that it's easy to make predictions that are far away from the original data. This is dangerous because it's quite possible that the model (which is a simplification of reality) will no longer apply far away from observed values.
As the number of variables in your model grows, you run into the "curse of dimensionality": as the number of variables increases, the average distance between points increases. That means most of the space is very sparse, and you have to rely on strong assumptions.
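Here's a rough sketch of that effect: with points spread uniformly over the unit cube, the average distance from each point to its nearest neighbour grows as we add dimensions.
```{r}
# A rough sketch of the curse of dimensionality: the average distance to the
# nearest neighbour grows with the number of dimensions, even though every
# variable stays on the same 0-1 scale
mean_nn_dist <- function(d, n = 100) {
  x <- matrix(runif(n * d), ncol = d)
  dists <- as.matrix(dist(x))
  diag(dists) <- Inf          # ignore each point's distance to itself
  mean(apply(dists, 1, min))  # average distance to the closest other point
}

sapply(c(1, 2, 5, 10), mean_nn_dist)
```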
To help avoid this problem, it's good practice to include "nearby" observed data points in any prediction plot. These help you see if you're interpolating, making predictions "in between" existing data points, or extrapolating, making predictions about previously unobserved slices of the data.
One way to do this is to use `condvis::visualweight()`.
<https://cran.rstudio.com/web/packages/condvis/>
## Other types of pattern
Pattern in variance, not just mean. Standardised residuals.
of the plot. Write a small function to set the levels of the
factor so that the week starts on Monday.