Starting big rewrite of model basics

This commit is contained in:
hadley 2016-07-18 17:40:18 -05:00
parent 10fbd4ce9f
commit b0c6f898d5
3 changed files with 511 additions and 455 deletions

336
heights.Rmd Normal file
View File

@ -0,0 +1,336 @@
## Heights data
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?
Luckily, it is easy to measure someone's height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income, that's just a lucky accident.
A small sample of the full dataset is included in modelr:
```{r}
heights
```
As well as `height` and `income` there are some other variables that might affect someone's income: `age`, `sex`, `race`, years of `education`, and their score on the `afqt` (Armed Forces Qualification Test).
Now that you have the data, you can visualize the relationship between height and income. But what does the data say? How would you describe the relationship?
```{r warnings = FALSE}
ggplot(heights, aes(height, income)) +
geom_point()
```
First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above $180,331. In fact, there are a line of unusual values at exactly $180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.
```{r}
n <- nrow(heights)
heights <- heights %>% filter(income < 150000)
nrow(heights) / n
```
I'm going to record the original number of observations in `n`. We'll come back to this every now and then to make sure that we haven't throw out too much of our data.
Also, you can see that heights have been rounded to the nearest inch so using boxplots will make it easier to see the pattern. We'll also remove the very tall and very short people so we can focus on the most typically heights:
```{r}
heights <- heights %>% filter(between(height, 59, 78))
nrow(heights) / n
ggplot(heights, aes(height, income, group = height)) +
geom_boxplot()
```
(Throwing away data in the first pass at a model is perfectly acceptable: starting with a simple subset of a problem that you can easily solve is a good general strategy. But in a real analysis, once you've got the first simple model working, you really should come back and all look at the full dataset. Is removing the data still a good idea?)
You can see there seems to be a fairly weak relationship: as height increase the median wage also seems to increase. But how could we summarise that more quantitiatively?
## Linear models
One way is to use a linear model. A linear model is a very broad family of models: it encompasses all models that are a weighted sum of variables.
The formula specifies a family of models: for example, `income ~ height` describes the family of models specified by `x1 * income + x0`, where `x0` and `x1` are real numbers.
```{r}
income ~ height
```
We fit the model by supplying the family of models (the formula), and the data, to a model fitting function, `lm()`. `lm()` finds the single model in the family of models that is closest to the data:
```{r}
h <- lm(income ~ height, data = heights)
h
```
We can extract the coefficients of this fitted model and write down the model it specifies:
```{r}
coef(h)
```
This tells says the model is $`r coef(h)[1]` + `r coef(h)[2]` * height$. In other words, one inch increase of height associated with an increase of \$937 in income.
The definition that `lm()` uses for closeness is that it looks for a model that minimises the "root mean squared error".
`lm()` fits a straight line that describes the relationship between the variables in your formula. You can picture the result visually like this.
```{r}
ggplot(heights, aes(height, income)) +
geom_boxplot(aes(group = height)) +
geom_smooth(method = lm, se = FALSE)
```
`lm()` treats the variable(s) on the right-hand side of the formula as _explanatory variables_ that partially determine the value of the variable on the left-hand side of the formula, which is known as the _response variable_. In other words, it acts as if the _response variable_ is determined by a function of the _explanatory variables_. Linear regression is _linear_ because it finds the linear combination of the explanatory variables that best predict the response.
### Exercises
1. What variables in `heights` do you expect to be most highly correlated with
income? Use `cor()` plus `purrr::map_dbl()` to check your guesses.
1. Correlation only summarises the linear relationship between two continuous
variables. There are some famous drawbacks to the correlation. What
are they? Hint: google for Anscombe's quartet, read <https://xkcd.com/552/>.
### Categorical
Our model so far is extremely simple: it only uses one variable to try and predict income. We also know something else important: women tend to be shorter than men and tend to get paid less.
```{r}
ggplot(heights, aes(height, colour = sex)) +
geom_freqpoly(binwidth = 1)
ggplot(heights, aes(income, colour = sex)) +
geom_freqpoly(binwidth = 5000)
```
What happens if we also include `sex` in the model?
```{r}
h2 <- lm(income ~ height * sex, data = heights)
grid <- heights %>%
expand(height, sex) %>%
add_predictions(h2, "income")
ggplot(heights, aes(height, income)) +
geom_point() +
geom_line(data = grid) +
facet_wrap(~sex)
```
Need to commment about predictions for tall women and short men - there is not a lot of data there. Need to be particularly sceptical.
`*` vs `+`.
```{r}
h3 <- lm(income ~ height + sex, data = heights)
grid <- heights %>%
expand(height, sex) %>%
gather_predictions(h2, h3)
ggplot(grid, aes(height, pred, colour = sex)) +
geom_line() +
facet_wrap(~model)
```
### Continuous
There appears to be a relationship between a person's education and how poorly the model predicts their income. If we graph the model residuals against `education` above, we see that the more a person is educated, the worse the model underestimates their income:
But before we add a variable to our model, we need to do a little EDA + cleaning:
```{r}
ggplot(heights, aes(education)) + geom_bar()
heights_ed <- heights %>% filter(education >= 12)
nrow(heights) / n
```
We could improve the model by adding education:
```{r}
he1 <- lm(income ~ height + education, data = heights_ed)
he2 <- lm(income ~ height * education, data = heights_ed)
```
How can we visualise the results of this model? One way to think about it as a surface: we have a 2d grid of height and education, and point on that grid gets a predicted income.
```{r}
grid <- heights_ed %>%
expand(height, education) %>%
gather_predictions(he1, he2)
ggplot(grid, aes(height, education, fill = pred)) +
geom_raster() +
facet_wrap(~model)
```
It's easier to see what's going on in a line plot:
```{r}
ggplot(grid, aes(height, pred, group = education)) +
geom_line() +
facet_wrap(~model)
ggplot(grid, aes(education, pred, group = height)) +
geom_line() +
facet_wrap(~model)
```
One of the big advantages to `+` instead of `*` is that because the terms are independent we display them using two simple plots instead of one complex plot:
```{r}
heights_ed %>%
expand(
height = seq_range(height, 10),
education = mean(education, na.rm = TRUE)
) %>%
add_predictions(he1, "income") %>%
ggplot(aes(height, income)) +
geom_line()
heights_ed %>%
expand(
height = mean(height, na.rm = TRUE),
education = seq_range(education, 10)
) %>%
add_predictions(he1, "income") %>%
ggplot(aes(education, income)) +
geom_line()
```
The full interaction suggests that height matters less as education increases. But which model is "better"? We'll come back to that question later.
What happens if we add the data back in to the plot? Do you get more or less sceptical about the results from this model?
You can imagine that if you had a model with four continuous predictions all interacting, that it would be pretty complicated to understand what's going in the model! And certainly you don't have to - it's totally fine to use a model simply as a tool for predicting new values, and in the next chapters you'll learn some techniques to help evaluate such models without looking at them. However, I think the more you can connect your understand of the domain to the model, the more likely you are to detect potential problems before they occur. The goal is not to undertand every last nuance of the model, but instead to understand more than what you did previously.
condvis.
### Categorical
```{r}
s <- lm(income ~ sex, data = heights)
tidy(s)
```
Every level of the factor except one receives its own coefficient. The missing level acts as a baseline.
To change the baseline, create a new factor with a new levels attribute. R will use the first level in the levels attribute as the baseline.
```{r}
heights$sex <- factor(heights$sex, levels = c("male", "female"))
```
```{r}
hes <- lm(income ~ height + education + sex, data = heights)
tidy(hes)
```
```{r}
heights %>%
group_by(sex) %>%
do(glance(lm(income ~ height, data = .)))
```
```{r}
hes2 <- lm(income ~ height + education * sex, data = heights)
tidy(hes2)
```
### Splines
But what if the relationship between variables is not linear? For example, the relationship between income and education does not seem to be linear:
```{r}
ggplot(heights_ed, aes(education, income)) +
geom_boxplot(aes(group = education)) +
geom_smooth(se = FALSE)
```
One way to introduce non-linearity into our model is to use transformed variants of the predictors.
```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ education + I(education ^ 2) + I(education ^ 3), data = heights_ed)
heights_ed %>%
expand(education) %>%
gather_predictions(mod_e1, mod_e2) %>%
ggplot(aes(education, pred, colour = model)) +
geom_point() +
geom_line()
```
This is a bit clunky because we have to surround each transformation with `I()`. This is because the rules of model algebra are a little different to usual algebra. `x ^ 2` is equivalent to `x * x` which in the modelling algebra is equivalent to `x + x + x:x` which is the same as `x`. This is useful because `(x + y + z)^2` fit all all major terms and second order interactions of x, y, and z.
```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ poly(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ poly(education, 3), data = heights_ed)
heights_ed %>%
expand(education) %>%
gather_predictions(mod_e1, mod_e2, mod_e3) %>%
ggplot(aes(education, pred, colour = model)) +
geom_point() +
geom_line()
```
However: there's one major problem with using `poly()`: outside the range of the data, polynomials are going to rapidly shoot off to positive or negative infinity.
```{r}
tibble(education = seq(5, 25)) %>%
gather_predictions(mod_e1, mod_e2, mod_e3) %>%
ggplot(aes(education, pred, colour = model)) +
geom_line()
```
Splines avoid this problem by linearly interpolating outside the range of the data. This isn't great either, but it's a safer default when you don't know for sure what's going to happen.
```{r}
library(splines)
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ ns(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ ns(education, 3), data = heights_ed)
tibble(education = seq(5, 25)) %>%
gather_predictions(mod_e1, mod_e2, mod_e3) %>%
ggplot(aes(education, pred, colour = model)) +
geom_line()
```
Other useful arguments to `seq_range()`:
* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
nice to the human eye. This is useful if you want to produce tables of
output:
```{r}
seq_range(c(0.0123, 0.923423), n = 5)
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variables has an long tailed distribution and you want to focus on generating
values near the center:
```{r}
x <- rcauchy(100)
seq_range(x, n = 5)
seq_range(x, n = 5, trim = 0.10)
seq_range(x, n = 5, trim = 0.25)
seq_range(x, n = 5, trim = 0.50)
```
### Additive models
```{r, dev = "png"}
library(mgcv)
gam(income ~ s(education), data = heights)
ggplot(data = heights, mapping = aes(x = education, y = income)) +
geom_point() +
geom_smooth(method = gam, formula = y ~ s(x))
```

View File

@ -1,15 +1,31 @@
# Model basics
We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you undertand what your model is telling you.
## Introduction
The goal of a fitted model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
A model is a tool for making predictions. Goal of a model is to be simple and useful.
This is a hard problem because any fitted dataset is just the "best" (closest) model from a family of models. Just because it's the best model doesn't make it good. And it certainly doesn't imply that the model is true. But a model doesn't need to be true to be useful. You've probably heard George Box's famous aphorism:
SEGUE
> All models are worng, but some are useful.
This chapter of the book is unique because it only uses simulated datasets. These datasets are very simple, and not at all interesting, but they will help you understand the essence of modelling before you apply the same techniques to real data in the next chapter.
But you might not have read the fuller content.
SEGUE
A fitted model is a function: you give it a set of inputs (often called the predictors), and it returns a value (typically called the prediction). Fitting a model takes place in two steps:
1. You define a __family of models__ that you think will capture an
interesting pattern in your data. This class will look like a mathematical
formula that relates your predictors to the response. For a simple 1d
model, this might look like `y = a * x + b` or `y = a * x ^ b`.
1. You generate a __fitted model__ by finding the one model from the family
that is the closest to your data. This takes the generic model family
and makes it specific, like `y = 3 * x + 7` or `y = 9 * x ^ 2`.
It's important to understand that a fitted model is just the closest model from a family of models. That implies that you have the "best" model (according to some criteria); it doesn't imply that you have a good model and it certainly doesn't imply that the model is "true". George Box puts this well in his famous aphorism that you've probably heard before:
> All models are wrong, but some are useful.
It's worth reading the fuller context of the quote:
> Now it would be very remarkable if any system existing in the real world
> could be exactly represented by any simple model. However, cunningly chosen
@ -24,13 +40,11 @@ But you might not have read the fuller content.
> If "truth" is to be the "whole truth" the answer must be "No". The only
> question of interest is "Is the model illuminating and useful?".
In this chapter, we'll explore the basics of model fitting.
We are going to focus on predictive models, how you can use simple fitted models to help better understand your data. Many of the models will be motivated by plots: you'll use a model captures to strong signals in the data so you can focus on what remains. This is a different motivation from most introductions to modelling, but if you go on to more traditional coverage, you can apply these same ideas to help you understand what's going on.
The goal of a model is not to uncover Truth, but to discover a simple approximation that is still useful. This tension is at the heart of modelling: you want your model to be simpler than reality (so you can understand it more easily) while still generating good predictions. There is almost always a tradeoff between understandability and quality of predictions.
### Prerequisites
To access the functions and datasets that we will use in the chapter, load the following packages:
We need a couple of packages specifically designed for modelling, and all the packages you've used before for EDA. I also recommend setting some options that tweak the default linear model behaviour to be a little less surprising.
```{r setup, message = FALSE}
# Modelling functions
@ -41,11 +55,7 @@ library(broom)
library(ggplot2)
library(dplyr)
library(tidyr)
```
I also recommend setting the following options. These make base models a little less surprising.
```{r}
# Options that make your life easier
options(
contrasts = c("contr.treatment", "contr.treatment"),
@ -53,33 +63,61 @@ options(
)
```
## Basics
### Simulated datasets
I can be easier to learn about modelling in a simulated environment where we know the truth. For example, imagine we have this data:
(to be moved into modelr)
```{r}
df <- tibble(
# 1 con
sim1 <- tibble(
x = rep(1:10, each = 3),
y = 5 + x * 2 + rnorm(length(x), sd = 2)
)
ggplot(sim1) + geom_point(aes(x, y))
ggplot(df, aes(x, y)) +
# 1 cat
means <- c(a = 1, b = 3, c = 6, d = 2)
sim2 <- tibble(
x = rep(names(means), each = 10),
y = means[x] + rnorm(length(x))
)
ggplot(sim2) + geom_point(aes(x, y))
# 1 cat + 1 con
ints <- c(a = 1, b = 3, c = 6, d = 2)
sim3 <-
tidyr::crossing(x1 = 1:10, x2 = names(ints), rep = 1:3) %>%
mutate(y = ints[x2] + 0.5 * x1 + rnorm(length(x1)), sd = 2)
ggplot(sim3) + geom_point(aes(x1, y, colour = x2))
# 1 cat + 1 con with interaction
slopes <- c(a = -0.2, b = -0.4, c = -0.1, d = 0.2)
sim4 <-
tidyr::crossing(x1 = 1:10, x2 = names(ints), rep = 1:3) %>%
mutate(y = ints[x2] + slopes[x2] * x1 + rnorm(length(x1)), sd = 2)
ggplot(sim3) + geom_point(aes(x1, y, colour = x2))
# 2 con
sim5 <-
tidyr::crossing(x1 = 1:10, x2 = 1:10) %>%
mutate(
y = 2 * x1 - 3 * x2 + rnorm(length(x1), sd = 3)
)
ggplot(sim3) + geom_raster(aes(x1, x2, fill = y))
# splines
```
## A simple model
Lets take a look at the simulated dataset `sim1`. It contains two continuous variables `x` and `y` which are related in a fairly straightforward way.
```{r}
ggplot(sim1, aes(x, y)) +
geom_point()
```
You already know how to fit a straight line to that dataset:
```{r}
ggplot(df, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
```
But what is actually happening here?
Model class: linear. Model family: `y = a + b * x`.
There are lots of possible models in that family. Here are a few:
You can see a strong pattern in the data and we can use a model to capture it. This data looks like it comes from a linear famiy, i.e. `y = a * x + b`. Let's start by randomly generating a few models from that family and overlaying them on the data:
```{r}
models <- tibble(
@ -87,169 +125,110 @@ models <- tibble(
b = runif(250, -5, 5)
)
ggplot(df, aes(x, y)) +
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = a, slope = b), data = models, alpha = 1/4) +
geom_point()
```
There are an infinite number of possible models. But some of them are obviously bad. How can we judge a good model? It seems like it must have something to do with distance: a good model should be close to the data points.
There are an infinite number of possible models, but some are obviously better than others. Intuitively, it seems reasonable that a good model should generate a line that's close to the data. How can we make that intuition concrete?
One common measurement of distance is Euclidean distance. For each, we could compute the Euclidean distance between each data point and the closest point on the model. We have multiple points so we could take the average.
First, we need a way to generate the predicted y values given a model and some data:
```{r}
distance <- function(a, b) {
dist_one <- function(a, b) sqrt(mean((df$y - (a + b * df$x)) ^ 2))
purrr::map2_dbl(a, b, dist_one)
make_prediction <- function(mod, data) {
data$x * mod$b + mod$a
}
models <- models %>%
mutate(dist = distance(a, b)) %>%
arrange(dist)
ggplot(df, aes(x, y)) +
geom_abline(aes(intercept = a, slope = b, alpha = dist), data = filter(models, dist < 10)) +
geom_point() +
scale_alpha_continuous(trans = "reverse")
ggplot(models, aes(a, b)) +
geom_point(aes(colour = dist)) +
scale_colour_continuous(trans = "reverse")
make_prediction(list(a = 1, b = 3), sim1)
```
Obviously trying lots of random models is unlikely to find us a good model quickly. Instead, lets try to be systematic:
Next, we need some way to measure the distance between the prediction and the actual actual values. A reasonable place to start is Euclidean distance
```{r}
mu <- c(5, 2)
sd <- c(0.9, .14)
measure_distance <- function(mod, data) {
diff <- data$y - make_prediction(mod, data)
sqrt(mean(diff ^ 2))
}
measure_distance(list(a = 1, b = 3), sim1)
```
Now we can use purrrr to compure the distance for all the models defined above.
```{r}
sim1_dist <- function(a, b) {
mod <- list(a = a, b = b)
measure_distance(mod, sim1)
}
models <- models %>% mutate(dist = purrr::map2_dbl(a, b, sim1_dist))
models
```
And then overlay only the 10 best models on to the data. I've coloured the points by `-dist`: this is an easy way to make sure that the best models (i.e. the ones with the smallest distance) are coloured brightly.
```{r}
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = a, slope = b, colour = -dist),
data = filter(models, rank(dist) <= 10)
)
```
Another way of looking at this is to look at the 2d grid formed by the values of `a` and `b` that we sampled:
```{r}
ggplot(models, aes(a, b)) +
geom_point(aes(colour = -dist))
```
Instead of trying lots of random models we could be more systematic and generate an evenly spaced grid of points, a so called grid search:
```{r}
grid <- expand.grid(
a = mu[1] + seq(-5, 5, length = 25) * sd[1],
b = mu[2] + seq(-5, 5, length = 25) * sd[2]
a = 5 + seq(-5, 5, length = 25) * 1,
b = 2 + seq(-5, 5, length = 25) * 0.15
) %>%
mutate(dist = distance(a, b))
best_10 <- grid %>% filter(rank(dist) < 10)
best_10
mutate(dist = purrr::map2_dbl(a, b, sim1_dist))
grid %>%
ggplot(aes(a, b, fill = dist)) +
geom_raster() +
geom_point(data = best_10) +
scale_fill_continuous(trans = "reverse")
ggplot(df, aes(x, y)) +
geom_abline(aes(intercept = a, slope = b), data = best_10, alpha = 1/3) +
geom_point()
ggplot(aes(a, b)) +
geom_point(data = filter(grid, rank(dist) < 10), size = 4, colour = "red") +
geom_point(aes(colour = -dist))
```
That's a called a grid search. Or we could do something even more complex: a Newton-Rhapson search. Here you effectively ski down the steepest slope and until you hit the bottom.
All those models look pretty good:
```{r}
nlm(function(x) distance(x[1], x[2]), c(0, 0))
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = a, slope = b, colour = -dist),
data = filter(grid, rank(dist) <= 10)
)
```
However, generally, rather than doing it yourself, you'll rely on a model function to do so. Here we're using a linear model, so we can do:
You could imagine iteratively making the grid finer and finer until you narrowed in on the best model. A more efficient approach is to use a numerical minimisation tool called a Newton-Rhapson search. The intuition of Newton-Rhapson is pretty simple: you pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and then repeat.
```{r}
summary(lm(y ~ x, data = df))
best <- optim(c(0, 0), function(x) sim1_dist(x[1], x[2]))
best$par
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(intercept = best$par[1], slope = best$par[2])
```
You can that basic approach for any family of model that you can write an equation for. However, the specfic model we fit is a part of a broader class called linear models. We can use the `lm()` function to solve this specific problem. `lm()` has a special way to specify the model family: a formula like `y ~ x` which `lm()` translates to `y = a * x + b`. We can fit the model and look at the output:
```{r}
sim1_mod <- lm(y ~ x, data = sim1)
summary(sim1_mod)
```
Generally, model functions will do something more sophisticated than a gradient search: for linear models based on linear algebra. Using connection between geometry, calculus, and linear algebra: the closest linear model to the data can always be found by constructing and then inverting a matrix.
## Heights data
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?
Luckily, it is easy to measure someone's height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income, that's just a lucky accident.
A small sample of the full dataset is included in modelr:
```{r}
heights
```
As well as `height` and `income` there are some other variables that might affect someone's income: `age`, `sex`, `race`, years of `education`, and their score on the `afqt` (Armed Forces Qualification Test).
Now that you have the data, you can visualize the relationship between height and income. But what does the data say? How would you describe the relationship?
```{r warnings = FALSE}
ggplot(heights, aes(height, income)) +
geom_point()
```
First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above $180,331. In fact, there are a line of unusual values at exactly $180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.
```{r}
n <- nrow(heights)
heights <- heights %>% filter(income < 150000)
nrow(heights) / n
```
I'm going to record the original number of observations in `n`. We'll come back to this every now and then to make sure that we haven't throw out too much of our data.
Also, you can see that heights have been rounded to the nearest inch so using boxplots will make it easier to see the pattern. We'll also remove the very tall and very short people so we can focus on the most typically heights:
```{r}
heights <- heights %>% filter(between(height, 59, 78))
nrow(heights) / n
ggplot(heights, aes(height, income, group = height)) +
geom_boxplot()
```
(Throwing away data in the first pass at a model is perfectly acceptable: starting with a simple subset of a problem that you can easily solve is a good general strategy. But in a real analysis, once you've got the first simple model working, you really should come back and all look at the full dataset. Is removing the data still a good idea?)
You can see there seems to be a fairly weak relationship: as height increase the median wage also seems to increase. But how could we summarise that more quantitiatively?
## Linear models
One way is to use a linear model. A linear model is a very broad family of models: it encompasses all models that are a weighted sum of variables.
The formula specifies a family of models: for example, `income ~ height` describes the family of models specified by `x1 * income + x0`, where `x0` and `x1` are real numbers.
```{r}
income ~ height
```
We fit the model by supplying the family of models (the formula), and the data, to a model fitting function, `lm()`. `lm()` finds the single model in the family of models that is closest to the data:
```{r}
h <- lm(income ~ height, data = heights)
h
```
We can extract the coefficients of this fitted model and write down the model it specifies:
```{r}
coef(h)
```
This tells says the model is $`r coef(h)[1]` + `r coef(h)[2]` * height$. In other words, one inch increase of height associated with an increase of \$937 in income.
The definition that `lm()` uses for closeness is that it looks for a model that minimises the "root mean squared error".
`lm()` fits a straight line that describes the relationship between the variables in your formula. You can picture the result visually like this.
```{r}
ggplot(heights, aes(height, income)) +
geom_boxplot(aes(group = height)) +
geom_smooth(method = lm, se = FALSE)
```
`lm()` treats the variable(s) on the right-hand side of the formula as _explanatory variables_ that partially determine the value of the variable on the left-hand side of the formula, which is known as the _response variable_. In other words, it acts as if the _response variable_ is determined by a function of the _explanatory variables_. Linear regression is _linear_ because it finds the linear combination of the explanatory variables that best predict the response.
### Exercises
1. What variables in `heights` do you expect to be most highly correlated with
income? Use `cor()` plus `purrr::map_dbl()` to check your guesses.
1. Correlation only summarises the linear relationship between two continuous
variables. There are some famous drawbacks to the correlation. What
are they? Hint: google for Anscombe's quartet, read <https://xkcd.com/552/>.
## Understanding the model
For simple models, like this one, you can figure out what the model says about the data by carefully studying the coefficients. If you ever take a formal statistics course on modelling, you'll spend a lot of time doing that. Here, however, we're going to take a different tack. In this book, we're going to focus on understanding a model by looking at its predictions. This has a big advantage: every type of model makes predictions (otherwise what use would it be?) so we can use the same set of techniques to understand simple linear models or complex random forrests. We'll see that advantage later on when we explore some other families of models.
1. Make predictions across a uniform grid to understand the "shape" of
@ -266,7 +245,7 @@ For simple models, like this one, you can figure out what the model says about t
To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies. The easiest way to do that is to use `tidyr::expand()`. It's first argument is a data frame, and for each subsequent argument it finds the unique variables and then generates all combinations:
```{r}
grid <- heights %>% expand(height)
grid <- sim1 %>% expand(x)
grid
```
@ -275,70 +254,57 @@ grid
Next we add predicitons. We'll use `modelr::add_predictions()` which works in exactly the same way as `add_residuals()`, but just compute predictions (so doesn't need a data frame that contains the response variable:)
```{r}
grid <- grid %>% add_predictions(h, "income")
grid <- grid %>% add_predictions(sim1_mod, "y")
grid
```
And then we plot the predictions. Here the choice of plots is pretty obvious because we only havce two variables. In general, however, figuring out how to the visualise the predictions can be quite challenging, and you'll often need to try a few alternatives before you get the most useful plot. For more complex models, you're likely to need multiple plots, not just one.
And then we plot the predictions. Here the choice of plots is pretty obvious because we only havee two variables. In general, however, figuring out how to the visualise the predictions can be quite challenging, and you'll often need to try a few alternatives before you get the most useful plot. For more complex models, you're likely to need multiple plots, not just one.
```{r}
ggplot(heights, aes(height, income)) +
geom_boxplot(aes(group = height)) +
ggplot(sim1, aes(x, y)) +
geom_point() +
geom_line(data = grid, colour = "red", size = 1)
```
This is exactly what `geom_smooth()` does for you behind the scenes!
### Residuals
The flip-side of predictions are residuals. The predictions tell you what the model is doing; the residuals tell you what the model is missing. We can compute residuals with `add_residuals()`. Note that we computing residuals, you'll use the original dataset, not a manufactured grid. Otherwise where would you get the value of the response?
The flip-side of predictions are residuals. The predictions tell you what the model is doing; the residuals tell you what the model is missing. The residuals are just the distances between the observed and predicted values that we computed above!
We can compute residuals with `add_residuals()`. Note that we computing residuals, you'll use the original dataset, not a manufactured grid. Otherwise where would you get the value of the response?
```{r}
heights <- heights %>% add_residuals(h)
sim1 <- sim1 %>% add_residuals(sim1_mod)
```
There are a few different ways to understand what the residuals tell us about the model. One way is to simply draw a frequency polygon to help us understand the spread of the residuals:
```{r}
ggplot(heights, aes(resid)) +
geom_freqpoly(binwidth = 2000)
ggplot(sim1, aes(resid)) +
geom_freqpoly(binwidth = 0.5)
```
(I prefer the frequency polygon over the histogram here because it makes it easier to display other categorical variables on the same plot.)
Here you can see that the range of the residuals is quite large. (Note that by the formal definiton of the linear model, the mean of the residuals will always be zero).
For many problems, the sign of the residual (i.e. whether the prediction is too high or too low) isn't important, and you might just want to focus on the magnitude of the residuals. You can do that by plotting the absolute value:
```{r}
ggplot(heights, aes(abs(resid))) +
geom_freqpoly(binwidth = 2000, boundary = 0)
ggplot(sim1, aes(abs(resid))) +
geom_freqpoly(binwidth = 0.25, boundary = 0)
```
You can also explore how the residuals vary with other variables in the data:
```{r}
ggplot(heights, aes(height, resid)) + geom_point()
```
Iterative plotting the residuals instead of the original response leads to a natual way of building up a complex model in simple steps, which we'll explore in detail in the next chapter.
For real data, you'll also explore explore how the residuals vary with other variables in the data. You'll see that in the next chapter.
### Exercises
1. In the plot of the absolute residuals, there's a bin smaller than
zero. What does this bin represent, and why is it necessary?
1. Explore how the distribution of residuals varies with the sex and
race. What makes understanding the residuals split up by race
particularly challenging?
1. P-values are an important part of model interpretation, particularly in
science, that we're not going to talk much about. <https://xkcd.com/882/>
1. It's often useful to recreate your initial EDA plots using residuals
instead of the original missing values. How does visualising `resid`
instead of `height` change your understanding of the heights data?
## Multiple predictors
## Other predictors
In most cases, like this one, a single variable is not enough to generate a useful model. Instead we need multiple variables in the model, which you can include with either `+` or `*` in the modelling formula.
@ -351,45 +317,10 @@ The distinction between `+` and `*` is important:
We'll explore what this means in the sections below, as we add more categorical and continuous variables to the dataset.
### Categorical
### Missing values
Our model so far is extremely simple: it only uses one variable to try and predict income. We also know something else important: women tend to be shorter than men and tend to get paid less.
```{r}
ggplot(heights, aes(height, colour = sex)) +
geom_freqpoly(binwidth = 1)
ggplot(heights, aes(income, colour = sex)) +
geom_freqpoly(binwidth = 5000)
```
What happens if we also include `sex` in the model?
```{r}
h2 <- lm(income ~ height * sex, data = heights)
grid <- heights %>%
expand(height, sex) %>%
add_predictions(h2, "income")
ggplot(heights, aes(height, income)) +
geom_point() +
geom_line(data = grid) +
facet_wrap(~sex)
```
Need to commment about predictions for tall women and short men - there is not a lot of data there. Need to be particularly sceptical.
`*` vs `+`.
```{r}
h3 <- lm(income ~ height + sex, data = heights)
grid <- heights %>%
expand(height, sex) %>%
gather_predictions(h2, h3)
ggplot(grid, aes(height, pred, colour = sex)) +
geom_line() +
facet_wrap(~model)
```
### Categorical data
#### Factors
@ -430,34 +361,6 @@ as.numeric(as.character(num_fac))
Add categorical variables to a model in the same way that you would add continuous variables.
```{r}
s <- lm(income ~ sex, data = heights)
tidy(s)
```
Every level of the factor except one receives its own coefficient. The missing level acts as a baseline.
To change the baseline, create a new factor with a new levels attribute. R will use the first level in the levels attribute as the baseline.
```{r}
heights$sex <- factor(heights$sex, levels = c("male", "female"))
```
```{r}
hes <- lm(income ~ height + education + sex, data = heights)
tidy(hes)
```
```{r}
heights %>%
group_by(sex) %>%
do(glance(lm(income ~ height, data = .)))
```
```{r}
hes2 <- lm(income ~ height + education * sex, data = heights)
tidy(hes2)
```
### Nested variables
@ -481,164 +384,6 @@ The student id only makes sense in the context of the school: it doesn't make se
students %>% expand(nesting(school_id, student_id))
```
### Continuous
There appears to be a relationship between a person's education and how poorly the model predicts their income. If we graph the model residuals against `education` above, we see that the more a person is educated, the worse the model underestimates their income:
But before we add a variable to our model, we need to do a little EDA + cleaning:
```{r}
ggplot(heights, aes(education)) + geom_bar()
heights_ed <- heights %>% filter(education >= 12)
nrow(heights) / n
```
We could improve the model by adding education:
```{r}
he1 <- lm(income ~ height + education, data = heights_ed)
he2 <- lm(income ~ height * education, data = heights_ed)
```
How can we visualise the results of this model? One way to think about it as a surface: we have a 2d grid of height and education, and point on that grid gets a predicted income.
```{r}
grid <- heights_ed %>%
expand(height, education) %>%
gather_predictions(he1, he2)
ggplot(grid, aes(height, education, fill = pred)) +
geom_raster() +
facet_wrap(~model)
```
It's easier to see what's going on in a line plot:
```{r}
ggplot(grid, aes(height, pred, group = education)) +
geom_line() +
facet_wrap(~model)
ggplot(grid, aes(education, pred, group = height)) +
geom_line() +
facet_wrap(~model)
```
One of the big advantages to `+` instead of `*` is that because the terms are independent we display them using two simple plots instead of one complex plot:
```{r}
heights_ed %>%
expand(
height = seq_range(height, 10),
education = mean(education, na.rm = TRUE)
) %>%
add_predictions(he1, "income") %>%
ggplot(aes(height, income)) +
geom_line()
heights_ed %>%
expand(
height = mean(height, na.rm = TRUE),
education = seq_range(education, 10)
) %>%
add_predictions(he1, "income") %>%
ggplot(aes(education, income)) +
geom_line()
```
The full interaction suggests that height matters less as education increases. But which model is "better"? We'll come back to that question later.
What happens if we add the data back in to the plot? Do you get more or less sceptical about the results from this model?
You can imagine that if you had a model with four continuous predictions all interacting, that it would be pretty complicated to understand what's going in the model! And certainly you don't have to - it's totally fine to use a model simply as a tool for predicting new values, and in the next chapters you'll learn some techniques to help evaluate such models without looking at them. However, I think the more you can connect your understand of the domain to the model, the more likely you are to detect potential problems before they occur. The goal is not to undertand every last nuance of the model, but instead to understand more than what you did previously.
condvis.
### Splines
But what if the relationship between variables is not linear? For example, the relationship between income and education does not seem to be linear:
```{r}
ggplot(heights_ed, aes(education, income)) +
geom_boxplot(aes(group = education)) +
geom_smooth(se = FALSE)
```
One way to introduce non-linearity into our model is to use transformed variants of the predictors.
```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ education + I(education ^ 2) + I(education ^ 3), data = heights_ed)
heights_ed %>%
expand(education) %>%
gather_predictions(mod_e1, mod_e2) %>%
ggplot(aes(education, pred, colour = model)) +
geom_point() +
geom_line()
```
This is a bit clunky because we have to surround each transformation with `I()`. This is because the rules of model algebra are a little different to usual algebra. `x ^ 2` is equivalent to `x * x` which in the modelling algebra is equivalent to `x + x + x:x` which is the same as `x`. This is useful because `(x + y + z)^2` fit all all major terms and second order interactions of x, y, and z.
```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ poly(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ poly(education, 3), data = heights_ed)
heights_ed %>%
expand(education) %>%
gather_predictions(mod_e1, mod_e2, mod_e3) %>%
ggplot(aes(education, pred, colour = model)) +
geom_point() +
geom_line()
```
However: there's one major problem with using `poly()`: outside the range of the data, polynomials are going to rapidly shoot off to positive or negative infinity.
```{r}
tibble(education = seq(5, 25)) %>%
gather_predictions(mod_e1, mod_e2, mod_e3) %>%
ggplot(aes(education, pred, colour = model)) +
geom_line()
```
Splines avoid this problem by linearly interpolating outside the range of the data. This isn't great either, but it's a safer default when you don't know for sure what's going to happen.
```{r}
library(splines)
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ ns(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ ns(education, 3), data = heights_ed)
tibble(education = seq(5, 25)) %>%
gather_predictions(mod_e1, mod_e2, mod_e3) %>%
ggplot(aes(education, pred, colour = model)) +
geom_line()
```
Other useful arguments to `seq_range()`:
* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
nice to the human eye. This is useful if you want to produce tables of
output:
```{r}
seq_range(c(0.0123, 0.923423), n = 5)
seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variables has an long tailed distribution and you want to focus on generating
values near the center:
```{r}
x <- rcauchy(100)
seq_range(x, n = 5)
seq_range(x, n = 5, trim = 0.10)
seq_range(x, n = 5, trim = 0.25)
seq_range(x, n = 5, trim = 0.50)
```
### Interpolation vs extrapolation
@ -656,17 +401,7 @@ One way to do this is to use `condvis::visualweight()`.
### Transformations
```{r, dev = "png"}
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
geom_point()
```
```{r}
lm(log(price) ~ log(carat), data = diamonds)
# visualize model line
```
### Genearlised linear models
@ -676,11 +411,6 @@ The most common link function is the logit function, which transforms a bivariat
Use `glm()` to perform logistic regression in R.
```{r}
she <- glm(sex ~ height + education, family = binomial(link = "logit"), data = heights)
tidy(she)
```
## Other model families
### Robust models
@ -689,21 +419,12 @@ Iteratively re-fit the model down-weighting outlying points (points with high re
### Additive models
```{r, dev = "png"}
library(mgcv)
gam(income ~ s(education), data = heights)
ggplot(data = heights, mapping = aes(x = education, y = income)) +
geom_point() +
geom_smooth(method = gam, formula = y ~ s(x))
```
```{r eval = FALSE}
# Linear z
gam(y ~ s(x) + z, data = df)
gam(y ~ s(x) + z, data = sim1)
# Smooth x and smooth z
gam(y ~ s(x) + s(z), data = df)
gam(y ~ s(x) + s(z), data = sim1)
# Smooth surface of x and z
# (a smooth function that takes both x and z)
@ -713,16 +434,3 @@ gam(y ~ s(x, z), data = df)
### Random forrests
## Summary
We've avoided two things in this chapter that are usually conflated with models: hypothesis testing and predictive analysis.
There are other types of modeling algorithms; each provides a valid description of the data.
Which description will be best? Does the relationship have a known form? Does the data have a known structure? Are you going to attempt hypothesis testing that imposes its own constraints?
In the course of modelling, you'll often discover data quality problems. Maybe a missing value is recorded as 999. Whenever you discover a problem like this, you'll need to review an update your import scripts. You'll often discover a problem with one variable, but you'll need to think about it for all variables. This is often frustrating, but it's typical.

View File

@ -14,6 +14,18 @@ In the previous chapter you learned how some basic models worked, and learned so
* Regression modelling strategies.
We are going to focus on predictive models, how you can use simple fitted models to help better understand your data. Many of the models will be motivated by plots: you'll use a model captures to strong signals in the data so you can focus on what remains. This is a different motivation from most introductions to modelling, but if you go on to more traditional coverage, you can apply these same ideas to help you understand what's going on.
We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you undertand what your model is telling you.
In the course of modelling, you'll often discover data quality problems. Maybe a missing value is recorded as 999. Whenever you discover a problem like this, you'll need to review an update your import scripts. You'll often discover a problem with one variable, but you'll need to think about it for all variables. This is often frustrating, but it's typical.
The way we're going to work is to subtract patterns from the data, while adding them to the model. The goal is to transition from implicit knowledge in the data and your head to explicit knowledge in a quantitative model. This makes it easier to apply to new domains, and easier for others to use.
If you had a "perfect" model the residuals would be perfectly independent noise. But "perfect" is not always what you strive for: sometimes you actually want a model that leaves some signal on the table because you want a model that is simpler, faster, or easier to understand.