## Heights data
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?
Luckily, it is easy to measure someone's height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income; that's just a lucky accident.
A small sample of the full dataset is included in modelr:
```{r}
heights
```
As well as `height` and `income` there are some other variables that might affect someone's income: `age`, `sex`, `race`, years of `education`, and their score on the `afqt` (Armed Forces Qualification Test).
Now that you have the data, you can visualise the relationship between height and income. But what does the data say? How would you describe the relationship?
```{r, warning = FALSE}
ggplot(heights, aes(height, income)) +
  geom_point()
```
First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above \$180,331. In fact, there is a line of unusual values at exactly \$180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.
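You can check how many observations sit at that censored maximum (a quick sketch):
```{r}
heights %>% count(income == max(income, na.rm = TRUE))
```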
```{r}
n <- nrow(heights)
heights <- heights %>% filter(income < 150000)
nrow(heights) / n
```
I've recorded the original number of observations in `n`. We'll come back to this every now and then to make sure that we haven't thrown out too much of our data.
Also, you can see that heights have been rounded to the nearest inch, so using boxplots will make it easier to see the pattern. We'll also remove the very tall and very short people so we can focus on the most typical heights:
```{r}
heights <- heights %>% filter(between(height, 59, 78))
nrow(heights) / n
ggplot(heights, aes(height, income, group = height)) +
  geom_boxplot()
```
(Throwing away data in the first pass at a model is perfectly acceptable: starting with a simple subset of a problem that you can easily solve is a good general strategy. But in a real analysis, once you've got the first simple model working, you really should come back and look at the full dataset again. Is removing the data still a good idea?)
You can see there seems to be a fairly weak relationship: as height increases, the median wage also seems to increase. But how could we summarise that more quantitatively?
## Linear models
One way is to use a linear model. Linear models are a very broad family of models: they encompass all models that are a weighted sum of variables.
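In other words, every model in the family can be written as an equation of the form (using the same naming convention as below, where the $x_i$ are weights and $a_1, \ldots, a_n$ are variables):
$$y = x_0 + x_1 \cdot a_1 + x_2 \cdot a_2 + \cdots + x_n \cdot a_n$$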
The formula specifies a family of models: for example, `income ~ height` describes the family of models specified by `x0 + x1 * height`, where `x0` and `x1` are real numbers.
```{r}
income ~ height
```
We fit the model by supplying the family of models (the formula), and the data, to a model fitting function, `lm()`. `lm()` finds the single model in the family of models that is closest to the data:
```{r}
h <- lm(income ~ height, data = heights)
h
```
We can extract the coefficients of this fitted model and write down the model it specifies:
```{r}
coef(h)
```
This tells us the model is $`r coef(h)[1]` + `r coef(h)[2]` * height$. In other words, a one inch increase in height is associated with an increase of about \$937 in income.
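To see what that coefficient means in practice, compare the predictions for two heights one inch apart (a quick sketch; 66 and 67 inches are arbitrary choices):
```{r}
# The predicted incomes differ by exactly the height coefficient
diff(predict(h, tibble(height = c(66, 67))))
```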
The definition that `lm()` uses for closeness is that it looks for a model that minimises the "root mean squared error".
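We can compute that quantity by hand from the residuals of the fitted model (a quick sketch):
```{r}
# Root mean squared error of the fitted model
sqrt(mean(residuals(h) ^ 2))
```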
`lm()` fits a straight line that describes the relationship between the variables in your formula. You can picture the result visually like this.
```{r}
ggplot(heights, aes(height, income)) +
  geom_boxplot(aes(group = height)) +
  geom_smooth(method = lm, se = FALSE)
```
`lm()` treats the variable(s) on the right-hand side of the formula as _explanatory variables_ that partially determine the value of the variable on the left-hand side of the formula, which is known as the _response variable_. In other words, it acts as if the _response variable_ is determined by a function of the _explanatory variables_. Linear regression is _linear_ because it finds the linear combination of the explanatory variables that best predicts the response.
### Exercises
1. What variables in `heights` do you expect to be most highly correlated with
   income? Use `cor()` plus `purrr::map_dbl()` to check your guesses.
1. Correlation only summarises the linear relationship between two continuous
   variables. There are some famous drawbacks to correlation. What are they?
   Hint: google Anscombe's quartet and read <https://xkcd.com/552/>.
### Categorical
Our model so far is extremely simple: it only uses one variable to try and predict income. We also know something else important: women tend to be shorter than men and tend to get paid less.
```{r}
ggplot(heights, aes(height, colour = sex)) +
  geom_freqpoly(binwidth = 1)

ggplot(heights, aes(income, colour = sex)) +
  geom_freqpoly(binwidth = 5000)
```
What happens if we also include `sex` in the model?
```{r}
h2 <- lm(income ~ height * sex, data = heights)
grid <- heights %>%
  expand(height, sex) %>%
  add_predictions(h2, "income")

ggplot(heights, aes(height, income)) +
  geom_point() +
  geom_line(data = grid) +
  facet_wrap(~sex)
```
Be particularly sceptical about the predictions for tall women and short men: there is not a lot of data in those regions.
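You can check just how sparse the data is there (a sketch; the cut-offs of 72 and 62 inches are arbitrary):
```{r}
# How many tall women and short men does the data actually contain?
heights %>% count(sex, tall = height > 72, short = height < 62)
```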
What is the difference between `*` and `+` in a model formula? `height * sex` includes the interaction between height and sex, so each sex gets its own slope and intercept; `height + sex` estimates a single height effect that is shifted up or down by sex:
```{r}
h3 <- lm(income ~ height + sex, data = heights)
grid <- heights %>%
  expand(height, sex) %>%
  gather_predictions(h2, h3)

ggplot(grid, aes(height, pred, colour = sex)) +
  geom_line() +
  facet_wrap(~model)
```
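Comparing the coefficients directly shows the same thing: `h2` estimates an extra `height:sex` interaction term that `h3` does not (a quick check):
```{r}
coef(h2)
coef(h3)
```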
### Continuous
There appears to be a relationship between a person's education and how poorly the model predicts their income. If we graph the model residuals against `education`, we see that the more educated a person is, the more the model underestimates their income.
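Here is a sketch of that residual plot, using `modelr::add_residuals()` with the height and sex model `h2` (the choice of model is an assumption):
```{r}
heights %>%
  add_residuals(h2) %>%
  ggplot(aes(education, resid)) +
  geom_boxplot(aes(group = education)) +
  geom_smooth(se = FALSE)
```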
But before we add a variable to our model, we need to do a little EDA + cleaning:
```{r}
ggplot(heights, aes(education)) + geom_bar()
heights_ed <- heights %>% filter(education >= 12)
nrow(heights_ed) / n
```
We could improve the model by adding education:
```{r}
he1 <- lm(income ~ height + education, data = heights_ed)
he2 <- lm(income ~ height * education, data = heights_ed)
```
How can we visualise the results of these models? One way is to think of the predictions as a surface: we have a 2d grid of height and education, and each point on that grid gets a predicted income.
```{r}
grid <- heights_ed %>%
  expand(height, education) %>%
  gather_predictions(he1, he2)

ggplot(grid, aes(height, education, fill = pred)) +
  geom_raster() +
  facet_wrap(~model)
```
It's easier to see what's going on in a line plot:
```{r}
ggplot(grid, aes(height, pred, group = education)) +
  geom_line() +
  facet_wrap(~model)

ggplot(grid, aes(education, pred, group = height)) +
  geom_line() +
  facet_wrap(~model)
```
One of the big advantages of `+` over `*` is that because the terms are independent, we can display them using two simple plots instead of one complex plot:
```{r}
heights_ed %>%
  expand(
    height = seq_range(height, 10),
    education = mean(education, na.rm = TRUE)
  ) %>%
  add_predictions(he1, "income") %>%
  ggplot(aes(height, income)) +
  geom_line()

heights_ed %>%
  expand(
    height = mean(height, na.rm = TRUE),
    education = seq_range(education, 10)
  ) %>%
  add_predictions(he1, "income") %>%
  ggplot(aes(education, income)) +
  geom_line()
```
The full interaction suggests that height matters less as education increases. But which model is "better"? We'll come back to that question later.
What happens if we add the data back into the plot? Do you get more or less sceptical about the results from this model?
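Here is one way to try it (a sketch that overlays the raw data on the predictions from the interaction model `he2`):
```{r}
heights_ed %>%
  expand(height, education) %>%
  add_predictions(he2, "income") %>%
  ggplot(aes(height, income)) +
  geom_point(data = heights_ed, alpha = 1 / 10) +
  geom_line(aes(group = education))
```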
You can imagine that if you had a model with four continuous predictors all interacting, it would be pretty complicated to understand what's going on in the model! And you certainly don't have to - it's totally fine to use a model simply as a tool for predicting new values, and in the next chapters you'll learn some techniques to help evaluate such models without looking at them. However, I think the more you can connect your understanding of the domain to the model, the more likely you are to detect potential problems before they occur. The goal is not to understand every last nuance of the model, but instead to understand more than what you did previously.
(The condvis package provides tools for this kind of conditional visualisation of fitted models.)
### Categorical
```{r}
s <- lm(income ~ sex, data = heights)
tidy(s)
```
Every level of the factor except one receives its own coefficient. The missing level acts as a baseline.
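You can see the coding that `lm()` uses with `model.matrix()` (a quick sketch on the first few rows):
```{r}
# Each level except the baseline gets its own 0/1 indicator column
model.matrix(~ sex, data = head(heights))
```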
To change the baseline, create a new factor with a new levels attribute. R will use the first level in the levels attribute as the baseline.
```{r}
heights$sex <- factor(heights$sex, levels = c("male", "female"))
```
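Refitting the sex-only model confirms that the baseline has changed (a quick check):
```{r}
tidy(lm(income ~ sex, data = heights))
```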
```{r}
hes <- lm(income ~ height + education + sex, data = heights)
tidy(hes)
```
```{r}
heights %>%
  group_by(sex) %>%
  do(glance(lm(income ~ height, data = .)))
```
```{r}
hes2 <- lm(income ~ height + education * sex, data = heights)
tidy(hes2)
```
### Splines
But what if the relationship between variables is not linear? For example, the relationship between income and education does not seem to be linear:
```{r}
ggplot(heights_ed, aes(education, income)) +
  geom_boxplot(aes(group = education)) +
  geom_smooth(se = FALSE)
```
One way to introduce non-linearity into our model is to use transformed variants of the predictors.
```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ education + I(education ^ 2) + I(education ^ 3), data = heights_ed)
heights_ed %>%
  expand(education) %>%
  gather_predictions(mod_e1, mod_e2) %>%
  ggplot(aes(education, pred, colour = model)) +
  geom_point() +
  geom_line()
```
This is a bit clunky because we have to surround each transformation with `I()`. This is because the rules of model algebra are a little different to usual algebra: `x ^ 2` is equivalent to `x * x`, which in the modelling algebra is equivalent to `x + x + x:x`, which is the same as `x`. This convention is useful because `(x + y + z) ^ 2` fits all main effects and second order interactions of x, y, and z.
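You can see this expansion with `model.matrix()` (a sketch using hypothetical variables `a`, `b`, and `c`):
```{r}
df <- tibble(a = 1:3, b = 4:6, c = 7:9)
colnames(model.matrix(~ (a + b + c) ^ 2, data = df))
```
An alternative to wrapping each transformation in `I()` is `poly()`, which builds the polynomial terms for you: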
```{r}
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ poly(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ poly(education, 3), data = heights_ed)
heights_ed %>%
  expand(education) %>%
  gather_predictions(mod_e1, mod_e2, mod_e3) %>%
  ggplot(aes(education, pred, colour = model)) +
  geom_point() +
  geom_line()
```
However, there's one major problem with using `poly()`: outside the range of the data, polynomials rapidly shoot off to positive or negative infinity.
```{r}
tibble(education = seq(5, 25)) %>%
  gather_predictions(mod_e1, mod_e2, mod_e3) %>%
  ggplot(aes(education, pred, colour = model)) +
  geom_line()
```
Splines avoid this problem by linearly interpolating outside the range of the data. This isn't great either, but it's a safer default when you don't know for sure what's going to happen.
```{r}
library(splines)
mod_e1 <- lm(income ~ education, data = heights_ed)
mod_e2 <- lm(income ~ ns(education, 2), data = heights_ed)
mod_e3 <- lm(income ~ ns(education, 3), data = heights_ed)
tibble(education = seq(5, 25)) %>%
  gather_predictions(mod_e1, mod_e2, mod_e3) %>%
  ggplot(aes(education, pred, colour = model)) +
  geom_line()
```
### Additive models
Generalised additive models (GAMs) take this one step further: instead of choosing the transformations yourself, you let the model estimate a smooth function of the predictor from the data:
```{r, dev = "png"}
library(mgcv)
# Fit a GAM with a smooth term for education
gam(income ~ s(education), data = heights)

# geom_smooth() can draw the same kind of smooth directly
ggplot(data = heights, mapping = aes(x = education, y = income)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x))
```