Start moving towards simulated example

This commit is contained in:
hadley 2016-07-06 08:11:20 -05:00
parent a1b4536f9d
commit da2d2b4117
1 changed files with 113 additions and 22 deletions


@@ -1,6 +1,8 @@
# Model
The goal of a fitted model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
This is a hard problem because any fitted dataset is just the "best" (closest) model from a family of models. Just because it's the best model doesn't make it good. And it certainly doesn't imply that the model is true. But a model doesn't need to be true to be useful. You've probably heard George Box's famous aphorism:
> All models are wrong, but some are useful.
@@ -21,15 +23,13 @@ But you might not have read the fuller content.
In this chapter, we'll explore the basics of model fitting.
We are going to focus on predictive models: how you can use simple fitted models to help better understand your data. Many of the models will be motivated by plots: you'll use a model to capture strong signals in the data so you can focus on what remains. This is a different motivation from most introductions to modelling, but if you go on to more traditional coverage, you can apply these same ideas to help you understand what's going on.
### Prerequisites
To access the functions and data sets that we will use in the chapter, load the following packages:
```{r setup, message = FALSE}
# Modelling functions
library(modelr)
library(broom)
@@ -50,33 +50,121 @@ options(
)
```
## Basics
It can be easier to learn about modelling in a simulated environment where we know the truth. For example, imagine we have this data:
```{r}
df <- data_frame(
  x = rep(1:10, each = 3),
  # the true model: intercept 5, slope 2, plus normal noise
  y = 5 + x * 2 + rnorm(length(x), sd = 2)
)

ggplot(df, aes(x, y)) +
  geom_point()
```
You already know how to fit a straight line to that dataset:
```{r}
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```
But what is actually happening here?
Model class: linear. Model family: `y = a + b * x`, where the intercept `a` and slope `b` are the parameters that pick out a specific model.
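To make that concrete, here is one way to represent a member of this family in R. The helper `make_model()` is purely illustrative (it's not part of any package): given values for `a` and `b`, it returns the corresponding prediction function.

```{r}
# Illustrative helper (an assumption, not a package function): turn a
# choice of parameters into a member of the linear family
make_model <- function(a, b) {
  function(x) a + b * x
}

# The member with intercept 5 and slope 2, evaluated at x = 1, 2, 3
m <- make_model(5, 2)
m(1:3)
```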
There are lots of possible models in that family. Here are a few:
```{r}
models <- data_frame(
  a = runif(250, -20, 80),
  b = runif(250, -5, 5)
)

ggplot(df, aes(x, y)) +
  geom_abline(aes(intercept = a, slope = b), data = models, alpha = 1/4) +
  geom_point()
```
There are an infinite number of possible models. But some of them are obviously bad. How can we judge a good model? It seems like it must have something to do with distance: a good model should be close to the data points.
One common measure of distance is Euclidean distance. For each model, we could compute the Euclidean distance between each data point and the model's prediction (the point on the line with the same `x` value). We have multiple points, so we aggregate them with the root mean squared distance.
```{r}
# Root-mean-squared distance between the data and the line y = a + b * x
distance <- function(a, b) {
  dist_one <- function(a, b) sqrt(mean((df$y - (a + b * df$x)) ^ 2))
  purrr::map2_dbl(a, b, dist_one)
}
models <- models %>%
  mutate(dist = distance(a, b)) %>%
  arrange(dist)

# Overlay the models with distance < 10; fainter lines are further from the data
ggplot(df, aes(x, y)) +
  geom_abline(aes(intercept = a, slope = b, alpha = dist), data = filter(models, dist < 10)) +
  geom_point() +
  scale_alpha_continuous(trans = "reverse")

# Each point is one model; reversing the scale makes the best models stand out
ggplot(models, aes(a, b)) +
  geom_point(aes(colour = dist)) +
  scale_colour_continuous(trans = "reverse")
```
Obviously, trying lots of random models is unlikely to find a good model quickly. Instead, let's try to be systematic:
```{r}
# Centre the grid on good parameter values (a near 5, b near 2),
# with hand-picked spacing
mu <- c(5, 2)
sd <- c(0.9, .14)

grid <- expand.grid(
  a = mu[1] + seq(-5, 5, length = 25) * sd[1],
  b = mu[2] + seq(-5, 5, length = 25) * sd[2]
) %>%
  mutate(dist = distance(a, b))

best_10 <- grid %>% filter(rank(dist) <= 10)
best_10

grid %>%
  ggplot(aes(a, b, fill = dist)) +
  geom_raster() +
  geom_point(data = best_10) +
  scale_fill_continuous(trans = "reverse")

ggplot(df, aes(x, y)) +
  geom_abline(aes(intercept = a, slope = b), data = best_10, alpha = 1/3) +
  geom_point()
```
That's called a grid search. Or we could do something even more sophisticated: a Newton-Raphson search. Here you effectively pick a starting point and ski down the steepest slope until you hit the bottom.
```{r}
# Numerically minimise the distance function, starting from a = 0, b = 0
nlm(function(x) distance(x[1], x[2]), c(0, 0))
```
However, rather than doing this yourself, you'll generally rely on a model fitting function to do it for you. Here we're fitting a linear model, so we can use `lm()`:
```{r}
summary(lm(y ~ x, data = df))
```
Generally, model fitting functions will do something more sophisticated than a gradient search: for linear models, they exploit connections between geometry, calculus, and linear algebra. The linear model closest to the data can always be found exactly by constructing and then solving a system of matrix equations, with no iterative search needed.
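Here's a minimal sketch of that idea, solving the normal equations `(X'X) b = X'y` directly. This is an illustration of the linear algebra, not literally what `lm()` does internally (`lm()` uses a more numerically stable QR decomposition):

```{r}
# Design matrix: a column of 1s for the intercept, plus x
X <- cbind(1, df$x)

# Solve the normal equations (X'X) b = X'y for the coefficients
b <- solve(t(X) %*% X, t(X) %*% df$y)
b

# The estimates match what lm() returns
coef(lm(y ~ x, data = df))
```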
## Heights data
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?
Luckily, it is easy to measure someone's height, as well as their income, which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering just how your tax dollars are being spent, the point of the NLS is not to study the relationship between height and income; that's just a lucky accident.
A small sample of the full data set is included in modelr:
```{r echo = FALSE}
heights <- tibble::as_data_frame(readRDS("data/heights.RDS"))
```

```{r}
heights
n <- nrow(heights)
n
```
I've narrowed the data down to 10 variables:
* `id` - A number to identify each subject
* `income` - The self-reported income of each subject
* `height` - The height of each subject in inches
* `weight` - The weight of each subject in pounds
* `sex` - The sex of each subject
* `race` - The race of each subject
* `education` - The number of years of education completed by each subject
* `asvab` - Each subject's score on the Armed Services Vocational Aptitude Battery (ASVAB), an intelligence assessment, out of 100.
* `sat_math` - Each subject's score on the math portion of the Scholastic Aptitude Test (SAT), out of 800.
* `bdate` - Month of birth with 1 = January.
As well as `height` and `income`, there are other variables that might affect someone's income: `sex`, `race`, years of `education`, and their score on the `asvab` aptitude assessment.
Now that you have the data, you can visualize the relationship between height and income. But what does the data say? How would you describe the relationship?
@@ -88,10 +176,13 @@ ggplot(heights, aes(height, income)) +
First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there is also a line of unusual values at exactly $180,331, and no values above it. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.
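We can check the size of that spike directly. This assumes the replacement values sit exactly at the figure quoted above:

```{r}
# Number of observations topcoded at exactly $180,331
sum(heights$income == 180331)
```

To keep these artificial values from distorting our models, we'll drop the censored incomes: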
```{r}
n <- nrow(heights)
heights <- heights %>% filter(income < 150000)
nrow(heights) / n  # fraction of the original observations retained
```
I'm going to record the original number of observations in `n`. We'll come back to this every now and then to make sure that we haven't thrown out too much of our data.
Also, you can see that heights have been rounded to the nearest inch, so using boxplots will make it easier to see the pattern. We'll also remove the very tall and very short people so we can focus on the most typical heights:
```{r}