More model brainstorming

hadley 2016-06-20 09:56:46 -05:00
parent fd8e21e47a
commit 98c14b843c
4 changed files with 97 additions and 92 deletions

View File

@@ -184,6 +184,8 @@ You will need to reload the package every time you start a new R session.
* Jenny Bryan and Lionel Henry for many helpful discussions around working
with lists and list-columns.
* Genevera Allen for discussions about models and modelling.
## Colophon
This book was built with:

View File

@@ -72,7 +72,7 @@ Either is fine, but confirmatory is much much harder. If you want your confirmat
cross-validate it.
1. 20% goes into a __query__ set. You can use this data
to compare models by hand, but you're not allowed to use it automatically.
1. 20% is held back for a __test__ set. You can only use this
data ONCE, to test your final model. If you use this data more than
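For example, here's a minimal sketch of such a split in base R (an illustration, assuming your data is in a data frame `df`, a stand-in name):
```{r}
# A sketch of a 60/20/20 split into training, query, and test sets.
# `df` is a stand-in for your data frame.
n <- nrow(df)
idx <- sample(n)  # shuffle the row indices

train <- df[idx[1:floor(0.6 * n)], ]
query <- df[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test  <- df[idx[(floor(0.8 * n) + 1):n], ]
```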
@@ -96,6 +96,15 @@ library(purrr)
library(tidyr)
```
```{r}
# Options that make your life easier
options(
  contrasts = c("contr.treatment", "contr.treatment"),
  na.action = na.exclude
)
```
## Overfitting
Both bootstrapping and cross-validation help us to spot and remedy the problem of __overfitting__, where the model fits the data we've seen so far extremely well, but does a bad job of generalising to new data.
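Here's a minimal sketch of the problem on simulated data: a high-degree polynomial fits the data it was trained on much more closely than a straight line does, but does much worse on new data from the same process:
```{r}
# A sketch of overfitting: a degree-15 polynomial tracks the training
# set closely but generalises poorly to a fresh sample.
set.seed(1)
train <- data.frame(x = runif(50))
train$y <- sin(2 * pi * train$x) + rnorm(50, sd = 0.3)
test <- data.frame(x = runif(50))
test$y <- sin(2 * pi * test$x) + rnorm(50, sd = 0.3)

rmse <- function(mod, data) sqrt(mean((data$y - predict(mod, data))^2))

simple <- lm(y ~ x, data = train)
wiggly <- lm(y ~ poly(x, 15), data = train)

c(train = rmse(wiggly, train), test = rmse(wiggly, test))  # big gap: overfitting
c(train = rmse(simple, train), test = rmse(simple, test))  # much smaller gap
```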
@@ -272,9 +281,41 @@ If you get a strange error, it's probably because the modelling function doesn't
and `test` contains the data you should use to validate the model. Together,
the test and train columns form an exclusive partition of the full dataset.
## Numeric summaries of model quality
When you start dealing with many models, it's helpful to have some rough way of comparing them so you can spend your time looking at the models that do the best job of capturing important features in the data.
One way to capture the quality of the model is to summarise the distribution of the residuals. For example, you could look at the quantiles of the absolute residuals. For this dataset, 25% of predictions are less than \$7,400 away, and 75% are less than \$25,800 away. That seems like quite a bit of error when predicting someone's income!
```{r}
library(modelr)  # provides qae() and rsquare(), used below

heights <- tibble::as_data_frame(readRDS("data/heights.RDS"))
h <- lm(income ~ height, data = heights)
h

qae(h, heights)
range(heights$income)
```
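`qae()` computes quantiles of the absolute prediction errors; you can also compute quantiles of the absolute residuals directly with base R (a sketch):
```{r}
# Quantiles of the absolute residuals, computed directly.
quantile(abs(residuals(h)), c(0.25, 0.5, 0.75), na.rm = TRUE)
```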
You might be familiar with the $R^2$. That's a single number summary that rescales the variance of the residuals to between 0 (very bad) and 1 (very good):
```{r}
rsquare(h, heights)
```
$R^2$ can be interpreted as the amount of variation in the data explained by the model. Here we're explaining 3% of the total variation - not a lot! But I don't think worrying about the relative amount of variation explained is that useful; instead I think you need to consider whether the absolute amount of variation explained is useful for your project.
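You can compute that rescaling by hand: because least-squares residuals (with an intercept) have mean zero, $R^2$ is one minus the ratio of the residual variance to the total variance. A quick sketch:
```{r}
# R^2 by hand: 1 - residual variance / total variance.
1 - var(residuals(h), na.rm = TRUE) / var(heights$income, na.rm = TRUE)
```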
It's called the $R^2$ because for simple models like this, it's just the square of the correlation between the variables:
```{r}
cor(heights$income, heights$height) ^ 2
```
The $R^2$ is an ok single number summary, but I prefer to think about the unscaled residuals because they're easier to interpret in the context of the original data. As you'll learn later, it's also a rather optimistic assessment: because you're assessing the model using the same data that was used to fit it, it really gives more of an upper bound on the quality of the model, not a fair assessment.
## Bootstrapping
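The basic idea: refit the model to many resamples of the data (rows drawn with replacement) and see how much the estimates vary. A minimal sketch in base R:
```{r}
# A sketch of the bootstrap: resample rows with replacement, refit the
# model, and collect the slope estimates.
boot_slope <- replicate(1000, {
  rows <- sample(nrow(heights), replace = TRUE)
  coef(lm(income ~ height, data = heights[rows, ]))[["height"]]
})
quantile(boot_slope, c(0.025, 0.975))  # a rough 95% interval for the slope
```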
## Cross-validation
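The basic idea: split the data into $k$ folds, hold out each fold in turn, fit on the rest, and measure prediction error on the held-out fold. A minimal sketch in base R:
```{r}
# A sketch of 5-fold cross-validation for income ~ height.
k <- 5
fold <- sample(rep(1:k, length.out = nrow(heights)))

cv_rmse <- vapply(1:k, function(i) {
  fit <- lm(income ~ height, data = heights[fold != i, ])
  pred <- predict(fit, newdata = heights[fold == i, ])
  sqrt(mean((heights$income[fold == i] - pred)^2, na.rm = TRUE))
}, numeric(1))

mean(cv_rmse)  # estimated out-of-sample RMSE
```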

View File

@@ -1,22 +1,29 @@
# Model
A model is a function that summarizes how the values of one variable vary in relation to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data.
The goal of a fitted model is to provide a simple, low-dimensional summary of a dataset. Ideally, the fitted model will capture "true" signals (i.e. patterns generated by the phenomenon of interest, not random variation), and ignore "false" signals. This is a hard problem because any fitted model is just the best model from a family of models. Just because it's the best doesn't make it good. And it certainly doesn't imply that the model is true. But a model doesn't need to be true to be useful. You've probably heard George Box's famous aphorism:
Family of models vs fitted model. Set of possible values, vs. one specific model. A fitted model = family of models plus a dataset.
> All models are wrong, but some are useful.
But you might not have read the fuller context:
> Now it would be very remarkable if any system existing in the real world
> could be exactly represented by any simple model. However, cunningly chosen
> parsimonious models often do provide remarkably useful approximations. For
> example, the law $PV = RT$ relating pressure $P$, volume $V$ and temperature $T$ of
> an "ideal" gas via a constant R is not exactly true for any real gas, but it
> frequently provides a useful approximation and furthermore its structure is
> informative since it springs from a physical view of the behavior of gas
> molecules.
>
> For such a model there is no need to ask the question "Is the model true?".
> If "truth" is to be the "whole truth" the answer must be "No". The only
> question of interest is "Is the model illuminating and useful?".
This chapter will explain how to build useful models with R. We'll explore the basics of model fitting, focusing on predictive models: how you can use simple fitted models to make predictions. A good model captures the important signal in the data, and releases the noise.
## Outline
*Section 1* will show you how to build linear models, the most commonly used type of model. Along the way, you will learn R's model syntax, a general syntax that you can reuse with most of R's modeling functions.
*Section 2* will show you the best ways to use R's model output, which often requires additional wrangling.
*Section 3* will teach you to build and interpret multivariate linear models, models that use more than one explanatory variable to explain the values of a response variable.
*Section 4* will explain how to use categorical variables in your models and how to interpret the results of models that use categorical variables. Here you will learn about interaction effects, as well as logistic models.
*Section 5* will present a logical way to extend linear models to describe non-linear relationships.
### Prerequisites
@@ -31,7 +38,11 @@ library(broom)
library(ggplot2)
library(dplyr)
library(tidyr)
```
I also recommend setting the following options. These make R's base modelling functions a little less surprising.
```{r}
# Options that make your life easier
options(
  contrasts = c("contr.treatment", "contr.treatment"),
  na.action = na.exclude
)
```
@@ -39,7 +50,7 @@
## Heights data
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that increases their income. Could this be true?
@@ -95,48 +106,34 @@ ggplot(heights, aes(height, income, group = height)) +
You can see there seems to be a fairly weak relationship: as height increases the median wage also seems to increase. But how could we summarise that more quantitatively?
One option is the __correlation__, $r$, from statistics, which measures how strongly the values of two variables are related. The sign of the correlation describes whether the variables have a positive or negative relationship. The magnitude of the correlation describes how strongly the values of one variable determine the values of the second. A correlation of 1 or -1 implies that the value of one variable completely determines the value of the second variable.
```{r echo = FALSE, cache=TRUE, fig.height = 2}
x1 <- rnorm(100)
y1 <- .5 * x1 + rnorm(100, sd = .5)
y2 <- -.5 * x1 + rnorm(100, sd = .5)

cordat <- data_frame(
  x = rep(x1, 5),
  y = c(-x1, y2, rnorm(100), y1, x1),
  cor = factor(
    rep(1:5, each = 100),
    labels = paste0("Correlation = ", c(-1, -0.5, 0, 0.5, 1))
  )
)

ggplot(cordat, aes(x, y)) +
  geom_point() +
  facet_grid(. ~ cor) +
  coord_fixed() +
  xlab(NULL) +
  ylab(NULL)
```
In R, we can compute the correlation with `cor()`:
```{r}
cor(heights$height, heights$income)
```
The correlation suggests that height may have a small effect on income.
## Linear models
Another way to summarise the relationship is with a linear model. A linear model is a very broad family of models: it encompasses all models that are a weighted sum of variables.
Use R's `lm()` function to fit a linear model to your data. The first argument of `lm()` should be a formula: two or more variables separated by a `~`. You've seen formulas before; we used them in Chapter 2 to facet graphs.
The formula specifies a family of models: for example, `income ~ height` describes the family of models `income = x0 + x1 * height`, where `x0` and `x1` are real numbers.
```{r}
income ~ height
```
We fit the model by supplying the family of models (the formula) and the data to a model fitting function, `lm()`. `lm()` finds the single model in the family of models that is closest to the data:
```{r}
h <- lm(income ~ height, data = heights)
h
```
We can extract the coefficients of this fitted model and write down the model it specifies:
```{r}
coef(h)
```
This tells us the model is $`r coef(h)[1]` + `r coef(h)[2]` * height$. In other words, a one inch increase in height is associated with an increase of \$937 in income.
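To see that concretely, plugging a height into the fitted equation gives the same number as `predict()` (a quick sketch):
```{r}
# A prediction is just intercept + slope * height.
coef(h)[["(Intercept)"]] + coef(h)[["height"]] * 60
predict(h, newdata = data.frame(height = 60))
```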
`lm()` defines "closest" as minimising the "root mean squared error": the square root of the mean of the squared residuals.
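You can compute that distance by hand (a sketch):
```{r}
# The root mean squared error of the fitted model.
sqrt(mean(residuals(h)^2, na.rm = TRUE))
```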
`lm()` fits a straight line that describes the relationship between the variables in your formula. You can picture the result visually like this.
@@ -147,11 +144,6 @@ ggplot(heights, aes(height, income)) +
`lm()` treats the variable(s) on the right-hand side of the formula as _explanatory variables_ that partially determine the value of the variable on the left-hand side of the formula, which is known as the _response variable_. In other words, it acts as if the _response variable_ is determined by a function of the _explanatory variables_. Linear regression is _linear_ because it finds the linear combination of the explanatory variables that best predicts the response.
Linear models are straightforward to interpret. Incomes have a baseline mean of $`r coef(h)[1]`$. Each one inch increase in height is associated with an increase of $`r coef(h)[2]`$ in income.
```{r}
summary(h)
```
### Exercises
@@ -235,32 +227,6 @@ ggplot(heights, aes(height, resid)) + geom_point()
Iteratively plotting the residuals instead of the original response leads to a natural way of building up a complex model in simple steps, which we'll explore in detail in the next chapter.
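For example (a sketch; `education` is an assumption about a column in this dataset):
```{r}
# Add the residuals to the data, then look for leftover structure
# against a variable the model doesn't yet use.
heights$resid_h <- residuals(h)
ggplot(heights, aes(education, resid_h)) + geom_point()
```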
### Exercises

View File

@@ -1,15 +1,3 @@
```{r include=FALSE, cache=FALSE}
set.seed(1014)
options(digits = 3)

knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE,
  cache = TRUE
)

options(dplyr.print_min = 6, dplyr.print_max = 6)
```
# (PART) Model {-}
# Introduction
@@ -25,6 +13,10 @@ In the course of modelling, you'll often discover data quality problems. Maybe a
<https://blog.engineyard.com/2014/pets-vs-cattle>.
<https://en.wikipedia.org/wiki/R/K_selection_theory>
## Fitted models vs. families of models
Family of models vs fitted model. Set of possible values, vs. one specific model. A fitted model = family of models plus a dataset.
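In code terms, a sketch using the built-in `mtcars` data: the formula describes a whole family; `lm()` plus a dataset picks one fitted member:
```{r}
fam <- mpg ~ wt                # a family of models: mpg = a + b * wt
fit <- lm(fam, data = mtcars)  # one fitted model from that family
coef(fit)                      # the specific a and b that lm() chose
```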
## Exploring vs. confirming
In this book we are going to focus on models primarily as tools for description. This is rather non-standard because we're normally interested in models for their inferential power: their ability to make accurate predictions for observations that we haven't seen yet.
@@ -36,3 +28,7 @@ It's not possible to do both on the same data set.
Doing correct inference is hard!
Generally, however, this will tend to make us over-optimistic about the quality of our model. In Chapter XXX you'll start to learn more about how we can judge the quality of a model on data that it wasn't fit to. But you have to beware of overfitting the data; in the next section we'll discuss some formal methods. A healthy dose of scepticism can be as powerful as precise quantitative methods: do you believe that a pattern you see in your sample is going to generalise to a wider population?
## Prediction vs. data discovery
PCA, clustering, ...