Merge branch 'master' of github.com:hadley/r4ds

hadley 2016-01-13 09:45:26 -06:00
commit b41d39feba
2 changed files with 236 additions and 180 deletions

data/heights.RDS (new binary file, not shown)

model.Rmd

---
layout: default
title: Model
---
# Model
Models are one of the most important tools for data scientists, because models describe relationships. Would you list out every value of a variable, or would you state the mean? Would you list out every pair of values, or would you state the function between variables?
A model is a function that summarizes how the values of one variable vary in response to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data.
This chapter will explain how to build useful models with R.
### Outline
*Section 1* will show you how to build linear models, the most commonly used type of model. Along the way, you will learn R's model syntax, a general syntax that you can reuse with most of R's modeling functions.

*Section 2* will show you the best ways to use R's model output, which often requires additional wrangling.

*Section 3* will teach you to build and interpret multivariate linear models, models that use more than one explanatory variable to explain the values of a response variable.

*Section 4* will explain how to use categorical variables in your models and how to interpret the results of models that use categorical variables. Here you will learn about interaction effects, as well as logistic models.

*Section 5* will present a logical way to extend linear models to describe non-linear relationships.
### Prerequisites
```{r}
# packages used in this chapter
library(ggplot2)  # plots
library(dplyr)    # group_by(), do()
library(mgcv)     # gam(), s()
library(splines)  # ns(), bs()
library(broom)    # tidy(), glance(), augment()
```
**Note: the current examples use a data set that will be replaced in later drafts.**
## Linear models
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that directly inflates the income of the vertically gifted. Do you think this is true?
Luckily, it is easy to measure someone's height, as well as their income (and a swath of other variables besides), which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years. The BLS [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/) track the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering, the point of the NLS is not to study the relationship between height and income; that's just a lucky accident.
You can load the latest cross-section of NLS data, collected in 2013, with the code below.
```{r echo = FALSE}
heights <- readRDS("data/heights.RDS")
```
I've narrowed the data down to 10 variables:
* `id` - A number to identify each subject
* `income` - The self-reported income of each subject
* `height` - The height of each subject in inches
* `weight` - The weight of each subject in pounds
* `sex` - The sex of each subject
* `race` - The race of each subject
* `education` - The number of years of education completed by each subject
* `asvab` - Each subject's score on the Armed Services Vocational Aptitude Battery (ASVAB), an intelligence assessment, out of 100.
* `sat_math` - Each subject's score on the math portion of the Scholastic Aptitude Test (SAT), out of 800.
* `bdate` - Month of birth with 1 = January.
```{r}
head(heights)
```
Now that you have the data, you can visualize the relationship between height and income. But what does the data say? How would you describe the relationship?
```{r warning = FALSE}
ggplot(data = heights, mapping = aes(x = height, y = income)) +
  geom_point()
```
First, let's address a distraction: the data is censored in an odd way. The y variable is income, which means that there are no y values less than zero. That's not odd. However, there are also no y values above $180,331. In fact, there is a line of unusual values at exactly $180,331. This is because the Bureau of Labor Statistics removed the top 2% of income values and replaced them with the mean value of the top 2% of values, an action that was not designed to enhance the usefulness of the data for data science.
Also, you can see that heights have been rounded to the nearest inch.
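If you want to check, a one-liner will do (a minimal sketch, assuming the `heights` data loaded above):

```{r}
# TRUE if every recorded height is a whole number of inches
all(heights$height == round(heights$height), na.rm = TRUE)
```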
Second, the relationship is not very strong.
```{r}
cor(heights$height, heights$income, use = "na.or.complete")
```
A model describes the relationship between two or more variables. There are multiple ways to describe any relationship. Which is best?
A common choice is to first decide on the form of the relationship, then find the parameters of that form that minimize the residuals, the vertical distances between the data points and the function.
Use R's `lm()` function to fit a linear model to your data. The first argument of `lm()` should be a formula, two or more variables separated by a `~`. You've seen formulas before; we used them in Chapter 2 to facet graphs.
```{r}
income ~ height
h <- lm(income ~ height, data = heights)
h
```
`lm()` fits a straight line that describes the relationship between the variables in your formula. You can picture the result like this.
```{r echo = FALSE}
ggplot(data = heights, mapping = aes(x = height, y = income)) +
  geom_point() +
  geom_smooth(method = lm)
```
`lm()` treats the variable(s) on the right-hand side of the formula as _explanatory variables_ that partially determine the value of the variable on the left-hand side of the formula, which is known as the _response variable_. In other words, it acts as if the _response variable_ is determined by a function of the _explanatory variables_. It then finds the linear function that best fits the data.
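What does "best fits" mean? By default, `lm()` chooses the coefficients that minimize the sum of squared residuals. Here is a minimal sketch that compares the fitted coefficients against an alternative line (the alternative intercept and slope are hypothetical values chosen only for illustration):

```{r}
# sum of squared residuals for a candidate line
rss <- function(intercept, slope) {
  sum((heights$income - (intercept + slope * heights$height))^2, na.rm = TRUE)
}

rss(coef(h)[1], coef(h)[2])  # the line lm() chose
rss(0, 2000)                 # a hypothetical alternative; note the larger value
```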
Linear models are straightforward to interpret. Incomes have a baseline mean of $`r coef(h)[1]`. Each one-inch increase of height above zero is associated with an increase of $`r coef(h)[2]` in income.
```{r}
summary(h)
```
To create a model without an intercept, add 0 to the formula.
```{r}
lm(income ~ 0 + height, data = heights)
```
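Be careful with the interpretation: removing the intercept forces the line through the origin, so the slope changes too. A quick comparison (a sketch):

```{r}
coef(lm(income ~ height, data = heights))      # intercept plus slope
coef(lm(income ~ 0 + height, data = heights))  # slope only, through the origin
```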
## Using model output
R model output is not very tidy. It is designed to provide a data store that you can extract information from with helper functions.
```{r}
coef(h)
predict(h)[1:5]
resid(h)[1:5]
```
The `broom` package provides the most useful helper functions for working with R models. `broom` functions return model information as data frames, which lets you quickly embed the information into your data science workflow.
### tidy()

`tidy()` returns a data frame with one row per model coefficient.
```{r}
tidy(h)
```
### glance()

`glance()` returns a data frame with a single row of model-level summaries, such as $R^{2}$.
```{r}
glance(h)
```
### augment()

`augment()` returns a copy of the modeled data with added observation-level columns, such as the fitted values (`.fitted`) and residuals (`.resid`).
```{r}
augment(h)[1:5, ]
```
```{r}
heights2 <- augment(h, heights)
ggplot(data = heights2, mapping = aes(x = education, y = .resid)) +
  geom_point() +
  geom_smooth()
```
## Multivariate models
There appears to be a relationship between a person's education and how poorly the model predicts their income. The graph of the model residuals against `education` above shows that the more educated a person is, the worse the model underestimates their income.

Patterns in the residuals suggest that relationships exist between y and other variables, even when the effect of height is accounted for.

Add variables to a model by adding variables to the right-hand side of the model formula.
```{r}
income ~ height + education
he <- lm(income ~ height + education, data = heights)
tidy(he)
```
### Interpretation
The coefficient of each variable displays the change in income that is associated with a one-unit change in the variable _when all other variables are held constant_.
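You can see what "held constant" means by predicting incomes for hypothetical subjects who differ in only one variable. The profile values below (a 66 inch height; 12 versus 16 years of education) are made up for illustration:

```{r}
# same height, different educations: the predictions differ by
# exactly four times the education coefficient
predict(he, newdata = data.frame(height = 66, education = c(12, 16)))
```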
### Interaction effects

An interaction term lets the effect of one variable depend on the value of another. In R's formula syntax, `height:education` adds the interaction term alone, and `height * education` is shorthand for `height + education + height:education`, so the last two models below are identical.
```{r}
tidy(lm(income ~ height + education, data = heights))
tidy(lm(income ~ height + education + height:education, data = heights))
tidy(lm(income ~ height * education, data = heights))
```
## Categorical variables
What about sex? Many sources have observed that there is a difference in income between the sexes. Might this explain the height effect, since men tend to be taller than women? We can find the effect of height independent of sex by adding sex to the model; however, sex is a categorical variable.
### Factors
R stores categorical data as factors or character strings. If you add a string to a model, R will convert it to a factor for the purposes of the model.
A factor is an integer vector with a levels attribute. You can make a factor with `factor()`.
```{r}
fac <- factor(c("c", "a", "b"), levels = c("a", "b", "c"), labels = c("blond", "brunette", "red"))
fac
unclass(fac)
```
Each level of the factor (i.e. unique value) is encoded as an integer and displayed with the label that is associated with that integer.
If you use factors outside of a model, you will notice some limiting behavior:

* You cannot add values to a factor that do not appear in its levels attribute.
* Factors retain their full levels attribute when you subset them. To avoid this, use `drop = TRUE`.
```{r}
fac[1]
fac[1, drop = TRUE]
```
* If you coerce a factor to a numeric, R will convert the integer vector that underlies the factor, not the level labels that you see when you print the factor.
```{r}
num_fac <- factor(1:3, levels = 1:3, labels = c("100", "200", "300"))
num_fac
as.numeric(num_fac)
```
To coerce these labels to a different data type, first coerce the factor to a character string with `as.character()`:
```{r}
as.numeric(as.character(num_fac))
```
### Interpretation
Add categorical variables to a model in the same way that you would add continuous variables.
```{r}
s <- lm(income ~ sex, data = heights)
tidy(s)
```
Every level of the factor except one receives its own coefficient. The missing level acts as a baseline.
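You can see the 0/1 dummy variables that R creates behind the scenes with `model.matrix()` (a sketch):

```{r}
head(model.matrix(income ~ sex, data = heights))
```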
To change the baseline, create a new factor with a new levels attribute. R will use the first level in the levels attribute as the baseline.
```{r}
heights$sex <- factor(heights$sex, levels = c("male", "female"))
```
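Base R's `relevel()` is a shortcut for changing which level comes first; here it is applied to a copy (`sex2` is a hypothetical name used only for this example):

```{r}
# move "female" to the front, making it the baseline
sex2 <- relevel(heights$sex, ref = "female")
levels(sex2)
```

Now add `sex` to the model alongside the continuous variables: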
```{r}
hes <- lm(income ~ height + education + sex, data = heights)
tidy(hes)
```
You can also fit a separate model for each subgroup. Here `group_by()` plus `do()` fits one height model per sex and collects the `glance()` summaries:

```{r}
heights %>%
  group_by(sex) %>%
  do(glance(lm(income ~ height, data = .)))
```
Interactions work with categorical variables too. Here `education * sex` lets the effect of education differ between the sexes:

```{r}
hes2 <- lm(income ~ height + education * sex, data = heights)
tidy(hes2)
```
### Logistic models
So far the y variable of our models has been a continuous variable, `income`. You can use linear regression to model a categorical y variable by transforming y into a continuous variable with a _link function_. Then fit a model to the output of the link function, and use the link function to back-transform and interpret the results.

The most common link function is the logit function, which transforms a binary y variable into a continuous range.
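Concretely, the logit maps a probability $p \in (0, 1)$ onto the whole real line, and its inverse maps back:

$$\operatorname{logit}(p) = \log \frac{p}{1 - p}, \qquad p = \frac{1}{1 + e^{-x}}$$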
Use `glm()` to perform logistic regression in R.
```{r}
she <- glm(sex ~ height + education, family = binomial(link = "logit"), data = heights)
tidy(she)
```
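The coefficients of a logistic model are on the log-odds scale. To get back to more interpretable quantities, exponentiate them into odds ratios, or let `predict()` apply the inverse link for you (a sketch using the model above; with the levels set earlier, the modeled outcome is the second level of `sex`):

```{r}
exp(coef(she))                         # coefficients as odds ratios
head(predict(she, type = "response"))  # predicted probabilities
```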
## Non-linear models
But what if the relationship between variables is not linear? For example, the relationship between income and education does not seem to be linear.
```{r}
ggplot(data = heights, mapping = aes(x = education, y = income)) +
  geom_boxplot(aes(group = education)) +
  geom_smooth() +
  coord_cartesian(ylim = c(0, 125000))
```
You can still use linear regression to model non-linear relationships: transform the variables until the relationship between the transformed variables is linear.
### Transformations
```{r}
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

lm(log(price) ~ log(carat), data = diamonds)
# visualize model line
```
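One way to visualize the model line is to plot the transformed variables, where the relationship is roughly linear, and let `geom_smooth()` refit the same model (a sketch):

```{r}
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
  geom_point() +
  geom_smooth(method = lm)
```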
What if no handy transformation exists?
### Splines
A spline is a piecewise polynomial: it divides the range of x into intervals at points called knots and fits a polynomial within each interval, joined smoothly at the knots. The `splines` package provides spline terms that drop directly into `lm()` formulas.
```{r eval = FALSE}
bs(degree = 1)  # linear splines
bs()            # cubic splines
ns()            # natural splines
```
```{r}
# place the knots by hand, or give a df and let ns() choose them
tidy(lm(income ~ ns(education, knots = c(10, 17)), data = heights))
tidy(lm(income ~ ns(education, df = 4), data = heights))
```
```{r}
ggplot(data = heights, mapping = aes(x = education, y = income)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ ns(x, df = 4)) +
  coord_cartesian(ylim = c(0, 125000))
```
### Additive models

Generalized additive models (GAMs) take splines one step further: `gam()` from the mgcv package chooses the amount of smoothness for you. Wrap a variable in `s()` to give it a smooth, automatically chosen, non-linear form.
```{r}
gam(income ~ s(education), data = heights)

ggplot(data = heights, mapping = aes(x = education, y = income)) +
  geom_point() +
  geom_smooth(method = gam, formula = y ~ s(x))
```
```{r eval = FALSE}
gam(y ~ s(x) + z, data = df)     # linear in z
gam(y ~ s(x) + s(z), data = df)  # smooth x plus smooth z
gam(y ~ s(x, z), data = df)      # joint smooth surface over x and z
```
## Summary
We've avoided two things in this chapter that are usually conflated with models: hypothesis testing and predictive analysis.
There are other types of modeling algorithms; each provides a valid description of the data.
Which description will be best? Does the relationship have a known form? Does the data have a known structure? Are you going to attempt hypothesis testing that imposes its own constraints?