r4ds/model.Rmd

299 lines
8.9 KiB
Plaintext

---
layout: default
title: Model
---
# Model
A model is a function that summarizes how the values of one variable vary in response to the values of other variables. Models play a large role in hypothesis testing and prediction, but for the moment you should think of models just like you think of statistics. A statistic summarizes a *distribution* in a way that is easy to understand; and a model summarizes *covariation* in a way that is easy to understand. In other words, a model is just another way to describe data.
This chapter will explain how to build useful models with R.
### Outline
*Section 1* will explain what models are and what they can do for you.
*Section 2* will introduce R's model syntax, a general syntax that you can reuse with any of R's modeling functions. In this section, you will use the syntax to build linear models, the most commonly used type of model.
*Section 3* will teach you to build and interpret multivariate linear models, models that use more than one explanatory variable to explain the values of a response variable.
*Section 4* will explain how to use categorical variables in your models and how to interpret the results of models that use categorical variables.
*Section 5* will present a logical way to extend linear models to describe non-linear relationships.
### Prerequisites
To access the functions and data sets that we will use in the chapter, load the `ggplot2`, `dplyr`, `mgcv`, `splines`, and `broom` packages:
```{r}
# install.packages("")
library(ggplot2)
library(dplyr)
library(mgcv)
library(splines)
library(broom)
```
## What is a model?
Have you heard that a relationship exists between your height and your income? It sounds far-fetched---and maybe it is---but many people believe that taller people will be promoted faster and valued more for their work, an effect that directly inflates the income of the vertically gifted.
Do you think this is true? Could a relationship exist between a person's height and their income? Luckily, it is easy to measure someone's height as well as their income (and a swath of other related variables to boot), which means that we can collect data relevant to the question. In fact, the Bureau of Labor Statistics has been doing this in a controlled way for over 50 years with the [National Longitudinal Surveys (NLS)](https://www.nlsinfo.org/). The NLS tracks the income, education, and life circumstances of a large cohort of Americans across several decades. In case you are wondering, the point of the NLS is not to study the relationhip between height and income, that's just a lucky accident of the data.
You can load the latest cross-section of NLS data, collected in 2013 with the code below.
```{r echo = FALSE}
heights <- readRDS("data/heights.RDS")
```
I've narrowed the data down to 10 variables:
* `id` - A number ot identify each subject
* `income` - The self-reported income of each subject
* `height` - The height of each subject in inches
* `weight` - The weight of each subject in inches
* `sex` - The sex of each subject
* `race` - The race of each subject
* `education` - The number of years of education completed by each subject
* `asvab` - Each subject's score on the Armed Services Vocational Aptitude Battery (ASVAB), an intelligence assessment, out of 100.
* `sat_math` - Each subject's score on the math portion of the Scholastic Aptitude Test (SAT), out of 800.
* `bdate` - Month of birth with 1 = January.
summary that describes a r, like a mean, median, or variance.
+ Example problem/data set
```{r}
head(heights)
```
2. As normally taught, modeling conflates three subjects
+ Models as summaries
+ Hypothesis testing
+ Predictive modeling
3. C. This chapter shows how to build a model and use it as a summary. The methods for building a model apply to all three subjects.
## How to build a model
1. Best fit
+ Best fit of what? A certain class of function.
+ But how do you know which class to use? In some cases, the data can provide suggestions. In other cases existing theory can provide suggestions. But ultimately, you'll never know for sure. But that's okay, good enough is good enough.
2. What does best fit mean?
+ It may or may not accurately describe the true relationship. Heck, there might not even be a true relationship. But it is the best guess given the data.
+ Example problem/data set
+ It does not mean causation exists. Causation is just one type of relations, which is difficult enough to define, let alone prove.
3. How do you find the best fit?
+ With an algorithm. There is an algorithm to fit each specific class of function. We will cover some of the most useful here.
4. How do you know how good the fit is?
+ Adjusted $R^{2}$
5. Are we making assumptions when we fit a model?
+ No. Not unless you assume that you've selected the correct type of function (and I see no reason why you should assume that).
+ Assumptions come when you start hypothesis testing.
## Linear models
1. Linear models fit linear functions
2. How to fit in R
+ model syntax, which is reusable with all model functions
```{r}
income ~ height
mod <- lm(income ~ height, data = heights)
mod
tidy(mod)
glance(mod)
augment(mod)
```
+ visualize
```{r}
ggplot(data = heights, mapping = aes(x = height, y = income)) +
geom_point() +
geom_smooth(method = lm)
```
+ intercept or no intercept
```{r}
0 + earn ~ height
tidy(lm(income ~ 0 + height, data = heights))
tidy(lm(income ~ height, data = heights))
```
3. How to interpret
+ extract information. Resid. Predict.
```{r}
augment(mod)$.resid
augment(mod)$.fitted
```
+ Interpret coefficient
```{r}
tidy(mod)$estimate
```
```{r}
heights %>%
group_by(sex) %>%
do(glance(lm(income ~ height, data = .)))
```
## Categorical data
```{r}
smod <- lm(income ~ sex, data = heights)
smod
```
1. Factors
```{r}
heights$sex <- factor(heights$sex, levels = c("female", "male"))
smod2 <- lm(income ~ sex, data = heights)
tidy(smod)
tidy(smod2)
```
2. How to interpret
## Multiple Variables
1. How to fit multivariate models in R
```{r}
mmod <- lm(income ~ height + sex, data = heights)
summary(mmod)
```
2. How to interpret
```{r}
tidy(mmod)
```
3. Interaction effects
```{r}
tidy(lm(income ~ height + sex, data = heights))
tidy(lm(income ~ height + sex + height:sex, data = heights))
tidy(lm(income ~ height * sex, data = heights))
```
```{r}
tidy(lm(income ~ height + education, data = heights))
tidy(lm(income ~ height * education, data = heights))
```
4. Partition variance
+ Checking residuals
```{r}
m1 <- lm(income ~ height, data = heights)
# plot histogram of residuals
# plot residulas vs. sex
m2 <- lm(income ~ height + sex, data = heights)
# plot histogram of residuals
# plot residuals vs. education
m3 <- lm(income ~ height + sex + education, data = heights)
# plot histogram of residuals
m4 <- lm(income ~ height + sex + race + education,
data = heights)
# plot histogram of residuals
m5 <- lm(income ~ ., data = heights)
```
## Non-linear functions (recipes?)
0. Transformations
+ Log
```{r}
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
geom_point()
```
```{r}
lm(log(price) ~ log(carat), data = diamonds)
# visualize model line
```
+ Logit with `glm()`
What if no handy transformation exists?
```{r}
ggplot(data = heights, mapping = aes(x = education, y = income)) +
geom_boxplot(aes(group = education)) +
geom_smooth() +
coord_cartesian(ylim = c(0, 125000))
```
1. Polynomials
+ How to fit
+ why it doesn't work with missing values
```{r}
heights2 <- na.omit(heights)
tidy(lm(income ~ poly(education, 3), data = heights2))
ggplot(data = heights2, mapping = aes(x= education, y = income)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ poly(x, 3))
```
+ How to interpret
+ Strengths and Weaknesses
2. Splines
+ How to fit. Knots. Different types of splines.
```{r eval = FALSE}
bs(degree = 1) # linear splines
bs() # cubic splines
ns() # natural splines
```
```{r}
tidy(lm(income ~ ns(education, knots = c(10, 17)), data = heights))
tidy(lm(income ~ ns(education, df = 4), data = heights))
```
```{r}
ggplot(data = heights, mapping = aes(x= education, y = income)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ ns(x, df = 4)) +
coord_cartesian(ylim = c(0, 125000))
```
+ How to interpret
+ Strengths and weaknesses
3. General Additive Models
+ How to fit
```{r}
gam(income ~ s(education), data = heights)
ggplot(data = heights, mapping = aes(x= education, y = income)) +
geom_point() +
geom_smooth(method = gam, formula = y ~ s(x))
```
+ How to interpret
+ Strengths and weaknesses
```{r eval = FALSE}
# Linear z
gam(y ~ s(x) + z, data = df)
# Smooth x and smooth z
gam(y ~ s(x) + s(z), data = df)
# Smooth surface of x and z
# (a smooth function that takes both x and z)
gam(y ~ s(x, z), data = df)
```