Models are one of the most important tools for data scientists, because models describe relationships. Would you list out every value of a variable, or would you state the mean? Would you list out every pair of values, or would you state the function between variables?
*Section 2* will show you how to use R to build linear models, the most commonly used modeling tool. The section introduces R's model syntax, a general syntax that you can reuse with any of R's modeling functions.
**Note: the current examples use a data set that will be replaced in later drafts.**
## What is a model?
1. A model is just a summary, like a mean, median, or variance.
+ Example problem/data set
```{r echo = FALSE}
heights <- read.csv("data/heights.csv")
```
```{r}
head(heights)
```
2. As normally taught, modeling is a conflation of three subjects:
+ Models as summaries
+ Hypothesis testing
+ Predictive modeling
3. This chapter shows how to build a model and use it as a summary. The methods for building a model apply to all three subjects.
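To make the idea of a model as a summary concrete, here is a minimal sketch on simulated data. The `toy` data frame and its columns are invented for illustration and are not the `heights` data. One number, the mean, stands in for every value of a variable; two numbers, an intercept and a slope, stand in for every pair of values.

```{r}
# Simulated data, invented for illustration only
set.seed(1)
toy <- data.frame(x = runif(100, min = 60, max = 75))
toy$y <- 2 * toy$x + rnorm(100, sd = 5)

# Summarize one variable with a single number instead of 100 values
mean(toy$y)

# The mean is itself the simplest linear model: an intercept and nothing else
mean_mod <- lm(y ~ 1, data = toy)
coef(mean_mod)

# Summarize the x-y relationship with two numbers instead of 100 pairs
toy_mod <- lm(y ~ x, data = toy)
coef(toy_mod)
```

The intercept-only model returns the same number as `mean()`, which is one way to see that a fitted model and a summary statistic are the same kind of object.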
## How to build a model
1. Best fit
+ Best fit of what? A certain class of function.
+ But how do you know which class to use? In some cases, the data can provide suggestions. In other cases, existing theory can provide suggestions. Ultimately, you'll never know for sure, but that's okay: good enough is good enough.
2. What does best fit mean?
+ It may or may not accurately describe the true relationship. Heck, there might not even be a true relationship. But it is the best guess given the data.
+ Example problem/data set
+ It does not mean causation exists. Causation is just one type of relationship, which is difficult enough to define, let alone prove.
3. How do you find the best fit?
+ With an algorithm. There is an algorithm to fit each specific class of function. We will cover some of the most useful here.
4. How do you know how good the fit is?
+ Adjusted $R^{2}$
5. Are we making assumptions when we fit a model?
+ No. Not unless you assume that you've selected the correct type of function (and I see no reason why you should assume that).
+ Assumptions come when you start hypothesis testing.
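The steps above can be sketched for one class of function, straight lines, on simulated data. This sketch assumes that "best" means the smallest sum of squared residuals, which is the criterion `lm()` uses; other classes of function and other criteria call for other algorithms. The data frame `toy` and the helper `sse()` are hypothetical names introduced here.

```{r}
# Simulated data, invented for illustration only
set.seed(2)
toy <- data.frame(x = rnorm(50))
toy$y <- 1 + 3 * toy$x + rnorm(50)

# Sum of squared residuals for a candidate line y = b[1] + b[2] * x
sse <- function(b, data) {
  sum((data$y - (b[1] + b[2] * data$x))^2)
}

# A general-purpose optimizer searches for the line with the smallest sse
best <- optim(par = c(0, 0), fn = sse, data = toy, method = "BFGS")
best$par

# lm() finds the same line directly with an algorithm specific to linear models
toy_mod <- lm(y ~ x, data = toy)
coef(toy_mod)

# Adjusted R-squared by hand: R-squared penalized for the number of predictors
n <- nrow(toy)   # number of observations
p <- 1           # number of predictors, not counting the intercept
r2 <- 1 - sum(resid(toy_mod)^2) / sum((toy$y - mean(toy$y))^2)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2

# The same value as reported by summary()
summary(toy_mod)$adj.r.squared
```

The optimizer and `lm()` should agree up to numerical tolerance. Neither result says the true relationship is a line; each only identifies the best line given these data.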
## Linear models
1. Linear models fit linear functions
2. How to fit in R
+ model syntax, which is reusable with all model functions
```{r}
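# A formula: the response on the left of ~, the predictor on the right
earn ~ height
# Fit a linear model of earn as a function of height
lm(earn ~ height, data = heights)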
```
+ save model output
```{r}
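# Save the fitted model so that it can be inspected later
hmod <- lm(earn ~ height, data = heights)
# Extract the estimated intercept and slope
coef(hmod)
# Print a fuller report: coefficients, standard errors, and R-squared
summary(hmod)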
```
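As a small illustration of how the model syntax carries over between functions, the sketch below stores a formula in a variable and passes it to both `lm()` and `glm()`. The `toy` data frame is simulated and only borrows the column names `earn` and `height`; its values are invented and are not the `heights` data.

```{r}
# Simulated stand-in data, invented for illustration only
set.seed(3)
toy <- data.frame(height = rnorm(200, mean = 67, sd = 4))
toy$earn <- 1000 * toy$height + rnorm(200, sd = 5000)

# A formula is an ordinary R object, so it can be stored and reused
f <- earn ~ height
class(f)

# The same formula works with different modeling functions
lm_fit <- lm(f, data = toy)
glm_fit <- glm(f, data = toy)  # gaussian family by default, so it fits the same line

coef(lm_fit)
coef(glm_fit)
```

Because `glm()` defaults to the gaussian family, both calls estimate the same intercept and slope here; changing the `family` argument would fit a different kind of model with the same formula.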
+ visualize
```{r}
library(ggplot2)
ggplot(data = heights, mapping = aes(x = height, y = earn)) +