More work on model basics

hadley 2016-07-21 13:37:29 -05:00
parent e9d67ce389
commit a07c4b25d2
1 changed file with 128 additions and 81 deletions


@@ -2,26 +2,25 @@
## Introduction
The goal of a model is to provide a simple low-dimensional summary of a dataset. In the context of this book we're going to use models to partition data into patterns and residuals. Strong patterns will hide subtler trends, so we'll use models to help peel back layers of structure as we explore a dataset.
However, before we can start using models on interesting, real datasets, you need to understand the basics of how models work. For that reason, this chapter of the book is unique because it only uses simulated datasets. These datasets are very simple, and not at all interesting, but they will help you understand the essence of modelling before you apply the same techniques to real data in the next chapter.
A fitted model is a function: you give it a set of inputs (often called the predictors), and it returns a value (typically called the prediction). Fitting a model takes place in two steps:
1. You define a __family of models__ that you think will capture an
    interesting pattern in your data. This family is expressed as a
    mathematical formula that relates your predictors to the response.
    For a simple 1d model, this might look like `y = a * x + b` or
    `y = a * x ^ b`.

1. You generate a __fitted model__ by finding the one model from the family
    that is the closest to your data. This takes the generic model family
    and makes it specific, like `y = 3 * x + 7` or `y = 9 * x ^ 2`.
It's important to understand that a fitted model is just the closest model from a family of models. That implies that you have the "best" model (according to some criteria); it doesn't imply that you have a good model, and it certainly doesn't imply that the model is "true". George Box puts this well in his famous aphorism:
> All models are wrong, but some are useful.
@@ -40,7 +39,7 @@ It's worth reading the fuller context of the quote:
> If "truth" is to be the "whole truth" the answer must be "No". The only
> question of interest is "Is the model illuminating and useful?".
The goal of a model is not to uncover truth, but to discover a simple approximation that is still useful.
### Prerequisites
@@ -53,6 +52,7 @@ library(broom)
# Modelling requires plenty of visualisation and data manipulation
library(ggplot2)
library(tibble)
library(dplyr)
library(tidyr)
@@ -65,57 +65,75 @@ options(
## A simple model
Let's take a look at the simulated dataset `sim1`. It contains two continuous variables, `x` and `y`. Let's plot them to see how they're related:
```{r}
ggplot(sim1, aes(x, y)) +
geom_point()
```
You can see a strong pattern in the data. Let's use a model to capture that pattern and make it explicit. It's our job to supply the basic form of the model. In this case, the relationship looks linear, i.e. `y = a_1 + a_2 * x`. Let's start by getting a feel for what models from that family look like by randomly generating a few and overlaying them on the data. For this simple case, we can use `geom_abline()`, which takes an intercept and slope as parameters. Later on we'll learn more general techniques that work with any model.
```{r}
models <- tibble(
  a1 = runif(250, -20, 40),
  a2 = runif(250, -5, 5)
)
ggplot(sim1, aes(x, y)) +
  geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
  geom_point()
```
There are 250 models on this plot, but a lot are really bad! We need to find the good models by making precise our intuition that a good model is "close" to the data. We need a way to quantify the distance between the data and a model. Then we can fit the model by finding the values of `a_1` and `a_2` that generate the model with the smallest distance from the data.
One easy place to start is to find the vertical distance between each point and the model, as in the following diagram. (Note that I've shifted the x values slightly so you can see the individual distances.)
```{r, echo = FALSE}
dist1 <- sim1 %>%
  mutate(
    dodge = rep(c(-1, 0, 1) / 20, 10),
    x1 = x + dodge,
    pred = 7 + x1 * 1.5
  )

ggplot(dist1, aes(x1, y)) +
  geom_abline(intercept = 7, slope = 1.5, colour = "grey40") +
  geom_point(colour = "grey40") +
  geom_linerange(aes(ymin = y, ymax = pred), colour = "#3366FF")
```
This distance is just the difference between the y value given by the model (the __prediction__), and the actual y value in the data (the __response__).
To compute this distance, we first turn our model family into an R function. This takes the model parameters and the data as inputs, and gives values predicted by the model as output:
```{r}
model1 <- function(a, data) {
  a[1] + data$x * a[2]
}
model1(c(7, 1.5), sim1)
```
Next, we need some way to compute an overall distance between the predicted and actual values. In other words, the plot above shows 30 distances: how do we collapse that into a single number?
One common way to do this in statistics is to use the "root-mean-squared deviation". We compute the difference between actual and predicted values, square them, average them, and then take the square root. This distance has lots of appealing mathematical properties, which we're not going to talk about here. You'll just have to take my word for it!
```{r}
measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)
  sqrt(mean(diff ^ 2))
}
measure_distance(c(7, 1.5), sim1)
```
Now we can use purrr to compute the distance for all the models defined above. We need a helper function because our distance function expects the model as a numeric vector of length 2.
```{r}
sim1_dist <- function(a1, a2) {
  measure_distance(c(a1, a2), sim1)
}

models <- models %>% mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
```
@@ -125,30 +143,30 @@ Next, let's overlay the 10 best models on to the data. I've coloured the models
ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist),
    data = filter(models, rank(dist) <= 10)
  )
```
We can also think about these models as observations, and visualise them with a scatterplot of `a1` vs `a2`, again coloured by `-dist`. We can no longer directly see how the model compares to the data, but we can see many models at once. Again, I've highlighted the 10 best models, this time by drawing red circles underneath them.
```{r}
ggplot(models, aes(a1, a2)) +
  geom_point(data = filter(models, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))
```
Instead of trying lots of random models, we could be more systematic and generate an evenly spaced grid of points (this is called a grid search). I picked the parameters of the grid roughly by looking at where the best models were in the plot above.
```{r}
grid <- expand.grid(
  a1 = seq(-5, 20, length = 25),
  a2 = seq(1, 3, length = 25)
) %>%
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))

grid %>%
  ggplot(aes(a1, a2)) +
  geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))
```
@@ -159,12 +177,12 @@ When you overlay the best 10 models back on the original data, they all look pre
ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist),
    data = filter(grid, rank(dist) <= 10)
  )
```
You could imagine iteratively making the grid finer and finer until you narrowed in on the best model. But there's a better way to tackle that problem: a numerical minimisation tool called Newton-Raphson search. The intuition of Newton-Raphson is pretty simple: you pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and then repeat again and again, until you can't go any lower. In R, we can do that with `optim()`:
```{r}
best <- optim(c(0, 0), measure_distance, data = sim1)
@@ -172,19 +190,19 @@ best$par
ggplot(sim1, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(intercept = best$par[1], slope = best$par[2])
```
Don't worry too much about the details of how `optim()` works. It's the intuition that's important here. If you have a function that defines the distance between a model and a dataset, and an algorithm that can minimise that distance by modifying the parameters of the model, you can find the best model. The neat thing about this approach is that it will work for any family of models that you can write an equation for.
There's one more approach that we can use for this model, because it is a special case of a broader family: linear models. A linear model has the general form `y = a_1 + a_2 * x_1 + a_3 * x_2 + ... + a_n * x_(n - 1)`. So this simple model is equivalent to a general linear model where n is 2 and `x_1` is `x`. R has a tool specifically designed for fitting linear models called `lm()`. `lm()` has a special way to specify the model family: formulas. Formulas look like `y ~ x`, which `lm()` will translate to a function like `y = a_1 + a_2 * x`. We can fit the model and look at the output:
```{r}
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```
These are exactly the same values we got with `optim()`! Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model by (effectively) inverting a matrix.
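If you want to see the two sets of parameters side by side, here's a quick check (reusing the `best` and `sim1_mod` objects created above):

```{r}
# optim()'s parameters and lm()'s coefficients should agree
# (up to optim()'s numerical tolerance).
best$par
coef(sim1_mod)
```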
### Exercises
@@ -214,11 +232,11 @@ These are exactly the same values we got with `optim()`! However, behind the sce
Use `optim()` to fit this model to the simulated data above and compare it
to the linear model.
## Visualising models
For simple models, like the one above, you can figure out what pattern the model captures by carefully studying the model family and the fitted coefficients. And if you ever take a statistics course on modelling, you're likely to spend a lot of time doing just that. Here, however, we're going to take a different tack. We're going to focus on understanding a model by looking at its predictions. This has a big advantage: every type of predictive model makes predictions (otherwise what use would it be?), so we can use the same set of techniques to understand any type of predictive model.
It's also useful to see what the model doesn't capture: the so-called residuals, which are left after subtracting the predictions from the data. Residuals are powerful because they allow us to use models to remove striking patterns so we can study the subtler trends that remain.
### Predictions
@@ -231,14 +249,17 @@ grid
(This will get more interesting when we start to add more variables to our model.)
Next we add predictions. We'll use `modelr::add_predictions()`, which takes a data frame and a model. It adds the predictions from the model to a new column in the data frame:
```{r}
grid <- grid %>% add_predictions(sim1_mod)
grid
```
(You can also use this function to add predictions to your original dataset.)

Next, we plot the predictions. You might wonder about all this extra work compared to just using `geom_abline()`. But the advantage of this approach is that it will work with _any_ model in R, from the simplest to the most complex. You're only limited by your visualisation skills.
```{r}
ggplot(sim1, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred), data = grid, colour = "red", size = 1)
```
Note that this plot takes advantage of ggplot2's ability to plot different datasets with different aesthetics on different layers.
### Residuals
The flip-side of predictions are __residuals__. The predictions tell you the pattern that the model has captured, and the residuals tell you what the model has missed. The residuals are just the distances between the observed and predicted values that we computed above.
We add residuals to the data with `add_residuals()`, which works much like `add_predictions()`. Note, however, that we use the original dataset, not a manufactured grid. This is because to compute residuals we need actual y values.
```{r}
sim1 <- sim1 %>% add_residuals(sim1_mod)
sim1
```
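It's worth tying this back to the distance we defined earlier. As a quick sanity check (reusing `measure_distance()` and the fitted model from above), the root-mean-square of these residuals is exactly the distance between the fitted model and the data:

```{r}
# The RMS of the residuals equals the distance of the fitted model
# from the data, as measured by measure_distance().
sqrt(mean(sim1$resid ^ 2))
measure_distance(coef(sim1_mod), sim1)
```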
There are a few different ways to understand what the residuals tell us about the model. One way is to simply draw a frequency polygon to help us understand the spread of the residuals:
@@ -265,16 +285,17 @@ ggplot(sim1, aes(resid)) +
geom_freqpoly(binwidth = 0.5)
```
This helps you calibrate the quality of the model: how far away are the predictions from the observed values? Note that the average of the residuals will always be 0.
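You can verify that with a one-liner, using the `resid` column we just added:

```{r}
# For a linear model with an intercept, the residuals sum to zero,
# so their mean is zero up to floating-point error.
mean(sim1$resid)
```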
You'll often want to recreate plots using the residuals instead of the original predictor. You'll see a lot of that in the next chapter.
```{r}
ggplot(sim1, aes(x, resid)) +
  geom_ref_line(h = 0) +
  geom_point()
```
This looks like random noise, suggesting that our model has done a good job of capturing the patterns in the dataset.
### Exercises
@@ -283,58 +304,78 @@ There doesn't seem to be any pattern remaining here.
grid generation, predictions, and visualisation on `sim1` using
`loess()` instead of `lm()`. How does the result compare to
`geom_smooth()`?
1. `add_predictions()` is paired with `gather_predictions()` and
`spread_predictions()`. How do these three functions differ?
1. What does `geom_ref_line()` do? What package does it come from?
Why is displaying a reference line in plots showing residuals
useful and important?
1. Why might you want to look at a frequency polygon of absolute residuals?
What are the advantages? What are the disadvantages?
## Formulas and model families
You've seen formulas before when using `facet_wrap()` and `facet_grid()`. In R, formulas provide a general way of getting "special behaviour". Rather than evaluating the values of the variables right away, they capture them so they can be interpreted by the function.
The majority of modelling functions in R use a standard conversion from formulas to functions. You've seen one simple conversion already: `y ~ x` is translated to `y = a_1 + a_2 * x`. You can make this arbitrarily more complicated by adding more variables: `y ~ x1 + x2 + x3` is translated to `y = a_1 + a_2 * x1 + a_3 * x2 + a_4 * x3`. R always adds an intercept term unless you specifically remove it: `y ~ x - 1` is translated to `y = a_1 * x`.
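To see the intercept behaviour concretely, here's a small sketch that reuses `sim1` from earlier (purely as an illustration): the first formula estimates an intercept and a slope, the second a slope only.

```{r}
coef(lm(y ~ x, data = sim1))      # y = a_1 + a_2 * x
coef(lm(y ~ x - 1, data = sim1))  # y = a_1 * x, no intercept
```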
The following sections explore how this plays out in more detail.
### Categorical variables
Generating a function from a formula is straightforward when the predictor is continuous, but things get a bit more complicated when the predictor is categorical. Imagine you have a formula like `y ~ sex`, where sex could either be male or female. It doesn't make sense to convert that to a formula like `y = x_0 + x_1 * sex` because `sex` isn't a number - you can't multiply it! Instead what R does is convert it to `y = x_0 + x_1 * sex_male`, where `sex_male` is one if `sex` is male and zero otherwise. (These 0-1 columns are sometimes called dummy variables; machine learning calls the same process one-hot encoding, and statisticians describe it with a contrast matrix.)
If you want to see what R actually does, you can use the `model.matrix()` function. It takes similar inputs to `lm()` but returns the numeric matrix that R uses to fit the model. This is useful if you ever want to understand exactly which equation is generated by your formula.
```{r, echo = FALSE}
df <- frame_data(
  ~ sex, ~ response,
  "male", 1,
  "female", 2,
  "male", 1
)
as_tibble(model.matrix(response ~ sex, data = df))
```
You might wonder why R doesn't convert it to `y = x_0 + x_1 * sex_male + x_2 * sex_female`. Unfortunately this can't work, because we don't have enough information to figure out what the "intercept" is in this case. Understanding why this doesn't work in general requires some linear algebra (the idea of full rank), but in this simple case you can understand it with a small thought experiment. The estimate `x_0` corresponds to the global mean, `x_1` to the effect for males, and `x_2` to the effect for females. But all you can measure is the average response for males and the average response for females: that's two numbers, and there's no unique way to recover three parameters from them.

With the simpler function, `y = x_0 + x_1 * sex_male`, `x_0` corresponds to the average for females, and `x_1` to the difference between males and females. You'll always see this if you look at the coefficients of a model with a categorical variable: there is always one fewer coefficient than you might expect. Fortunately, however, if you focus on visualising predictions you don't need to worry about these details. Let's look at some data and models to make that concrete. Here's the `sim2` dataset from modelr:
```{r}
ggplot(sim2) +
geom_point(aes(x, y))
```
We can fit a model to it, and generate predictions:
```{r}
mod2 <- lm(y ~ x, data = sim2)
grid <- sim2 %>%
  expand(x) %>%
  add_predictions(mod2)
grid
```
Effectively, a model with a categorical `x` will predict the mean value for each category. (Why? Because the mean minimises the root-mean-squared distance.) That's easy to see if we overlay the predictions on top of the original data:
```{r}
ggplot(sim2, aes(x)) +
  geom_point(aes(y = y)) +
  geom_point(data = grid, aes(y = pred), colour = "red", size = 4)
```
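You can also check this numerically. A quick sketch with dplyr shows that the per-category means line up with the predictions in `grid`:

```{r}
# Per-category means of y; these should match grid$pred.
sim2 %>%
  group_by(x) %>%
  summarise(mean_y = mean(y))
```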
You can't make predictions about levels that you didn't observe. Sometimes you'll do this by accident, so it's good to recognise this error message:
```{r, error = TRUE}
tibble(x = "e") %>% add_predictions(mod2)
```
### Interactions
What happens when you combine a continuous and a categorical variable? `sim3` contains a categorical predictor and a continuous predictor. We can visualise it with a simple plot:
@@ -350,6 +391,8 @@ mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
```
When you add variables with `+`, the model will estimate each effect independently of all the others. It's possible to fit the so-called interaction by using `*`. For example, `y ~ x1 * x2` is translated to `y = a_1 + a_2 * x1 + a_3 * x2 + a_4 * x1 * x2`. Note that whenever you use `*`, both the interaction and the individual components are included in the model.
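One quick way to see the difference is to peek at the coefficients of `mod1` and `mod2` defined above (just a peek; to interpret the models we'll rely on predictions, as below):

```{r}
# mod2 contains every term in mod1, plus the x1:x2 interaction terms.
coef(mod1)
coef(mod2)
```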
To visualise these models we need two new tricks:
1. We have two predictors, so we need to give `expand()` two variables.
@@ -394,8 +437,6 @@ ggplot(sim3, aes(x1, resid, colour = x2)) +
There is little obvious pattern in the residuals for `mod2`. The residuals for `mod1` show that the model has clearly missed some pattern in `b`, and, to a lesser extent, in `c` and `d`. You might wonder if there's a precise way to tell which of `mod1` or `mod2` is better. There is, but it requires a lot of mathematical background, and we don't really care here. We're interested in a qualitative assessment of whether or not the model has captured the pattern that we're interested in.
### Two continuous variables
Let's take a look at the equivalent model for two continuous variables. Initially things proceed almost identically to the previous example:
```{r}
@@ -438,6 +479,13 @@ You can see that even with just two continuous variables, coming up with good vi
I spent some time looking at the residuals to see if I could figure out whether `mod2` did better than `mod1`. I think it does, but it's pretty subtle. You'll have a chance to work on it in the exercises.
### Transformations
You can also perform transformations inside the model formula. For example, `log(y) ~ sqrt(x1) + x2` is translated to `log(y) = a_1 + a_2 * sqrt(x1) + a_3 * x2`. If your transformation involves `+`, `*`, `^`, or `-`, you'll need to wrap it in `I()` so R doesn't treat it like part of the model specification. For example, `y ~ x + I(x ^ 2)` is translated to `y = a_1 + a_2 * x + a_3 * x^2`. If you forget the `I()` and specify `y ~ x ^ 2 + x`, R will compute `y ~ x * x + x`. `x * x` means the interaction of `x` with itself, which is the same as `x`. R automatically drops redundant variables, so `x + x` becomes `x`, meaning that `y ~ x ^ 2 + x` specifies the function `y = a_1 + a_2 * x`. That's probably not what you intended!
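Here's a small illustration of that gotcha, reusing `sim1` from earlier as a sketch:

```{r}
# Without I(), x ^ 2 collapses to x, so we only get an intercept and a slope.
coef(lm(y ~ x ^ 2 + x, data = sim1))
# With I(), we get the quadratic term we asked for.
coef(lm(y ~ x + I(x ^ 2), data = sim1))
```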
This formula notation is sometimes called "Wilkinson-Rogers notation", and was initially described in _Symbolic Description of Factorial Models for Analysis of Variance_, by G. N. Wilkinson and C. E. Rogers, Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 22, No. 3 (1973), pp. 392-399, <https://www.jstor.org/stable/2346786>. It's worth digging up and reading the original paper if you'd like to understand the full details of the modelling algebra.
### Exercises
1. Using the basic principles, convert the formulas in the following two
@@ -485,7 +533,6 @@ options(na.option = na.exclude)
I don't really understand why it's called `na.exclude` when it causes missing values to be included in the predictions, but there you have it.
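Here's a minimal sketch of that behaviour, using made-up data (`df_na` and `mod_na` are hypothetical names, not objects used elsewhere in this chapter):

```{r}
df_na <- tibble(x = c(1, 2, NA, 4), y = c(5, 9, 2, 14))
mod_na <- lm(y ~ x, data = df_na, na.action = na.exclude)

# The prediction vector stays the same length as the data,
# with NA where the input was missing.
predict(mod_na)
```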
## Other model families
Here we've focussed on linear models, which are a fairly limited space of models (but one that does include a first-order linear approximation of any more complicated model).