Use new model_matrix

hadley 2016-07-25 07:24:10 -05:00
parent 3836b6b352
commit f1cc2088f9
1 changed files with 2 additions and 2 deletions


@@ -325,7 +325,7 @@ The following sections explore how this plays out in more detail.
Generating a function from a formula is straightforward when the predictor is continuous, but things get a bit more complicated when the predictor is categorical. Imagine you have a formula like `y ~ sex`, where `sex` could either be male or female. It doesn't make sense to convert that to a formula like `y = x_0 + x_1 * sex` because `sex` isn't a number - you can't multiply it! Instead what R does is convert it to `y = x_0 + x_1 * sex_male` where `sex_male` is one if `sex` is male and zero otherwise.
-If you want to see what R actually does, you can use the `model.matrix()` function. It takes similar inputs to `lm()` but returns the numeric matrix that R uses to fit the model. This is useful if you ever want to understand exactly which equation is generated by your formula.
+If you want to see what R actually does, you can use the `model_matrix()` function. It takes a data frame and a formula and returns a tibble that defines the model equation: each column in the output is associated with one coefficient in the model. This is useful if you ever want to understand exactly which equation is generated by your formula.
```{r, echo = FALSE}
df <- frame_data(
@@ -334,7 +334,7 @@ df <- frame_data(
"female", 2,
"male", 1
)
-as_tibble(model.matrix(response ~ sex, data = df))
+model_matrix(df, response ~ sex)
```
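To make the indicator-column idea concrete, here is a minimal sketch that uses only base R's `model.matrix()`. The data frame is a hypothetical stand-in for the example data (its values are assumed for illustration), and `sex_male` is a helper name introduced here, not something R creates for you.

```{r}
# Hypothetical data: a categorical predictor and a numeric response.
df <- data.frame(
  sex = c("male", "female", "male"),
  response = c(1, 2, 1)
)

# Build the indicator column by hand: 1 when sex is "male", 0 otherwise.
sex_male <- as.numeric(df$sex == "male")

# model.matrix() builds the same column, plus an intercept column of 1s.
mm <- model.matrix(response ~ sex, data = df)
colnames(mm)

# The hand-built indicator matches the generated column.
all(mm[, "sexmale"] == sex_male)
```

R names the generated column by pasting the variable name and the level together, so `sex_male` in the equation above appears as `sexmale` in the matrix.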
The process of turning a categorical variable into a 0-1 matrix has different names. Sometimes the individual 0-1 columns are called dummy variables. In machine learning, it's called one-hot encoding. In statistics, the process is called creating a contrast matrix. It's a general example of "feature generation": taking things that aren't continuous variables and figuring out how to represent them.
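The different names correspond to slightly different encodings, which a short base R sketch can show. The `sex` vector here is hypothetical, chosen only for illustration. With the default treatment contrasts, one level is absorbed into the intercept, while dropping the intercept from the formula gives one 0-1 column per level, which is what is usually meant by one-hot encoding.

```{r}
# Hypothetical categorical variable with two levels.
sex <- factor(c("male", "female", "male"))

# The contrast matrix R uses by default: one row per level,
# one column per generated dummy variable.
contrasts(sex)

# Removing the intercept yields one indicator column per level.
one_hot <- model.matrix(~ sex - 1)
colnames(one_hot)

# Each row has exactly one 1, since each observation has exactly one level.
rowSums(one_hot)
```

Which encoding is appropriate depends on the model: with an intercept, the full set of indicator columns would be redundant, which is why the default drops one level.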