diff --git a/model-basics.Rmd b/model-basics.Rmd
index 3998291..bd955d6 100644
--- a/model-basics.Rmd
+++ b/model-basics.Rmd
@@ -325,7 +325,7 @@ The following sections explore how this plays out in more detail.
 
 Generating a function from a formula is straight forward when the predictor is continuous, but things get a bit more complicated when the predictor is categorical. Imagine you have a formula like `y ~ sex`, where sex could either be male or female. It doesn't make sense to convert that to a formula like `y = x_0 + x_1 * sex` because `sex` isn't a number - you can't multiply it! Instead what R does is convert it to `y = x_0 + x_1 * sex_male` where `sex_male` is one if `sex` is male and zero otherwise.
 
-If you want to see what R actually does, you can use the `model.matrix()` function. It takes similar inputs to `lm()` but returns the numeric matrix that R uses to fit the model. This is useful if you ever want to understand exactly which equation is generated by your formula.
+If you want to see what R actually does, you can use the `model_matrix()` function. It takes a data frame and a formula and returns a tibble that defines the model equation: each column in the output is associated with one coefficient in the model. This is useful if you ever want to understand exactly which equation is generated by your formula.
 
 ```{r, echo = FALSE}
 df <- frame_data(
@@ -334,7 +334,7 @@ df <- frame_data(
   "female", 2,
   "male", 1
 )
-as_tibble(model.matrix(response ~ sex, data = df))
+model_matrix(df, response ~ sex)
 ```
 
 The process of turning a categorical variable into a 0-1 matrix has different names. Sometimes the individual 0-1 columns are called dummy variables. In machine learning, it's called one-hot encoding. In statistics, the process is called creating a contrast matrix. General example of "feature generation": taking things that aren't continuous variables and figuring out how to represent them.
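
As a sanity check on the behaviour the changed paragraph describes, here is a minimal sketch using base R's `model.matrix()` (the call this patch replaces) so that it runs without extra packages. The toy data frame is an assumption made for illustration; it is not the `frame_data()` call from the file, part of which falls outside the hunks shown.

```r
# Hypothetical toy data (not the rows from model-basics.Rmd):
# three observations, two levels of `sex`.
df <- data.frame(
  sex = c("male", "female", "male"),
  response = c(1, 2, 1)
)

# Base R builds the numeric design matrix for the formula:
# one column per coefficient in the model.
mm <- model.matrix(response ~ sex, data = df)
print(mm)

# Inspect the generated column names and the 0-1 dummy column.
print(colnames(mm))
print(mm[, "sexmale"])
```

R names the generated column `sexmale` rather than `sex_male`, and `female` becomes the reference level because levels are sorted alphabetically by default, so the dummy column is 1 for the male rows and 0 otherwise.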