* spelling in text

'parameters', 'intuition', 'subtracting', 'original', 'formulas', 'slightly', 'focused', and 'forests'

* spelling of 'randomForest::randomForest()'

Instead of `randomForest::randomForrest()`

https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
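For reference, a minimal sketch exercising the corrected spelling. It is not part of the commit; the `iris` data and the `ntree` value are illustrative assumptions.

```r
# Illustration only: the correctly spelled randomForest::randomForest().
# iris and ntree = 500 are assumed example inputs, not from the commit.
library(randomForest)

fit <- randomForest::randomForest(Species ~ ., data = iris, ntree = 500)
print(fit)
```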
Will Beasley 2016-07-22 21:14:24 +02:00 committed by Hadley Wickham
parent da29b50540
commit 544766d0fd
1 changed file with 8 additions and 8 deletions


@@ -72,7 +72,7 @@ ggplot(sim1, aes(x, y)) +
geom_point()
```
-You can see a strong pattern in the data. Let's use a model to capture that pattern and make it explicit. It's our job to supply the basic form of the model. In this case, the relationship looks linear, i.e. `y = a_0 + a_1 * x`. Let's start by getting a feel for what models from that family look like by randomly generating a few and overlaying them on the data. For this simple case, we can use `geom_abline()` which takes a slope and intercept as paramaters. Later on we'll learn more general techniques that work with any model.
+You can see a strong pattern in the data. Let's use a model to capture that pattern and make it explicit. It's our job to supply the basic form of the model. In this case, the relationship looks linear, i.e. `y = a_0 + a_1 * x`. Let's start by getting a feel for what models from that family look like by randomly generating a few and overlaying them on the data. For this simple case, we can use `geom_abline()` which takes a slope and intercept as parameters. Later on we'll learn more general techniques that work with any model.
```{r}
models <- tibble(
@@ -85,7 +85,7 @@ ggplot(sim1, aes(x, y)) +
geom_point()
```
-There are 250 models on this plot, but a lot are really bad! We need to find the good models by making precise our intution that a good model is "close" to the data. We need a way to quantify the distance between the data and a model. Then we can fit the model by finding the value of `a_0` and `a_1` that generate the model with the smallest distance from this data.
+There are 250 models on this plot, but a lot are really bad! We need to find the good models by making precise our intuition that a good model is "close" to the data. We need a way to quantify the distance between the data and a model. Then we can fit the model by finding the value of `a_0` and `a_1` that generate the model with the smallest distance from this data.
One easy place to start is to find the vertical distance between each point and the model, as in the following diagram. (Note that I've shifted the x values slightly so you can see the individual distances.)
@@ -236,7 +236,7 @@ These are exactly the same values we got with `optim()`! Behind the scenes `lm()
For simple models, like the one above, you can figure out what pattern the model captures by carefully studying the model family and the fitted coefficients. And if you ever take a statistics course on modelling, you're likely to spend a lot of time doing just that. Here, however, we're going to take a different tack. We're going to focus on understanding a model by looking at its predictions. This has a big advantage: every type of predictive model makes predictions (otherwise what use would it be?) so we can use the same set of techniques to understand any type of predictive model.
-It's also useful to see what the model doesn't capture, the so called residuals which are left after subsracting the predictions from the data. Residuals are a powerful because they allow us to use models to remove striking patterns so we can study the subtler trends that remain.
+It's also useful to see what the model doesn't capture, the so called residuals which are left after subtracting the predictions from the data. Residuals are a powerful because they allow us to use models to remove striking patterns so we can study the subtler trends that remain.
### Predictions
@@ -360,7 +360,7 @@ grid <- sim2 %>%
grid
```
-Effectively, a model with a categorical `x` will predict the mean value for each category. (Why? Because the mean minimise the root-mean-squared distance.) That's easy to see if we overlay the predictions on top of the orignal data:
+Effectively, a model with a categorical `x` will predict the mean value for each category. (Why? Because the mean minimise the root-mean-squared distance.) That's easy to see if we overlay the predictions on top of the original data:
```{r}
ggplot(sim2, aes(x)) +
@@ -487,7 +487,7 @@ This formula notation is sometimes called "Wilkinson-Rogers notation", and was i
### Exercises
-1. Using the basic principles, convert the formuals in the following two
+1. Using the basic principles, convert the formulas in the following two
models into functions. (Hint: start by converting the categorical variable
into 0-1 variables.)
@@ -497,7 +497,7 @@ This formula notation is sometimes called "Wilkinson-Rogers notation", and was i
```
1. For `sim4`, which of `mod1` and `mod2` is better? I think `mod2` does a
-slighty better job at removing patterns, but it's pretty subtle. Can you
+slightly better job at removing patterns, but it's pretty subtle. Can you
come up with a plot to support my claim?
## Missing values
@@ -534,7 +534,7 @@ I don't really understand why it's called `na.exclude` when it causes missing va
## Other model families
-Here we've focussed on linear models, which is a fairly limited space (but it does include a first-order linear approximation of any more complicated model).
+Here we've focused on linear models, which is a fairly limited space (but it does include a first-order linear approximation of any more complicated model).
Some extensions of linear models are:
@@ -564,6 +564,6 @@ Some extensions of linear models are:
way to linear models. They fit a piece-wise constant model, splitting the
data into progressively smaller and smaller pieces. Trees aren't terribly
effective by themselves, but they are very powerful when used in aggregated
-by models like random forrests (e.g. `randomForest::randomForrest()`) or
+by models like random forests (e.g. `randomForest::randomForest()`) or
in gradient boosting machines (e.g. `xgboost::xgboost`.)
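As a sanity check on the identifiers named in the final hunk, here is a minimal sketch calling both ensembles. None of this comes from the diff; `mtcars`, the column split, and `nrounds = 25` are assumptions for illustration.

```r
# Illustration only: exercise the two ensemble functions named above.
library(randomForest)
library(xgboost)

# Random forest on an assumed example dataset (mpg as the response).
rf <- randomForest::randomForest(mpg ~ ., data = mtcars)

# Gradient boosting on the same data; nrounds = 25 is an arbitrary choice.
xgb <- xgboost::xgboost(
  data    = as.matrix(mtcars[, -1]),
  label   = mtcars$mpg,
  nrounds = 25,
  verbose = 0
)
```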