Fix typos (#432)

* Fix typos

* Fix typos
behrman 2016-10-03 06:37:48 -07:00 committed by Hadley Wickham
parent 5872fd7811
commit 5aa30a729c
2 changed files with 32 additions and 31 deletions

View File

@ -4,7 +4,7 @@
In the previous chapter you learned how linear models worked, and learned some basic tools for understanding what a model is telling you about your data. The previous chapter focussed on simulated datasets to help you learn about how models work. This chapter will focus on real data, showing you how you can progressively build up a model to aid your understanding of the data.
We will take advantage of the fact that you can think about a model partitioning your data into pattern and residuals. We'll find patterns with visualisation, then make them concrete and precise with a model. We'll them repeat the process, but replace the old response variable with the residuals from the model. The goal is to transition from implicit knowledge in the data and your head to explicit knowledge in a quantitative model. This makes it easier to apply to new domains, and easier for others to use.
We will take advantage of the fact that you can think about a model partitioning your data into pattern and residuals. We'll find patterns with visualisation, then make them concrete and precise with a model. We'll then repeat the process, but replace the old response variable with the residuals from the model. The goal is to transition from implicit knowledge in the data and your head to explicit knowledge in a quantitative model. This makes it easier to apply to new domains, and easier for others to use.
For very large and complex datasets this will be a lot of work. There are certainly alternative approaches - a more machine learning approach is simply to focus on the predictive ability of the model. These approaches tend to produce black boxes: the model does a really good job at generating predictions, but you don't know why. This is a totally reasonable approach, but it does make it hard to apply your real world knowledge to the model. That, in turn, makes it difficult to assess whether or not the model will continue to work in the long-term, as fundamentals change. For most real models, I'd expect you to use some combination of this approach and a more classic automated approach.
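As a rough sketch of the pattern-then-residuals workflow described above (the data frame `df` and the variables `y`, `x`, and `z` are hypothetical stand-ins, not objects from this chapter):
```{r}
# A minimal sketch: make the strongest pattern explicit, then model what's
# left over. `df`, `y`, `x`, and `z` are placeholders, not real objects.
library(modelr)

mod1 <- lm(y ~ x, data = df)
df <- add_residuals(df, mod1, "resid1")   # the part mod1 can't explain

# Repeat the process, with the residuals as the new response:
mod2 <- lm(resid1 ~ z, data = df)
df <- add_residuals(df, mod2, "resid2")
```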
@ -51,7 +51,7 @@ Note that the worst diamond color is J (slightly yellow), and the worst clarity
### Price and carat
It looks like lower quality diamonds have higher prices because there is an important cofounding variable: the weight (`carat`) of the diamond. The weight of the diamond is the single most important factor for determining the price of the diamond, and lower quality diamonds tend to be larger.
It looks like lower quality diamonds have higher prices because there is an important confounding variable: the weight (`carat`) of the diamond. The weight of the diamond is the single most important factor for determining the price of the diamond, and lower quality diamonds tend to be larger.
```{r}
ggplot(diamonds, aes(carat, price)) +
@ -118,7 +118,7 @@ ggplot(diamonds2, aes(clarity, lresid)) + geom_boxplot()
Now we see the relationship we expect: as the quality of the diamond increases, so too does its relative price. To interpret the `y` axis, we need to think about what the residuals are telling us, and what scale they are on. A residual of -1 indicates that `lprice` was 1 unit lower than a prediction based solely on its weight. $2^{-1}$ is 1/2, so points with a value of -1 are half the expected price, and residuals with value 1 are twice the predicted price.
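To sanity-check that interpretation you can back-transform a few values from the log2 scale (a tiny illustrative snippet, not part of the original analysis):
```{r}
# Residuals on a log2 scale back-transform to multiplicative price ratios:
# -1 -> half the predicted price, 0 -> as predicted, 1 -> double.
2 ^ c(-1, 0, 1)
#> [1] 0.5 1.0 2.0
```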
### A model complicated model
### A more complicated model
If we wanted to, we could continue to build up our model, moving the effects we've observed into the model to make them explicit. For example, we could include `color`, `cut`, and `clarity` into the model so that we also make explicit the effect of these three categorical variables:
@ -126,7 +126,7 @@ If we wanted to, we could continue to build up our model, moving the effects we'
mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)
```
This model now includes four predictors, so it's getting harder to visualise. Fortunately, they're currently all independent which means that we can plot them individually in four plots. To make the process a little easier, we're going to use the `model` argument to `data_grid`:
This model now includes four predictors, so it's getting harder to visualise. Fortunately, they're currently all independent which means that we can plot them individually in four plots. To make the process a little easier, we're going to use the `.model` argument to `data_grid`:
```{r}
grid <- diamonds2 %>%
@ -138,7 +138,7 @@ ggplot(grid, aes(cut, pred)) +
geom_point()
```
If the model needs variables that you haven't explicitly supplied, `data_grid()` will automatically fill them in with "typical" value. For continous variables, it uses the median, and categorical variables it uses the most common value (or values, if there's a tie).
If the model needs variables that you haven't explicitly supplied, `data_grid()` will automatically fill them in with "typical" values. For continuous variables, it uses the median, and for categorical variables it uses the most common value (or values, if there's a tie).
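As a sketch of what that looks like in practice (reusing `mod_diamond2` from above, and assuming the modelr functions loaded at the start of the chapter), you can vary one predictor and let `data_grid()` supply typical values for the rest:
```{r}
# Vary clarity only; data_grid() fills in lcarat, color, and cut with
# "typical" values because mod_diamond2 needs them.
diamonds2 %>%
  data_grid(clarity, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
```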
```{r}
diamonds2 <- diamonds2 %>%
@ -148,7 +148,7 @@ ggplot(diamonds2, aes(lcarat, lresid2)) +
geom_hex(bins = 50)
```
This plot indicates that there are some diamonds with quite large residuals - remember a residual of 4 indicates that the diamond is 4x the price that we expected. It's often useful to look at unusual values individually:
This plot indicates that there are some diamonds with quite large residuals - remember a residual of 2 indicates that the diamond is 4x the price that we expected. It's often useful to look at unusual values individually:
```{r}
diamonds2 %>%
@ -167,10 +167,10 @@ Nothing really jumps out at me here, but it's probably worth spending time consi
strips. What do they represent?
1. If `log(price) = a_0 + a_1 * log(carat)`, what does that say about
the relationship between `price` and `carat?
the relationship between `price` and `carat`?
1. Extract the diamonds that have very high and very low residuals.
Is there any unusual about these diamonds? Are the particularly bad
Is there anything unusual about these diamonds? Are they particularly bad
or good, or do you think these are pricing errors?
1. Does the final model, `mod_diamonds2`, do a good job of predicting
@ -183,7 +183,7 @@ Let's work through a similar process for a dataset that seems even simpler at fi
```{r}
daily <- flights %>%
mutate(date = make_datetime(year, month, day)) %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n())
daily
@ -244,7 +244,8 @@ Note the change in the y-axis: now we are seeing the deviation from the expected
Our model fails to accurately predict the number of flights on Saturday:
during summer there are more flights than we expect, and during Fall there
are fewer. We'll see how we can do capture this pattern in the next section.
are fewer. We'll see how we can do better to capture this pattern in the
next section.
1. There are some days with far fewer flights than expected:
@ -284,7 +285,7 @@ daily %>%
ggplot(aes(date, n)) +
geom_point() +
geom_line() +
scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b")
scale_x_date(NULL, date_breaks = "1 month", date_labels = "%b")
```
(I've used both points and lines to make it more clear what is data and what is interpolation.)
@ -298,7 +299,7 @@ Lets create a "term" variable that roughly captures the three school terms, and
```{r}
term <- function(date) {
cut(date,
breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)),
breaks = ymd(20130101, 20130605, 20130825, 20140101),
labels = c("spring", "summer", "fall")
)
}
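# A quick check of what term() returns (an added illustration, not part of
# the original chunk): dates get bucketed into the three school terms.
term(ymd(20130704))
#> [1] summer
#> Levels: spring summer fall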
@ -311,7 +312,7 @@ daily %>%
ggplot(aes(date, n, colour = term)) +
geom_point(alpha = 1/3) +
geom_line() +
scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b")
scale_x_date(NULL, date_breaks = "1 month", date_labels = "%b")
```
(I manually tweaked the dates to get nice breaks in the plot. Using a visualisation to help you understand what your function is doing is a really powerful and general technique.)
@ -389,7 +390,7 @@ Either approach is reasonable. Making the transformed variable explicit is usefu
### Time of year: an alternative approach
In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model. An alternative to using making our knowledge explicit in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in. A simple linear trend isn't adeqaute, so we could try using a natural spline to fit a smooth curve across the year:
In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model. An alternative to making our knowledge explicit in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in. A simple linear trend isn't adequate, so we could try using a natural spline to fit a smooth curve across the year:
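As a rough sketch of the idea (the exact specification in the chunk below may differ, and this assumes the `wday` variable added to `daily` earlier in the chapter):
```{r}
# A sketch only: a natural spline with a handful of degrees of freedom lets
# the trend bend smoothly over the year, and the interaction with wday lets
# each day of the week follow its own curve.
library(splines)
mod_flexible <- lm(n ~ wday * ns(date, 5), data = daily)
```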
```{r}
library(splines)
@ -408,7 +409,7 @@ We see a strong pattern in the numbers of Saturday flights. This is reassuring,
### Exercises
1. Use your google sleuthing skills to brainstorm why there were fewer than
1. Use your Google sleuthing skills to brainstorm why there were fewer than
expected flights on Jan 20, May 26, and Sep 9. (Hint: they all have the
same explanation.) How would these days generalise to another year?
@ -417,7 +418,7 @@ We see a strong pattern in the numbers of Saturday flights. This is reassuring,
```{r}
daily %>%
filter(resid > 80)
top_n(3, resid)
```
1. Create a new variable that splits the `wday` variable into terms, but only
@ -438,17 +439,17 @@ We see a strong pattern in the numbers of Saturday flights. This is reassuring,
1. We hypothesised that people leaving on Sundays are more likely to be
business travellers who need to be somewhere on Monday. Explore that
hypothesis by seeing how it breaks down based on distance and itme: if
hypothesis by seeing how it breaks down based on distance and time: if
it's true, you'd expect to see more Sunday evening flights to places that
are far away.
1. It's a little frustrating that Sunday and Saturday are on separate ends
of the plot. Write a small function to set the manipulate the levels of the
of the plot. Write a small function to set the levels of the
factor so that the week starts on Monday.
## Learning more about models
We have only scratched the absolute surface of modelling, but you have hopefully gained some simple, but general purpose tools that you can use to improve your own data analyses. It's ok to start simple! As you've seen, even very simple models can make a dramatic difference in your ability to tease out interactions between variables.
We have only scratched the absolute surface of modelling, but you have hopefully gained some simple, but general-purpose tools that you can use to improve your own data analyses. It's OK to start simple! As you've seen, even very simple models can make a dramatic difference in your ability to tease out interactions between variables.
These modelling chapters are even more opinionated than the rest of the book. I approach modelling from a somewhat different perspective to most others, and there is relatively little space devoted to it. Modelling really deserves a book on its own, so I'd highly recommend that you read at least one of these three books:
@ -470,5 +471,5 @@ These modelling chapters are even more opinionated than the rest of the book. I
* *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson,
<http://appliedpredictivemodeling.com>. This book is a companion to the
__caret__ package, and provides practical tools for dealing with real-life
__caret__ package and provides practical tools for dealing with real-life
predictive modelling challenges.

View File

@ -15,7 +15,7 @@ In this chapter you're going to learn three powerful ideas that help you to work
because once you have tidy data, you can apply all of the techniques that
you've learned about earlier in the book.
We'll start by diving in to a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.
We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.
The following sections will dive into more detail about the individual techniques:
@ -56,7 +56,7 @@ library(tidyr)
To motivate the power of many simple models, we're going to look into the "gapminder" data. This data was popularised by Hans Rosling, a Swedish doctor and statistician. If you've never heard of him, stop reading this chapter right now and go watch one of his videos! He is a fantastic data presenter and illustrates how you can use data to present a compelling story. A good place to start is this short video filmed in conjunction with the BBC: <https://www.youtube.com/watch?v=jbkSRLYSojo>.
The gapminder data summarises the progression of countries over time, looking at statistics like life expentancy and GDP. The data is easy to access in R, thanks to Jenny Bryan who created the gapminder package:
The gapminder data summarises the progression of countries over time, looking at statistics like life expectancy and GDP. The data is easy to access in R, thanks to Jenny Bryan who created the gapminder package:
```{r}
library(gapminder)
@ -71,7 +71,7 @@ gapminder %>%
geom_line(alpha = 1/3)
```
This is a small dataset: it only has ~1,700 observations and 3 variables. But it's still hard to see what's going on! Overall, it looks like life expectency has been steadily improving. However, if you look closely, you might notice some countries that don't follow this pattern. How can we make those countries easier to see?
This is a small dataset: it only has ~1,700 observations and 3 variables. But it's still hard to see what's going on! Overall, it looks like life expectancy has been steadily improving. However, if you look closely, you might notice some countries that don't follow this pattern. How can we make those countries easier to see?
One way is to use the same approach as in the last chapter: there's a strong signal (overall linear growth) that makes it hard to see subtler trends. We'll tease these factors apart by fitting a model with a linear trend. The model captures steady growth over time, and the residuals will show what's left.
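For a single country that recipe looks something like this (a minimal sketch; New Zealand is an arbitrary choice, and the packages loaded at the start of the chapter are assumed):
```{r}
# A minimal sketch for one country: fit the linear trend, then plot what it
# leaves behind.
nz <- gapminder %>% filter(country == "New Zealand")
nz_mod <- lm(lifeExp ~ year, data = nz)

nz %>%
  add_residuals(nz_mod) %>%
  ggplot(aes(year, resid)) +
  geom_line()
```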
@ -202,7 +202,7 @@ resids %>%
facet_wrap(~continent)
```
It looks like we've missed some mild quadratic pattern. There's also something intersting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
It looks like we've missed some mild quadratic pattern. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
### Model quality
@ -267,7 +267,7 @@ We see two main effects here: the tragedies of the HIV/AIDS epidemic and the Rwa
`year` so that it has mean zero.)
1. Explore other methods for visualising the distribution of $R^2$ per
continent. You might want to try the ggbeeswarm pakage, which provides
continent. You might want to try the ggbeeswarm package, which provides
similar methods for avoiding overlaps as jitter, but uses deterministic
methods.
@ -279,7 +279,7 @@ We see two main effects here: the tragedies of the HIV/AIDS epidemic and the Rwa
## List-columns
Now that you've seen a basic workflow for managing many models, lets dive back into some of the details. In this section, we'll explore the list-column data structure in a little more detail. It's only recently that I've really appreciated the idea of the list-column. List-columns are implicit in the defintion of the data frame: a data frame is a named list of equal length vectors. A list is a vector, so it's always been legitimate to put use a list as a column of a data frame. However, base R doesn't make it easy to create list-columns, and `data.frame()` treats a list as a list of columns:.
Now that you've seen a basic workflow for managing many models, let's dive back into some of the details. In this section, we'll explore the list-column data structure in a little more detail. It's only recently that I've really appreciated the idea of the list-column. List-columns are implicit in the definition of the data frame: a data frame is a named list of equal length vectors. A list is a vector, so it's always been legitimate to use a list as a column of a data frame. However, base R doesn't make it easy to create list-columns, and `data.frame()` treats a list as a list of columns:
```{r}
data.frame(x = list(1:3, 3:5))
@ -354,7 +354,7 @@ gapminder %>%
### From vectorised functions
Some useful fuctions take an atomic vector and return a list. For example, in [strings] you learned about `stringr::str_split()` which takes a character vector and returns a list of charcter vectors. If you use that inside mutate, you'll get a list-column:
Some useful functions take an atomic vector and return a list. For example, in [strings] you learned about `stringr::str_split()` which takes a character vector and returns a list of character vectors. If you use that inside mutate, you'll get a list-column:
```{r}
df <- tibble(x1 = c("a,b,c", "d,e,f,g"))
@ -371,7 +371,7 @@ df %>%
unnest()
```
(If you find yourself using this pattern alot, make sure to check out `tidyr:separate_rows()` which is a wrapper around this common pattern).
(If you find yourself using this pattern a lot, make sure to check out `tidyr::separate_rows()`, which is a wrapper around this common pattern).
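For instance, with the `df` defined above (a small sketch):
```{r}
# separate_rows() splits the comma-separated strings and unnests in one step.
df %>% separate_rows(x1, sep = ",")
```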
Another example of this pattern is using the `map()`, `map2()`, `pmap()` from purrr. For example, we could take the final example from [Invoking different functions] and rewrite it to use `mutate()`:
@ -387,7 +387,7 @@ sim %>%
mutate(sims = invoke_map(f, params, n = 10))
```
Note that technically `sim` isn't homogenous because it contains both double and integer vectors. However, this is unlikely to cause many problems since integers and doubles are both numeric vectors.
Note that technically `sim` isn't homogeneous because it contains both double and integer vectors. However, this is unlikely to cause many problems since integers and doubles are both numeric vectors.
### From multivalued summaries
@ -501,7 +501,7 @@ df %>% mutate(
)
```
This is the same basic information that you get from the default tbl print method, but now you can use it for filtering. This is a useful technique if you have a heterogenous list, and want to filter out the parts aren't working for you.
This is the same basic information that you get from the default tbl print method, but now you can use it for filtering. This is a useful technique if you have a heterogeneous list, and want to filter out the parts that aren't working for you.
Don't forget about the `map_*()` shortcuts - you can use `map_chr(x, "apple")` to extract the string stored in `apple` for each element of `x`. This is useful for pulling apart nested lists into regular columns. Use the `.null` argument to provide a value to use if the element is missing (instead of returning `NULL`):
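Here is a small sketch of what that might look like (the list contents are made up, and note that newer purrr releases spell this argument `.default`):
```{r}
# Extract "apple" from each element; elements without it yield NA instead of
# an error, thanks to the .null argument described above.
x <- list(
  list(apple = "red", banana = "yellow"),
  list(banana = "green")
)
map_chr(x, "apple", .null = NA_character_)
```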
@ -561,7 +561,7 @@ The same principle applies when unnesting list-columns of data frames. You can u
## Making tidy data with broom
The broom package provides three general tools for turning models in to tidy data frames:
The broom package provides three general tools for turning models into tidy data frames:
1. `broom::glance(model)` returns a row for each model. Each column gives a
model summary: either a measure of model quality, or complexity, or a