From 144382510ef0de32c1033fd7895630d035fe130a Mon Sep 17 00:00:00 2001 From: Curtis Alexander Date: Wed, 23 Mar 2016 09:41:34 -0500 Subject: [PATCH 01/19] typos --- iteration.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/iteration.Rmd b/iteration.Rmd index 70583e1..ded5952 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -509,7 +509,7 @@ Never feel bad about using a for loop instead of a function. The map functions a ### Shortcuts -There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits the up the `mtcars` dataset in to three pieces (one for each value of cylinder) and fits the same linear model to each piece: +There are a few shortcuts that you can use with `.f` in order to save a little typing. Imagine you want to fit a linear model to each group in a dataset. The following toy example splits up the `mtcars` dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece: ```{r} models <- mtcars %>% From d44c641c3f45b6b5323eaa80d4453c0ff5493a7f Mon Sep 17 00:00:00 2001 From: Ian Lyttle Date: Sun, 27 Mar 2016 15:26:57 -0500 Subject: [PATCH 02/19] just a couple of typos --- iteration.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/iteration.Rmd b/iteration.Rmd index bd7e004..d9b2b78 100644 --- a/iteration.Rmd +++ b/iteration.Rmd @@ -17,7 +17,7 @@ In [functions], we talked about how important it is to reduce duplication in you 1. You're likely to have fewer bugs because each line of code is used in more places. -One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into indepdent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.) +One part of reducing duplication is writing functions. Functions allow you to identify repeated patterns of code and extract them out into independent pieces that you can reuse and easily update as code changes. Iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets. (Generally, you won't need to use explicit iteration to deal with different subsets of your data: in most cases the implicit iteration in dplyr will take care of that problem for you.) In this chapter you'll learn about two important iteration paradigms: imperative programming and functional programming, and the machinary each provides. On the imperative side you have things like for loops and while loops, which are a great place to start because they make iteration very explicit, so it's obvious what's happening. However, for loops are quite verbose, and include quite a bit of book-keeping code, that is duplicated for every for loop. Functional programming (FP) offers tools to extract out this duplicated code, so each common for loop pattern gets its own function. Once you master the vocabulary of FP, you can solve many common iteration problems with less code, more ease, and fewer errors. @@ -114,7 +114,7 @@ That's all there is to the for loop! 
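For example, here is a minimal sketch of the output/sequence/body template the chapter describes, using a toy data frame made up for illustration (it is not one of the chapter's own examples):

```{r}
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

output <- vector("double", ncol(df))  # 1. output: allocate space up front
for (i in seq_along(df)) {            # 2. sequence: loop over the columns
  output[[i]] <- median(df[[i]])      # 3. body: compute one result per column
}
output
```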
Now is a good time to practice creating som 1. Compute the mean of every column in the `mtcars`. 1. Determine the type of each column in `nycflights13::flights`. 1. Compute the number of unique values in each column of `iris`. - 1. Generate 10 random normals for each of $mu = -10$, $0$, $10$, and $100$. + 1. Generate 10 random normals for each of $\mu = -10$, $0$, $10$, and $100$. Think about output, sequence, and body, __before__ you start writing the loop. From 03e81bfee67284b920edf1980bed5edea3b9b937 Mon Sep 17 00:00:00 2001 From: OaCantona Date: Mon, 25 Apr 2016 08:39:48 +0200 Subject: [PATCH 03/19] Typo --- datetimes.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/datetimes.Rmd b/datetimes.Rmd index 65ca13d..5baf26b 100644 --- a/datetimes.Rmd +++ b/datetimes.Rmd @@ -400,7 +400,7 @@ datetimes %>% ### Setting dates -You can also use each accessor funtion to set the components of a date or datetime. +You can also use each accessor function to set the components of a date or datetime. ```{r} datetime From 91d49fb470acf7f889dd780daf01303be1bc3db4 Mon Sep 17 00:00:00 2001 From: hadley Date: Sun, 1 May 2016 07:57:13 +0100 Subject: [PATCH 04/19] Can now use CRAN nycflights13 --- DESCRIPTION | 1 - 1 file changed, 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index cc6964e..3c19c7a 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -38,6 +38,5 @@ Remotes: hadley/purrr, hadley/stringr, hadley/ggplot2, - hadley/nycflights13, yihui/knitr, rstudio/bookdown From 53b6bc09e3615b9fe264a0bdc37abd5e7cac0eac Mon Sep 17 00:00:00 2001 From: hadley Date: Thu, 5 May 2016 09:27:06 -0500 Subject: [PATCH 05/19] Start dumping ideas into model-vis --- model-vis.Rmd | 146 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 145 insertions(+), 1 deletion(-) diff --git a/model-vis.Rmd b/model-vis.Rmd index ac429e7..3f14a46 100644 --- a/model-vis.Rmd +++ b/model-vis.Rmd @@ -1,3 +1,147 @@ +```{r setup, include = FALSE} +library(broom) +library(ggplot2) +library(dplyr) +``` + # Model visualisation -Gap minder +In this chapter we will explore model visualisation from two different sides: + +1. Use a model to make it easier to see important patterns in our data. + +1. Use visualisation to understand what a model is telling us about our data. + +We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. + +In the next chapter, you'll also learn about how to visualisation the model-level summaries, and the model parameters. + +## Residuals + +To motivate the use of models we're going to start with an interesting pattern from the NYC flights dataset: the number of flights per day. + +```{r} +library(nycflights13) +library(lubridate) +library(dplyr) + +daily <- flights %>% + mutate(date = make_datetime(year, month, day)) %>% + group_by(date) %>% + summarise(n = n()) + +ggplot(daily, aes(date, n)) + + geom_line() +``` + +Understand this pattern is challenging because there's a very strong day-of-week effect that dominates the subtler patterns: + +```{r} +daily <- daily %>% + mutate(wday = wday(date, label = TRUE)) + +ggplot(daily, aes(wday, n)) + + geom_boxplot() +``` + +The explanation for the low number of flights on Saturdays is because this dataset only has departures: we're only seeing people leaving New York. 
The majority of air travellers are travelling for business, not pleasure, and most people avoid leaving on the weekend. (The explanation for Sunday is that sometimes you need to be somewhere for a meeting on Monday morning and you have to leave the day before to get there.) + +One way to remove this strong pattern is to fit a model that "explains" the day of week effect, and then look at the residuals: + +```{r} +mod <- lm(n ~ wday, data = daily) +daily$n_resid <- resid(mod) + +ggplot(daily, aes(date, n_resid)) + + geom_line() +``` + +Note the change in the y-axis: now we are seeing the deviation from the expected number of flights, given the day of week. This plot is interesting because now that we've removed the very day-of-week effect, we can see some of the subtler patterns that remain + +1. There are some with very few flights. If you're familiar with American + public holidays, you might spot New Year's day, July 4th, Thanksgiving + and Christmas. There are some others that dont' seem to correspond to + + + ```{r} + daily %>% filter(n_resid < -100) + ``` + +1. There seems to be some smoother long term trend over the course of a year: + there are fewer flights in January, and more in summer (May-Sep). We can't + do much more with this trend than note it because we only have a single + year of data. + +1. Our day of week adjustment seems to fail starting around June: you can + still see a strong regular pattern that our model hasn't removed. + +We'll tackle the day of week effect first. Let's start by tweaking our plot drawing one line for each day of the week. + +```{r} +ggplot(daily, aes(date, n_resid, colour = wday)) + + geom_line() +``` + +This makes it clear that the problem with our model is Saturdays: it seems like during some there are more flights on Saturdays than we expect, and during Fall there are fewer. I suspect this is because of summer holidays: many people going on holiday in the summer, and people don't mind travelling on Saturdays for vacation. + +Let's zoom in on that pattern, this time looking at the raw numbers: + +```{r} +daily %>% + filter(wday == "Sat") %>% + ggplot(aes(date, n)) + + geom_line() + + scale_x_datetime(date_breaks = "1 month", date_labels = "%d-%b") +``` + +So it looks like summer holidays is from early June to late August. And that seems to line up fairly well with the state's school holidays : Jun 26 - Sep 9. So lets add a "school" variable to attemp to control for that. + +```{r} +daily <- daily %>% + mutate(school = cut(date, + breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)), + labels = c("spring", "summer", "fall") + )) + +daily %>% + filter(wday == "Sat") %>% + ggplot(aes(date, n, colour = school)) + + geom_line() + + scale_x_datetime(date_breaks = "1 month", date_labels = "%d-%b") +``` + +There are many ways we could incorporate this term into our model, but I'm going to do something quick-and-dirty: I'll use it as an interaction with `wday`. This is overkill because we don't have any evidence to suggest that the other days vary in the same way as a Saturdays, but so we end up overspending our degrees of freedom. + +mean vs. median. + +```{r} +mod2 <- MASS::rlm(n ~ wday * school, data = daily) +daily$n_resid2 <- resid(mod2) + +ggplot(daily, aes(date, n_resid2)) + + # geom_line(aes(y = n_resid), colour = "red") + + geom_line() +``` + +### Exercises + + +1. Use your google sleuthing skills to brainstorm why there were fewer than + expected flights on Jan 20, May 26, and Sep 9. (Hint: they all have the + same explanation.) 
+ +1. Above we made the hypothesis that people leaving on Sundays are more + likely to be business travellers who need to be somewhere on Monday. + Explore that hypothesis by seeing how it breaks down based on distance: + if it's true, you'd expect to see more Sunday flights to places that + are far away. + +## Predictions + +Focus on predictions from a model because this works for any type of model. Visualising parameters can also be useful, but tends to be most useful when you have many similar models. Visualising predictions works regardless of the model family. + +```{r} + +``` + +Visualising high-dimensional models is challenging. You'll need to partition off a useable slice at a time. From 5cad675fd0f9ef1f2274375d0ac012221f409fdc Mon Sep 17 00:00:00 2001 From: hadley Date: Thu, 5 May 2016 10:34:11 -0500 Subject: [PATCH 06/19] Attempt to fix build failure --- communicate-plots.Rmd | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/communicate-plots.Rmd b/communicate-plots.Rmd index 8a2950c..ab1755d 100644 --- a/communicate-plots.Rmd +++ b/communicate-plots.Rmd @@ -1,12 +1,11 @@ +# Communication with plots + *Section 2* will show you how to prepare your plots for communication. You'll learn how to make your plots more legible with titles, labels, zooming, and default visual themes. -```{r echo = FALSE, messages = FALSE, warning=FALSE} +```{r include = FALSE} library(ggplot2) ``` - -# Communication with plots - The previous sections showed you how to make plots that you can use as a tools for _exploration_. When you made these plots, you knew---even before you looked at them---which variables the plot would display and which data sets the variables would come from. You might have even known what to look for in the completed plots, assuming that you made each plot with a goal in mind. As a result, it was not very important to put a title or a useful set of labels on your plots. The importance of titles and labels changes once you use your plots for _communication_. Your audience will not share your background knowledge. In fact, they may not know anything about your plots except what the plots themselves display. If you want your plots to communicate your findings effectively, you will need to make them as self-explanatory as possible. From df434083dcd5c73770daee117606718c64975c0c Mon Sep 17 00:00:00 2001 From: hadley Date: Thu, 5 May 2016 11:04:52 -0500 Subject: [PATCH 07/19] Try adding content to chapters --- dynamic-documents.Rmd | 1 + reproducible-research.Rmd | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/dynamic-documents.Rmd b/dynamic-documents.Rmd index 126869e..10e663c 100644 --- a/dynamic-documents.Rmd +++ b/dynamic-documents.Rmd @@ -1,3 +1,4 @@ # Dynamic Documents with R Markdown +... diff --git a/reproducible-research.Rmd b/reproducible-research.Rmd index 417f03a..2626085 100644 --- a/reproducible-research.Rmd +++ b/reproducible-research.Rmd @@ -1,3 +1,3 @@ # Reproducible Research with R Markdown - +... 
From e856f927cae0037757edeb230804bf4df73e2ecd Mon Sep 17 00:00:00 2001 From: hadley Date: Fri, 6 May 2016 09:21:24 -0500 Subject: [PATCH 08/19] More on model vis --- model-vis.Rmd | 162 +++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 148 insertions(+), 14 deletions(-) diff --git a/model-vis.Rmd b/model-vis.Rmd index 3f14a46..46275c6 100644 --- a/model-vis.Rmd +++ b/model-vis.Rmd @@ -14,8 +14,44 @@ In this chapter we will explore model visualisation from two different sides: We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. +Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you undertand what your model is telling you. + +Focus on constructing models that help you better understand the data. This will generally lead to models that predict better. But you have to beware of overfitting the data - in the next section we'll discuss some formal methods. But a healthy dose of scepticism is also a powerful: do you believe that a pattern you see in your sample is going to generalise to a wider population? + +Transition from implicit knowledge in your head and in data to explicit knowledge in the model. In other words, you want to make explicit your knowledge of the data and capture it explicitly in a model. This makes it easier to apply to new domains, and easier for others to use. But you must always remember that your knowledge is incomplete. + +For very large and complex datasets this is going to be a lot of + In the next chapter, you'll also learn about how to visualisation the model-level summaries, and the model parameters. +```{r} +# Helper functions +add_predictions <- function(data, ...) { + models <- list(...) + for (nm in names(models)) { + data[[nm]] <- predict(models[[nm]], data) + } + data +} + +add_residuals <- function(data, ...) { + models <- list(...) + + for (nm in names(models)) { + y <- eval(predictor(models[[nm]]), data) + yhat <- predict(models[[nm]], data) + + data[[nm]] <- y - yhat + } + data +} + +predictor <- function(model) { + terms(model)[[2]] +} +``` + + ## Residuals To motivate the use of models we're going to start with an interesting pattern from the NYC flights dataset: the number of flights per day. @@ -39,20 +75,22 @@ Understand this pattern is challenging because there's a very strong day-of-week ```{r} daily <- daily %>% mutate(wday = wday(date, label = TRUE)) - +library(lvplot) ggplot(daily, aes(wday, n)) + geom_boxplot() ``` -The explanation for the low number of flights on Saturdays is because this dataset only has departures: we're only seeing people leaving New York. The majority of air travellers are travelling for business, not pleasure, and most people avoid leaving on the weekend. (The explanation for Sunday is that sometimes you need to be somewhere for a meeting on Monday morning and you have to leave the day before to get there.) +Why are there so few flights on Saturdays? My hypthosis is that most travel is for business, and you generally don't want to spend all of Sunday away from home. Sunday is in between Saturday and Monday because sometimes you have to leave Sunday night in order to arrive in time for a meeting on Monday morning. 
One way to remove this strong pattern is to fit a model that "explains" the day of week effect, and then look at the residuals: ```{r} mod <- lm(n ~ wday, data = daily) -daily$n_resid <- resid(mod) +daily <- daily %>% add_residuals(n_resid = mod) -ggplot(daily, aes(date, n_resid)) + +daily %>% + ggplot(aes(date, n_resid)) + + geom_hline(yintercept = 0, size = 2, colour = "white") + geom_line() ``` @@ -62,7 +100,6 @@ Note the change in the y-axis: now we are seeing the deviation from the expected public holidays, you might spot New Year's day, July 4th, Thanksgiving and Christmas. There are some others that dont' seem to correspond to - ```{r} daily %>% filter(n_resid < -100) ``` @@ -79,10 +116,11 @@ We'll tackle the day of week effect first. Let's start by tweaking our plot draw ```{r} ggplot(daily, aes(date, n_resid, colour = wday)) + - geom_line() + geom_hline(yintercept = 0, size = 2, colour = "white") + + geom_line() ``` -This makes it clear that the problem with our model is Saturdays: it seems like during some there are more flights on Saturdays than we expect, and during Fall there are fewer. I suspect this is because of summer holidays: many people going on holiday in the summer, and people don't mind travelling on Saturdays for vacation. +This makes it clear that the problem with our model is mostly Saturdays: it seems like during some there are more flights on Saturdays than we expect, and during Fall there are fewer. I suspect this is because of summer holidays: many people going on holiday in the summer, and people don't mind travelling on Saturdays for vacation. Let's zoom in on that pattern, this time looking at the raw numbers: @@ -90,8 +128,8 @@ Let's zoom in on that pattern, this time looking at the raw numbers: daily %>% filter(wday == "Sat") %>% ggplot(aes(date, n)) + - geom_line() + - scale_x_datetime(date_breaks = "1 month", date_labels = "%d-%b") + geom_line() + + scale_x_datetime(date_breaks = "1 month", date_labels = "%d-%b") ``` So it looks like summer holidays is from early June to late August. And that seems to line up fairly well with the state's school holidays : Jun 26 - Sep 9. So lets add a "school" variable to attemp to control for that. @@ -110,25 +148,62 @@ daily %>% scale_x_datetime(date_breaks = "1 month", date_labels = "%d-%b") ``` -There are many ways we could incorporate this term into our model, but I'm going to do something quick-and-dirty: I'll use it as an interaction with `wday`. This is overkill because we don't have any evidence to suggest that the other days vary in the same way as a Saturdays, but so we end up overspending our degrees of freedom. +It's useful to see how this new variable affects the other days of the week: -mean vs. median. +```{r} +daily %>% + ggplot(aes(wday, n, colour = school)) + + geom_boxplot() +``` + +It looks like there is significant variation, so fitting a separate day of week effect for each term is reasonable. This improves our model, but not as much as we might hope: + +```{r} +mod2 <- lm(n ~ wday * school, data = daily) +daily$n_resid2 <- resid(mod2) + +ggplot(daily, aes(date, n_resid2)) + + geom_line() +``` + +That's because this model is basically calculating an average for each combination of wday and school term. We have a lot of big outliers, so they tend to drag the mean far away from the typical value. 
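As a toy illustration of that sensitivity (made-up numbers, not the flights data): a single outlier drags the mean a long way from the typical value, while the median barely moves.

```{r}
x <- c(95, 100, 105, 110, 400)  # one extreme outlier
mean(x)    # pulled towards the outlier: 162
median(x)  # stays near the typical value: 105
```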
+ +```{r} +mean <- daily %>% + group_by(wday, school) %>% + summarise(n = mean(n)) + + +daily %>% + ggplot(aes(wday, n, colour = school)) + + geom_boxplot() + + geom_point(data = mean, size = 3, shape = 17, position = position_dodge(width = 0.75)) + +``` + +We can reduce this problem by switch to a robust model, fit by `MASS::rlm()`. A robust model is a variation of the linear which you can think of a fitting medians, instead of means. This greatly reduces the impact of the outliers on our estimates, and gives a result that does a good job of removing the day of week pattern: ```{r} mod2 <- MASS::rlm(n ~ wday * school, data = daily) daily$n_resid2 <- resid(mod2) ggplot(daily, aes(date, n_resid2)) + - # geom_line(aes(y = n_resid), colour = "red") + + geom_hline(yintercept = 0, size = 2, colour = "white") + geom_line() ``` -### Exercises +It's now much easier to see the long term trend, and the positive and negative outliers. +### Exercises 1. Use your google sleuthing skills to brainstorm why there were fewer than expected flights on Jan 20, May 26, and Sep 9. (Hint: they all have the - same explanation.) + same explanation.) How would these days generalise to another year? + +1. What do the days with high positive residuals represent? + +1. What happens if you fit a day of week effect that varies by month? + Why is this not very helpful? 1. Above we made the hypothesis that people leaving on Sundays are more likely to be business travellers who need to be somewhere on Monday. @@ -136,6 +211,11 @@ ggplot(daily, aes(date, n_resid2)) + if it's true, you'd expect to see more Sunday flights to places that are far away. +1. It's a little frustrating that Sunday and Saturday are on separate ends + of the plot. Write a small function to set the levels of the factor so + that the week starts on Monday. + + ## Predictions Focus on predictions from a model because this works for any type of model. Visualising parameters can also be useful, but tends to be most useful when you have many similar models. Visualising predictions works regardless of the model family. @@ -145,3 +225,57 @@ Focus on predictions from a model because this works for any type of model. Visu ``` Visualising high-dimensional models is challenging. You'll need to partition off a useable slice at a time. 
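The core pattern is the same for any model class, though: build a data frame of predictor values, then ask the model for predictions. Here is a minimal sketch with a toy model (base R's `predict()` generic; not the flights data):

```{r}
toy_mod <- lm(mpg ~ wt, data = mtcars)
new <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 5))
new$pred <- predict(toy_mod, newdata = new)  # predictions on the new grid
new
```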
+ + +```{r} +library(tidyr) + +date_vars <- function(df) { + df %>% mutate( + school = cut(date, + breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)), + labels = c("spring", "summer", "fall") + ), + wday = wday(date, label = TRUE) + ) +} + +daily %>% + expand(date) %>% + date_vars() %>% + add_predictions(pred = mod2) %>% + ggplot(aes(date, pred)) + + geom_line() + +daily %>% + expand(date, wday = "Sat", school = "spring") %>% + add_predictions(pred = mod2) %>% + ggplot(aes(date, pred)) + + geom_line() + + +daily %>% + expand(wday, school) %>% + add_predictions(pred = mod2) %>% + ggplot(aes(wday, pred, colour = school)) + + geom_point() + + geom_line(aes(group = school)) + +``` + +## Delays and weather + +```{r} +hourly <- flights %>% + group_by(origin, time_hour) %>% + summarise( + delay = mean(dep_delay, na.rm = TRUE) + ) %>% + inner_join(weather, by = c("origin", "time_hour")) + +ggplot(hourly, aes(time_hour, delay)) + + geom_point() + +ggplot(hourly, aes(hour(time_hour), delay)) + + geom_boxplot(aes(group = hour(time_hour))) +``` From 93ebaec3632efda54e482df4a04f4a06fc03061d Mon Sep 17 00:00:00 2001 From: hadley Date: Mon, 9 May 2016 07:39:16 -0500 Subject: [PATCH 09/19] Install modelr package --- DESCRIPTION | 2 ++ 1 file changed, 2 insertions(+) diff --git a/DESCRIPTION b/DESCRIPTION index 3c19c7a..a995447 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -20,6 +20,7 @@ Imports: jsonlite, Lahman, lubridate, + modelr, knitr, maps, microbenchmark, @@ -35,6 +36,7 @@ Imports: Remotes: gaborcsardi/rcorpora, garrettgman/DSR, + hadley/modelr, hadley/purrr, hadley/stringr, hadley/ggplot2, From 92478ae037ca617271993d5f6c9f3313cdc489fc Mon Sep 17 00:00:00 2001 From: hadley Date: Mon, 9 May 2016 07:39:26 -0500 Subject: [PATCH 10/19] Couple of model assessment exercises --- model-assess.Rmd | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/model-assess.Rmd b/model-assess.Rmd index 48ebb0e..3ef230e 100644 --- a/model-assess.Rmd +++ b/model-assess.Rmd @@ -85,3 +85,10 @@ ggplot(, aes(mse)) + geom_histogram(binwidth = 0.25) + geom_vline(xintercept = base_mse, colour = "red") ``` + + +### Exercises + +1. Given a list of formulas, use purr to fit them. + +1. Given a list of model functions, and parameters, use purrr to fit them. From 46d80495fd97669e78f8764340d3f0dd31596f77 Mon Sep 17 00:00:00 2001 From: hadley Date: Mon, 9 May 2016 09:33:30 -0500 Subject: [PATCH 11/19] More on model-vis (prediction) --- model-vis.Rmd | 184 +++++++++++++++++++++++++++----------------------- 1 file changed, 101 insertions(+), 83 deletions(-) diff --git a/model-vis.Rmd b/model-vis.Rmd index 46275c6..429d092 100644 --- a/model-vis.Rmd +++ b/model-vis.Rmd @@ -2,6 +2,7 @@ library(broom) library(ggplot2) library(dplyr) +library(lubridate) ``` # Model visualisation @@ -20,35 +21,14 @@ Focus on constructing models that help you better understand the data. This will Transition from implicit knowledge in your head and in data to explicit knowledge in the model. In other words, you want to make explicit your knowledge of the data and capture it explicitly in a model. This makes it easier to apply to new domains, and easier for others to use. But you must always remember that your knowledge is incomplete. -For very large and complex datasets this is going to be a lot of +For very large and complex datasets this is going to be a lot of work. 
There are certainly alternative approaches - a more machine learning approach is simply to focus on improving the predictive ability of the model, being careful to fairly assess it (i.e. not assessing the model on the data that was used to train it). These approaches tend to produce black boxes - i.e. the model does a really good job, but you don't know why. This is fine, but the main problem is that you can't apply your real world knowledge to the model to think about whether or not it's likely to work in the long term, as fundamentals change. For most real models, I'd expect you to use some combination of this approach and an ML model building approach. If prediction is important, get to a good point, and then use visualisation to understand the most important parts of the model.

In the next chapter, you'll also learn about how to visualise the model-level summaries, and the model parameters.

To do this we're going to use some helper functions from the modelr package. This package provides some wrappers around the traditional base R modelling functions that make them easier to use in data manipulation pipelines. Currently at but will need to be on CRAN before the book is published.

```{r}
library(modelr)
```

## Residuals

To motivate the use of models we're going to start with an interesting pattern from the NYC flights dataset: the number of flights per day.

```{r}
library(nycflights13)
library(lubridate)
library(dplyr)

daily <- flights %>%
  mutate(date = make_datetime(year, month, day)) %>%
  group_by(date) %>%
  summarise(n = n())

ggplot(daily, aes(date, n)) +
  geom_line()
```

Understanding this pattern is challenging because there's a very strong day-of-week effect that dominates the subtler patterns:

```{r}
daily <- daily %>%
  mutate(wday = wday(date, label = TRUE))
ggplot(daily, aes(wday, n)) +
  geom_boxplot()
```

There are fewer flights on weekends because a very large proportion of travel is for business. You might sometimes have to leave on Sunday for an early flight, but it's very rare that you'd leave on Saturday: you'd much rather be home with your family.

One way to remove this strong pattern is to fit a model that "explains" (i.e. attempts to predict) the day of week effect, and then look at the residuals:

```{r}
mod <- lm(n ~ wday, data = daily)
daily <- daily %>% add_residuals(n_resid = mod)

daily %>%
  ggplot(aes(date, n_resid)) +
  geom_hline(yintercept = 0, size = 2, colour = "white") +
  geom_line()
```

Note the change in the y-axis: now we are seeing the deviation from the expected number of flights, given the day of week.
This plot is interesting because now that we've removed much of the large day-of-week effect, we can see some of the subtler patterns that remain:

1. Our day of week adjustment seems to fail starting around June: you can
   still see a strong regular pattern that our model hasn't removed. Drawing
   a plot with one line for each day of the week makes the cause easier
   to see:

    ```{r}
    ggplot(daily, aes(date, n_resid, colour = wday)) +
      geom_hline(yintercept = 0, size = 2, colour = "white") +
      geom_line()
    ```

    The problem appears to be Saturdays: it seems like during summer there are
    more flights on Saturdays than we expect, and during Fall there are fewer.
    I suspect this is because of summer holidays: many people go on holiday
    in the summer, and people don't mind travelling on Saturdays for vacation.
    (This doesn't, however, explain why there are more Saturday flights in
    spring than fall.)

1. There are some days with far fewer flights than expected:

    ```{r}
    daily %>% filter(n_resid < -100)
    ```

    If you're familiar with American public holidays, you might spot New Year's
    day, July 4th, Thanksgiving and Christmas. There are some others that don't
    seem to correspond immediately to public holidays. You'll work on those
    in the exercise below.

1. There seems to be some smoother long-term trend over the course of a year.
   We can highlight that trend with `geom_smooth()`:

    ```{r}
    daily %>%
      ggplot(aes(date, n_resid)) +
      geom_hline(yintercept = 0, size = 2, colour = "white") +
      geom_line(colour = "grey50") +
      geom_smooth(se = FALSE, span = 0.20)
    ```

    There are fewer flights in January (and December), and more in summer
    (May-Sep). We can't do much more with this trend than brainstorm possible
    explanations because we only have a single year's worth of data.

We'll tackle the day of week effect first.
Let's zoom in on Saturdays, going back to raw numbers: ```{r} daily %>% filter(wday == "Sat") %>% ggplot(aes(date, n)) + geom_line() + - scale_x_datetime(date_breaks = "1 month", date_labels = "%d-%b") + geom_point(alpha = 1/3) + + scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b") ``` -So it looks like summer holidays is from early June to late August. And that seems to line up fairly well with the state's school holidays : Jun 26 - Sep 9. So lets add a "school" variable to attemp to control for that. +So it looks like summer holidays are from early June to late August. That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break is Jun 26 - Sep 9. So lets add a "term" variable to attemp to control for that. I manually tweaked the dates to get nice breaks in the plot. ```{r} daily <- daily %>% - mutate(school = cut(date, - breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)), + mutate(term = cut(date, + breaks = as.POSIXct(ymd(20130101, 20130601, 20130825, 20140101)), labels = c("spring", "summer", "fall") )) daily %>% filter(wday == "Sat") %>% - ggplot(aes(date, n, colour = school)) + + ggplot(aes(date, n, colour = term)) + + geom_point(alpha = 1/3) + geom_line() + - scale_x_datetime(date_breaks = "1 month", date_labels = "%d-%b") + scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b") ``` It's useful to see how this new variable affects the other days of the week: ```{r} daily %>% - ggplot(aes(wday, n, colour = school)) + + ggplot(aes(wday, n, colour = term)) + geom_boxplot() ``` -It looks like there is significant variation, so fitting a separate day of week effect for each term is reasonable. This improves our model, but not as much as we might hope: +It looks like there is significant variation across the terms, so fitting a separate day of week effect for each term is reasonable. This improves our model, but not as much as we might hope: ```{r} -mod2 <- lm(n ~ wday * school, data = daily) +mod2 <- lm(n ~ wday * term, data = daily) daily$n_resid2 <- resid(mod2) -ggplot(daily, aes(date, n_resid2)) + - geom_line() +ggplot(daily, aes(date)) + + geom_line(aes(y = n_resid, colour = "mod1")) + + geom_line(aes(y = n_resid2, colour = "mod2")) + + scale_colour_manual(values = c(mod1 = "grey50", mod2 = "black")) ``` That's because this model is basically calculating an average for each combination of wday and school term. We have a lot of big outliers, so they tend to drag the mean far away from the typical value. ```{r} -mean <- daily %>% - group_by(wday, school) %>% - summarise(n = mean(n)) - - -daily %>% - ggplot(aes(wday, n, colour = school)) + - geom_boxplot() + - geom_point(data = mean, size = 3, shape = 17, position = position_dodge(width = 0.75)) +middles <- daily %>% + group_by(wday, term) %>% + summarise( + mean = mean(n), + median = median(n) + ) +middles %>% + ggplot(aes(wday, colour = term)) + + geom_point(aes(y = mean, shape = "mean")) + + geom_point(aes(y = median, shape = "median")) + + facet_wrap(~ term) ``` -We can reduce this problem by switch to a robust model, fit by `MASS::rlm()`. A robust model is a variation of the linear which you can think of a fitting medians, instead of means. This greatly reduces the impact of the outliers on our estimates, and gives a result that does a good job of removing the day of week pattern: +We can reduce this problem by switch to a robust model fitted by `MASS::rlm()`. 
A robust model is a variation of the linear model which you can think of a fitting medians, instead of means (it's a bit more complicated than that, but that's a reasonable intuition). This greatly reduces the impact of the outliers on our estimates, and gives a result that does a good job of removing the day of week pattern: ```{r} -mod2 <- MASS::rlm(n ~ wday * school, data = daily) +mod2 <- MASS::rlm(n ~ wday * term, data = daily) daily$n_resid2 <- resid(mod2) ggplot(daily, aes(date, n_resid2)) + geom_hline(yintercept = 0, size = 2, colour = "white") + - geom_line() + geom_line() + + geom_smooth(span = 0.25, se = FALSE) ``` -It's now much easier to see the long term trend, and the positive and negative outliers. +It's now much easier to see the long-term trend, and the positive and negative outliers. ### Exercises @@ -200,7 +205,21 @@ It's now much easier to see the long term trend, and the positive and negative o expected flights on Jan 20, May 26, and Sep 9. (Hint: they all have the same explanation.) How would these days generalise to another year? -1. What do the days with high positive residuals represent? +1. What do the three days with high positive residuals represent? + How would these days generalise to another year? + + ```{r} + daily %>% filter(n_resid2 > 80) + ``` + +1. Create a new variable that splits the `wday` variable in to terms only + for Saturdays, i.e. it should have `Sat-summer`, `Sat-spring`, + `Sat-fall`. How does this model compare with the model with every + combination of `wday` and `term`? + +1. Create a new wday variable that combines the day of week, term + (for Saturdays), and public holidays. What do the residuals of + that model look like? 1. What happens if you fit a day of week effect that varies by month? Why is this not very helpful? @@ -215,7 +234,6 @@ It's now much easier to see the long term trend, and the positive and negative o of the plot. Write a small function to set the levels of the factor so that the week starts on Monday. - ## Predictions Focus on predictions from a model because this works for any type of model. Visualising parameters can also be useful, but tends to be most useful when you have many similar models. Visualising predictions works regardless of the model family. @@ -232,7 +250,7 @@ library(tidyr) date_vars <- function(df) { df %>% mutate( - school = cut(date, + term = cut(date, breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)), labels = c("spring", "summer", "fall") ), @@ -248,18 +266,18 @@ daily %>% geom_line() daily %>% - expand(date, wday = "Sat", school = "spring") %>% + expand(date, wday = "Sat", term = "spring") %>% add_predictions(pred = mod2) %>% ggplot(aes(date, pred)) + geom_line() daily %>% - expand(wday, school) %>% + expand(wday, term) %>% add_predictions(pred = mod2) %>% - ggplot(aes(wday, pred, colour = school)) + + ggplot(aes(wday, pred, colour = term)) + geom_point() + - geom_line(aes(group = school)) + geom_line(aes(group = term)) ``` From 759a767ab560eb95c80f197997fa7c51c2ee9e95 Mon Sep 17 00:00:00 2001 From: adi pradhan Date: Tue, 10 May 2016 16:19:49 -0400 Subject: [PATCH 12/19] 'plce' corrected to 'place' --- visualize.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/visualize.Rmd b/visualize.Rmd index 484f604..43a403f 100644 --- a/visualize.Rmd +++ b/visualize.Rmd @@ -156,7 +156,7 @@ ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue") ``` -Here, the color doesn't convey information about a variable. 
It only changes the appearance of the plot. To set an aesthetic manually, do not plce it in the `aes()` function. Call the aesthetic by name as an argument of your geom function. Then pass the aesthetic a value that R will recognize, such as +Here, the color doesn't convey information about a variable. It only changes the appearance of the plot. To set an aesthetic manually, do not place it in the `aes()` function. Call the aesthetic by name as an argument of your geom function. Then pass the aesthetic a value that R will recognize, such as * the name of a color as a character string * the size of a point as a cex expansion factor (see `?par`) From f62008d075f4b38fcf944037217d8d88f1de30a5 Mon Sep 17 00:00:00 2001 From: hadley Date: Tue, 10 May 2016 20:21:23 -0500 Subject: [PATCH 13/19] More on model vis --- model-vis.Rmd | 166 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 122 insertions(+), 44 deletions(-) diff --git a/model-vis.Rmd b/model-vis.Rmd index 429d092..67f8b72 100644 --- a/model-vis.Rmd +++ b/model-vis.Rmd @@ -3,6 +3,8 @@ library(broom) library(ggplot2) library(dplyr) library(lubridate) +library(tidyr) +library(nycflights13) ``` # Model visualisation @@ -15,6 +17,8 @@ In this chapter we will explore model visualisation from two different sides: We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. +Centered around looking at residuals and looking at predictions. You'll see those here applied to linear models (and some minor variations), but it's a flexible technique since every model can generate predictions and residuals. + Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you undertand what your model is telling you. Focus on constructing models that help you better understand the data. This will generally lead to models that predict better. But you have to beware of overfitting the data - in the next section we'll discuss some formal methods. But a healthy dose of scepticism is also a powerful: do you believe that a pattern you see in your sample is going to generalise to a wider population? @@ -102,7 +106,7 @@ Note the change in the y-axis: now we are seeing the deviation from the expected If you're familiar with American public holidays, you might spot New Year's day, July 4th, Thanksgiving and Christmas. There are some others that don't - seem to correspond immediately to public holidays. You'll figure those out + seem to correspond immediately to public holidays. You'll work on those in the exercise below. 1. There seems to be some smoother long term trend over the course of a year. @@ -131,7 +135,7 @@ daily %>% scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b") ``` -So it looks like summer holidays are from early June to late August. That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break is Jun 26 - Sep 9. So lets add a "term" variable to attemp to control for that. I manually tweaked the dates to get nice breaks in the plot. +So it looks like summer holidays are from early June to late August. 
That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break is Jun 26 - Sep 9. So lets add a "term" variable to attemp to control for that. ```{r} daily <- daily %>% @@ -148,6 +152,8 @@ daily %>% scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b") ``` +(I manually tweaked the dates to get nice breaks in the plot.) + It's useful to see how this new variable affects the other days of the week: ```{r} @@ -179,17 +185,18 @@ middles <- daily %>% ) middles %>% - ggplot(aes(wday, colour = term)) + - geom_point(aes(y = mean, shape = "mean")) + - geom_point(aes(y = median, shape = "median")) + + ggplot(aes(wday)) + + geom_linerange(aes(ymin = mean, ymax = median), colour = "grey70") + + geom_point(aes(y = mean, colour = "mean")) + + geom_point(aes(y = median, colour = "median")) + facet_wrap(~ term) ``` -We can reduce this problem by switch to a robust model fitted by `MASS::rlm()`. A robust model is a variation of the linear model which you can think of a fitting medians, instead of means (it's a bit more complicated than that, but that's a reasonable intuition). This greatly reduces the impact of the outliers on our estimates, and gives a result that does a good job of removing the day of week pattern: +We can reduce this problem by switching to a robust model fitted by `MASS::rlm()`. A robust model is a variation of the linear model which you can think of a fitting medians, instead of means (it's a bit more complicated than that, but that's a reasonable intuition). This greatly reduces the impact of the outliers on our estimates, and gives a result that does a good job of removing the day of week pattern: -```{r} +```{r, warn=FALSE} mod2 <- MASS::rlm(n ~ wday * term, data = daily) -daily$n_resid2 <- resid(mod2) +daily <- daily %>% add_residuals(n_resid2 = mod2) ggplot(daily, aes(date, n_resid2)) + geom_hline(yintercept = 0, size = 2, colour = "white") + @@ -212,10 +219,10 @@ It's now much easier to see the long-term trend, and the positive and negative o daily %>% filter(n_resid2 > 80) ``` -1. Create a new variable that splits the `wday` variable in to terms only - for Saturdays, i.e. it should have `Sat-summer`, `Sat-spring`, - `Sat-fall`. How does this model compare with the model with every - combination of `wday` and `term`? +1. Create a new variable that splits the `wday` variable into terms, but only + for Saturdays, i.e. it should have `Thurs`, `Fri`, but `Sat-summer`, + `Sat-spring`, `Sat-fall`. How does this model compare with the model with + every combination of `wday` and `term`? 1. Create a new wday variable that combines the day of week, term (for Saturdays), and public holidays. What do the residuals of @@ -238,49 +245,109 @@ It's now much easier to see the long-term trend, and the positive and negative o Focus on predictions from a model because this works for any type of model. Visualising parameters can also be useful, but tends to be most useful when you have many similar models. Visualising predictions works regardless of the model family. -```{r} - -``` - Visualising high-dimensional models is challenging. You'll need to partition off a useable slice at a time. +### `rlm()` vs `lm()` + +Let's start by exploring the difference between the `lm()` and `rlm()` predictions for the day of week effects. 
We'll first re-fit the models, just so we have them handy:

```{r}
mod1 <- lm(n ~ wday * term, data = daily)
mod2 <- MASS::rlm(n ~ wday * term, data = daily)
```

Next, we need to generate a grid of values to compute predictions for. The easiest way to do that is to use `tidyr::expand()`. Its first argument is a data frame, and for each subsequent argument it finds the unique values and then generates all combinations:

```{r}
grid <-
  daily %>%
  tidyr::expand(wday, term)

grid
```

Next we add predictions. We'll use `modelr::add_predictions()`, which works in exactly the same way as `add_residuals()` but just computes predictions (so it doesn't need a data frame that contains the response variable):

```{r}
grid <-
  grid %>%
  add_predictions(linear = mod1, robust = mod2)
grid
```

And then we plot the predictions. Plotting predictions is usually the hardest bit and you'll need to try a few times before you get a plot that is most informative. Depending on your model it's quite possible that you'll need multiple plots to fully convey what the model is telling you about the data. Here's my attempt - it took me a few tries before I got something that I was happy with.

```{r}
grid %>%
  ggplot(aes(wday)) +
  geom_linerange(aes(ymin = linear, ymax = robust), colour = "grey70") +
  geom_point(aes(y = linear, colour = "linear")) +
  geom_point(aes(y = robust, colour = "robust")) +
  facet_wrap(~ term)
```

### Exercises

1. How does the plot of model coefficients compare to the plot of means
   and medians computed "by hand" in the previous chapter? Create a plot
   that highlights the differences and similarities.
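One way to start that exercise is to pull the coefficients of the two models above into a data frame that ggplot2 can use (a sketch; the `coefs` name is just illustrative):

```{r}
coefs <- data.frame(
  term   = names(coef(mod1)),
  linear = unname(coef(mod1)),  # lm() estimates
  robust = unname(coef(mod2)),  # rlm() estimates
  row.names = NULL
)
head(coefs)
```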
+ ## Delays and weather ```{r} @@ -291,9 +358,20 @@ hourly <- flights %>% ) %>% inner_join(weather, by = c("origin", "time_hour")) -ggplot(hourly, aes(time_hour, delay)) + - geom_point() +# ggplot(hourly, aes(time_hour, delay)) + +# geom_point() +# +# ggplot(hourly, aes(hour(time_hour), sign(delay) * sqrt(abs(delay)))) + +# geom_boxplot(aes(group = hour(time_hour))) +# +# hourly %>% +# filter(wind_speed < 999) %>% +# ggplot(aes(temp, delay)) + +# geom_point() + +# geom_smooth() -ggplot(hourly, aes(hour(time_hour), delay)) + - geom_boxplot(aes(group = hour(time_hour))) ``` + +## Learning more + + From e8473755c09c0797afb949355d1fc70ad3c41b2b Mon Sep 17 00:00:00 2001 From: hadley Date: Mon, 16 May 2016 08:10:55 -0500 Subject: [PATCH 14/19] Moved data to tidyr --- DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index a995447..6c27730 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -11,7 +11,6 @@ Imports: bookdown, broom, dplyr, - DSR, ggplot2, hexbin, htmltools, @@ -38,6 +37,7 @@ Remotes: garrettgman/DSR, hadley/modelr, hadley/purrr, + hadley/tidyr, hadley/stringr, hadley/ggplot2, yihui/knitr, From 1ac7131bbcd2acb94ab58e6403a6c88d68652461 Mon Sep 17 00:00:00 2001 From: hadley Date: Mon, 16 May 2016 08:11:16 -0500 Subject: [PATCH 15/19] More model-vis brainstorming --- model-vis.Rmd | 133 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 97 insertions(+), 36 deletions(-) diff --git a/model-vis.Rmd b/model-vis.Rmd index 67f8b72..0925721 100644 --- a/model-vis.Rmd +++ b/model-vis.Rmd @@ -17,16 +17,31 @@ In this chapter we will explore model visualisation from two different sides: We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. +What is a good model? We'll think about that more in the next chapter. For now, a good model captures the majority of the patterns that are generated by the underlying mechanism of interest, and captures few patterns that are not generated by that mechanism. Another way to frame that is that you want your model to be good at inference, not just description. Inference is one of the most important parts of a model - you want to not just make statements about the data you have observed, but data that you have not observed (like things that will happen in the future). + Centered around looking at residuals and looking at predictions. You'll see those here applied to linear models (and some minor variations), but it's a flexible technique since every model can generate predictions and residuals. Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you undertand what your model is telling you. Focus on constructing models that help you better understand the data. This will generally lead to models that predict better. But you have to beware of overfitting the data - in the next section we'll discuss some formal methods. But a healthy dose of scepticism is also a powerful: do you believe that a pattern you see in your sample is going to generalise to a wider population? -Transition from implicit knowledge in your head and in data to explicit knowledge in the model. In other words, you want to make explicit your knowledge of the data and capture it explicitly in a model. 
This makes it easier to apply to new domains, and easier for others to use. But you must always remember that your knowledge is incomplete. Subtract patterns from the data, and add patterns to the model.

When do you stop?

> A long time ago in art class, my teacher told me "An artist needs to know
> when a piece is done. You can't tweak something into perfection - wrap it up.
> If you don't like it, do it over again. Otherwise begin something new". Later
> in life, I heard "A poor seamstress makes many mistakes. A good seamstress
> works hard to correct those mistakes. A great seamstress isn't afraid to
> throw out the garment and start over."

-- Reddit user Broseidon241, https://www.reddit.com/r/datascience/comments/4irajq/mistakes_made_by_beginningaspiring_data_scientists/

For very large and complex datasets this is going to be a lot of work. There are certainly alternative approaches - a more machine learning approach is simply to focus on improving the predictive ability of the model, being careful to fairly assess it (i.e. not assessing the model on the data that was used to train it). These approaches tend to produce black boxes - i.e. the model does a really good job, but you don't know why. This is fine, but the main problem is that you can't apply your real world knowledge to the model to think about whether or not it's likely to work in the long term, as fundamentals change. For most real models, I'd expect you to use some combination of this approach and an ML model building approach. If prediction is important, get to a good point, and then use visualisation to understand the most important parts of the model.

In the next chapter, you'll also learn about how to visualise the model-level summaries, and the model parameters.

To do this we're going to use some helper functions from the modelr package. This package provides some wrappers around the traditional base R modelling functions that make them easier to use in data manipulation pipelines. Currently at but will need to be on CRAN before the book is published.

```{r}
library(modelr)
```

In the course of modelling, you'll often discover data quality problems. Maybe a missing value is recorded as 999. Whenever you discover a problem like this, you'll need to review and update your import scripts. You'll often discover a problem with one variable, but you'll need to think about it for all variables. This is often frustrating, but it's typical.

## Residuals

So it looks like summer holidays are from early June to late August. That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break is Jun 26 - Sep 9. So let's add a "term" variable to attempt to control for that.
```{r} -daily <- daily %>% - mutate(term = cut(date, - breaks = as.POSIXct(ymd(20130101, 20130601, 20130825, 20140101)), +term <- function(date) { + cut(date, + breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)), labels = c("spring", "summer", "fall") - )) + ) +} + +daily <- daily %>% mutate(term = term(date)) daily %>% filter(wday == "Sat") %>% @@ -152,7 +171,7 @@ daily %>% scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b") ``` -(I manually tweaked the dates to get nice breaks in the plot.) +(I manually tweaked the dates to get nice breaks in the plot. Using a visualisation to help you understand what your function is doing is a really powerful and general technique.) It's useful to see how this new variable affects the other days of the week: @@ -195,17 +214,18 @@ middles %>% We can reduce this problem by switching to a robust model fitted by `MASS::rlm()`. A robust model is a variation of the linear model which you can think of a fitting medians, instead of means (it's a bit more complicated than that, but that's a reasonable intuition). This greatly reduces the impact of the outliers on our estimates, and gives a result that does a good job of removing the day of week pattern: ```{r, warn=FALSE} -mod2 <- MASS::rlm(n ~ wday * term, data = daily) -daily <- daily %>% add_residuals(n_resid2 = mod2) +mod3 <- MASS::rlm(n ~ wday * term, data = daily) +daily <- daily %>% add_residuals(n_resid3 = mod3) -ggplot(daily, aes(date, n_resid2)) + +ggplot(daily, aes(date, n_resid3)) + geom_hline(yintercept = 0, size = 2, colour = "white") + - geom_line() + - geom_smooth(span = 0.25, se = FALSE) + geom_line() ``` It's now much easier to see the long-term trend, and the positive and negative outliers. +Very common to use residual plots when figuring out if a model is ok. But it's easy to get the impression that there's just one type of residual plot you should do, when in fact there are infinite. + ### Exercises 1. Use your google sleuthing skills to brainstorm why there were fewer than @@ -275,7 +295,7 @@ grid <- grid ``` -And then we plot the predictions. Plotting predictions is usually the hardest bit and you'll need to try a few times before you get a plot that is most informative. Depending on your model it's quite possible that you'll need multiple plots to fully convey what the model is telling you about the data. +And then we plot the predictions. Plotting predictions is usually the hardest bit and you'll need to try a few times before you get a plot that is most informative. Depending on your model it's quite possible that you'll need multiple plots to fully convey what the model is telling you about the data. Here's my attempt - it took me a few tries before I got something that I was happy with. ```{r} grid %>% @@ -292,10 +312,7 @@ grid %>% daily %>% expand(date) %>% mutate( - term = cut(date, - breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)), - labels = c("spring", "summer", "fall") - ), + term = term(date), wday = wday(date, label = TRUE) ) %>% add_predictions(pred = mod2) %>% @@ -305,6 +322,24 @@ daily %>% If you're experimenting with many models and many visualisations, it's a good idea to bundle the creation of variables up into a function so there's no chance of accidentally applying a different transformation in different places. 
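For example (a sketch building on the `term()` helper defined above; the `add_date_vars()` name is made up for illustration):

```{r}
add_date_vars <- function(df) {
  df %>% mutate(
    term = term(date),               # same term() helper as above
    wday = wday(date, label = TRUE)  # same day-of-week coding everywhere
  )
}

# Every data frame then gets identical transformations, e.g.:
# daily %>% expand(date) %>% add_date_vars() %>% add_predictions(pred = mod2)
```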
+Another option is to wrap it into the model formula:
+
+```{r}
+term <- function(date) {
+ cut(date,
+ breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)),
+ labels = c("spring", "summer", "fall")
+ )
+}
+
+mod3 <- lm(n ~ wday(date, label = TRUE) * term(date), data = daily)
+daily %>%
+ expand(date) %>%
+ add_predictions(pred = mod3)
+```
+
+I think this is fine to do provided that you've carefully checked that the functions do what you think they do (i.e. with a visualisation).
+
### Nested variables

Another case that occasionally crops up is nested variables: you have an identifier that is locally unique, not globally unique. For example you might have this data about students in schools:

@@ -342,36 +377,62 @@ grid %>%
 geom_line(aes(group = Subject))
```

+### Interpolation vs extrapolation
+
+Also want to show "nearby data"
+
### Exercises

1. How does the plot of model coefficients compare to the plot of means
 and medians computed "by hand" in the previous chapter? Create a plot
 that highlights the differences and similarities.

-## Delays and weather
+## Case study: predicting flight delays
+
+We can't predict delays for next year. Why not? Instead we'll focus on predicting the amount that your flight will be delayed if it's leaving soon.
+
+We'll start with some exploratory analysis, and then work on the model:
+
+* time of day
+* weather

```{r}
-hourly <- flights %>%
- group_by(origin, time_hour) %>%
- summarise(
- delay = mean(dep_delay, na.rm = TRUE)
- ) %>%
- inner_join(weather, by = c("origin", "time_hour"))

-# ggplot(hourly, aes(time_hour, delay)) +
-# geom_point()
-#
-# ggplot(hourly, aes(hour(time_hour), sign(delay) * sqrt(abs(delay)))) +
-# geom_boxplot(aes(group = hour(time_hour)))
-#
-# hourly %>%
-# filter(wind_speed < 999) %>%
-# ggplot(aes(temp, delay)) +
-# geom_point() +
-# geom_smooth()
+
+
+delays <- flights %>%
+ mutate(date = make_datetime(year, month, day)) %>%
+ group_by(date) %>%
+ summarise(delay = mean(arr_delay, na.rm = TRUE), cancelled = mean(is.na(dep_time)), n = n())
+
+# delays %>%
+# ggplot(aes(wday(date, label = TRUE), delay)) +
+# geom_boxplot()
+
+delays %>%
+ ggplot(aes(n, delay)) +
+ geom_point() +
+ geom_smooth(se = F)
```

-## Learning more
-
+
+## Linear model extensions
+
+### Non-linearity with splines
+
+help()
+
+### Transformations to "stabilise" variance
+
+glm
+
+Predicting probability of cancellation
+
+### Robustness
+
+### Mixed effects models
+
+### Shrinkage
+
From e0c62570fd1146f0eded7aab30d345a9584cd7ff Mon Sep 17 00:00:00 2001 From: hadley Date: Mon, 16 May 2016 09:58:07 -0500 Subject: [PATCH 16/19] Fill in some missing prediction pieces --- model-vis.Rmd | 109 ++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 84 insertions(+), 25 deletions(-) diff --git a/model-vis.Rmd b/model-vis.Rmd index 0925721..4ebca22 100644 --- a/model-vis.Rmd +++ b/model-vis.Rmd @@ -5,6 +5,7 @@ library(dplyr) library(lubridate) library(tidyr) library(nycflights13) +library(modelr) ``` # Model visualisation @@ -151,7 +152,7 @@ daily %>% scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b") ``` -So it looks like summer holidays are from early June to late August.
That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break is Jun 26 - Sep 9. Few families travel in the fall because of the big Thanksgiving and Christmas holidays. So let's add a "term" variable to attempt to control for that.

```{r}
term <- function(date) {
@@ -267,8 +268,6 @@ Focus on predictions from a model because this works for any type of model. Visu

Visualising high-dimensional models is challenging. You'll need to partition off a usable slice at a time.

-### `rlm()` vs `lm()`
-
Let's start by exploring the difference between the `lm()` and `rlm()` predictions for the day-of-week effects. We'll first re-fit the models, just so we have them handy:

```{r}
@@ -306,6 +305,82 @@ grid %>%
 facet_wrap(~ term)
```

+### Exercises
+
+1. How does the plot of model coefficients compare to the plot of means
+ and medians computed "by hand" in the previous chapter? Create a plot
+ that highlights the differences and similarities.
+
+## Generating prediction grids
+
+### Continuous variables
+
+When you have a continuous variable in the model, rather than using the unique values that you've seen, it's often more useful to generate an evenly spaced grid. One convenient way to do this is with `modelr::seq_range()` which takes a continuous variable, calculates its range, and then generates evenly spaced points between the minimum and maximum.
+
+```{r}
+mod <- MASS::rlm(n ~ wday * date, data = daily)
+
+grid <- daily %>%
+ tidyr::expand(wday, date = seq_range(date, n = 13)) %>%
+ add_predictions(mod = mod)
+
+ggplot(grid, aes(date, mod, colour = wday)) +
+ geom_line() +
+ geom_point()
+```
+
+We're going to be using this pattern for a few examples, so let's wrap it up into a function:
+
+```{r}
+vis_flights <- function(mod) {
+ daily %>%
+ tidyr::expand(wday, date = seq_range(date, n = 13)) %>%
+ add_predictions(mod = mod) %>%
+ ggplot(aes(date, mod, colour = wday)) +
+ geom_line() +
+ geom_point()
+}
+```
+
+This is more useful if you have a model that includes non-linear components. One way to get that is to include non-linear terms like `I(x ^ 2)`, `I(x ^ 3)` etc. You can't just use `x ^ 2` because of the way the modelling algebra works. `x ^ 2` is equivalent to `x * x` which in the modelling algebra is equivalent to `x + x + x:x` which is the same as `x`. This is useful because `(x + y + z)^2` fits all major terms and second-order interactions of x, y, and z.
+
+But rather than using this laborious formulation, a better solution is to use `poly(x, n)` which generates `n` polynomials. (They are Legendre polynomials, which means that each is uncorrelated with any of the previous ones, which also makes model fitting a bit easier.)
+
+```{r}
+MASS::rlm(n ~ wday * poly(date, 5), data = daily) %>% vis_flights()
+```
+
+One problem with polynomials is that they have bad tail behaviour - outside of the range of the data they will rapidly shoot towards either positive or negative infinity. One solution to this is splines.
+
+I'm not going to explain them in any detail here, but they're useful whenever you want to fit irregular patterns.
+
+```{r}
+library(splines)
+MASS::rlm(n ~ wday * ns(date, 5), data = daily) %>% vis_flights()
+```
+
+Other useful arguments to `seq_range()`:
+
+* `pretty = TRUE` will generate a "pretty" sequence, i.e.
something that looks
 nice to the human eye:

 ```{r}
 seq_range(c(0.0123, 0.923423), n = 5)
 seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
 ```

* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
 variable has a long-tailed distribution and you want to focus on generating
 values near the center:

 ```{r}
 x <- rcauchy(100)
 seq_range(x, n = 5)
 seq_range(x, n = 5, trim = 0.10)
 seq_range(x, n = 5, trim = 0.25)
 seq_range(x, n = 5, trim = 0.50)
 ```

### Computed variables

```{r}
daily %>%
 expand(date) %>%
 mutate(
 term = term(date),
 wday = wday(date, label = TRUE)
 ) %>%
 add_predictions(pred = mod2) %>%
 ggplot(aes(date, pred)) +
 geom_line()
```

If you're experimenting with many models and many visualisations, it's a good idea to bundle the creation of variables up into a function so there's no chance of accidentally applying a different transformation in different places.

Another option is to wrap it into the model formula:

```{r}
term <- function(date) {
 cut(date,
 breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)),
 labels = c("spring", "summer", "fall")
 )
}
+wday2 <- function(x) wday(x, label = TRUE)

-mod3 <- lm(n ~ wday(date, label = TRUE) * term(date), data = daily)
+mod3 <- lm(n ~ wday2(date) * term(date), data = daily)
daily %>%
 expand(date) %>%
 add_predictions(pred = mod3)
```

-I think this is fine to do provided that you've carefully checked that the functions do what you think they do (i.e. with a visualisation).
+I think this is fine to do provided that you've carefully checked that the functions do what you think they do (i.e. with a visualisation). The main disadvantage is that if you're looking at the coefficients, their values are longer and harder to read. (But this is a general problem with the way that linear models report categorical coefficients in R, not a specific problem with this case.)

### Nested variables

@@ -446,29 +438,13 @@ The student id only makes sense in the context of the school: it doesn't make se
students %>% expand(nesting(school_id, student_id))
```

-### Continuous variables
-
-```{r}
-grid <- nlme::Oxboys %>%
- as_data_frame() %>%
- tidyr::expand(Subject, age = seq_range(age, 2))
-
-mod <- nlme::lme(height ~ age, random = ~1 | Subject, data = nlme::Oxboys)
-
-grid %>%
- add_predictions(mod = mod) %>%
- ggplot(aes(age, mod)) +
- geom_line(aes(group = Subject))
-```
-
### Interpolation vs extrapolation

-Also want to show "nearby data"
+One danger with prediction plots is that it's easy to make predictions that are far away from the original data. This is dangerous because it's quite possible that the model (which is a simplification of reality) will no longer apply far away from observed values.

-### Exercises
+To help avoid this problem, it's good practice to include "nearby" observed data points in any prediction plot. These help you see if you're interpolating, making predictions "in between" existing data points, or extrapolating, making predictions about previously unobserved slices of the data.

-1. How does the plot of model coefficients compare to the plot of means
- and medians computed "by hand" in the previous chapter? Create a plot
- that highlights the differences and similarities.
+One way to do this is to use `condvis::visualweight()`.

## Case study: predicting flight delays

From 8cfd653441526c563b8face540116fbf3f2d3b07 Mon Sep 17 00:00:00 2001 From: hadley Date: Tue, 17 May 2016 08:02:00 -0500 Subject: [PATCH 17/19] Continuing to think about model prediction --- model-vis.Rmd | 118 +++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 93 insertions(+), 25 deletions(-) diff --git a/model-vis.Rmd b/model-vis.Rmd index 4ebca22..959f152 100644 --- a/model-vis.Rmd +++ b/model-vis.Rmd @@ -49,6 +49,11 @@ To do this we're going to use some helper functions from the modelr package. Thi ```{r} library(modelr) + +options( + contrasts = c("contr.treatment", "contr.treatment"), + na.action = na.exclude +) ``` In the course of modelling, you'll often discover data quality problems.
Maybe a missing value is recorded as 999. Whenever you discover a problem like this, you'll need to review and update your import scripts. You'll often discover a problem with one variable, but you'll need to think about it for all variables. This is often frustrating, but it's typical. @@ -318,7 +323,7 @@ grid %>% When you have a continuous variable in the model, rather than using the unique values that you've seen, it's often more useful to generate an evenly spaced grid. One convenient way to do this is with `modelr::seq_range()` which takes a continuous variable, calculates its range, and then generates evenly spaced points between the minimum and maximum.

```{r}
-mod <- MASS::rlm(n ~ wday * date, data = daily)
+mod <- MASS::rlm(n ~ wday * yday(date), data = daily)

grid <- daily %>%
 tidyr::expand(wday, date = seq_range(date, n = 13)) %>%
 add_predictions(mod = mod)

ggplot(grid, aes(date, mod, colour = wday)) +
 geom_line() +
 geom_point()
```

+(Why use `yday(date)` instead of `date`? That's saying we think that the pattern depends only on the day of the year, so we might expect it to be the same in other years. Again, since we only have a single year of data we can't test that hypothesis.)
+
-We're going to be using this pattern for a few examples, so let's wrap it up into a function:
+We're going to be using this pattern for a few examples, so let's wrap it up into a function. This is a rather inflexible function, but it serves its purpose here. In my experience, it's challenging to write functions that apply to wide swathes of your modelling process because you tend to do a lot of experimentation and your models can differ quite radically in form, and hence in how you need to visualise them.

```{r}
vis_flights <- function(mod) {
 daily %>%
 tidyr::expand(wday, date = seq_range(date, n = 13)) %>%
 add_predictions(mod = mod) %>%
 ggplot(aes(date, mod, colour = wday)) +
 geom_line() +
 geom_point()
}
```

-This is more useful if you have a model that includes non-linear components. One way to get that is to include non-linear terms like `I(x ^ 2)`, `I(x ^ 3)` etc. You can't just use `x ^ 2` because of the way the modelling algebra works. `x ^ 2` is equivalent to `x * x` which in the modelling algebra is equivalent to `x + x + x:x` which is the same as `x`. This is useful because `(x + y + z)^2` fits all major terms and second-order interactions of x, y, and z.
+This is more useful if you have a model that includes non-linear components. You might wonder how you can use a linear model to handle non-linear patterns! But basically we're going to do something like a Taylor series expansion - we're going to approximate a non-linear pattern with a series of transformations. Technically, it's still "linear" in the linear model sense because each parameter multiplies an input variable. This is different from true non-linear models where the coefficients appear in other places like $\alpha_1 + (\alpha_2 - \alpha_1) \exp(-\exp(\alpha_3) x)$, the asymptotic regression model.
+
+One way to get that is to include non-linear terms like `I(x ^ 2)`, `I(x ^ 3)` etc. You can't just use `x ^ 2` because of the way the modelling algebra works. `x ^ 2` is equivalent to `x * x` which in the modelling algebra is equivalent to `x + x + x:x` which is the same as `x`. This is useful because `(x + y + z)^2` fits all major terms and second-order interactions of x, y, and z. The short sketch that follows shows this algebra in action.
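To see the algebra concretely, here's a tiny sketch (the toy data frame is made up for illustration, not from the text): `model.matrix()` reveals the columns each formula actually generates.

```{r}
# x ^ 2 simplifies to just x under the modelling algebra, while I(x ^ 2)
# protects the arithmetic meaning and really squares the variable.
df <- data.frame(x = 1:3)
model.matrix(~ x ^ 2, data = df)    # columns: (Intercept), x
model.matrix(~ I(x ^ 2), data = df) # columns: (Intercept), I(x^2)
```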
+
+But rather than using this laborious formulation (which also doesn't work here because our `x` is a date, and you can't raise a date to a power), a better solution is to use `poly(x, n)` which generates `n` polynomials. (They are Legendre polynomials, which means that each is uncorrelated with any of the previous ones, which also makes model fitting a bit easier.)

```{r}
-MASS::rlm(n ~ wday * poly(date, 5), data = daily) %>% vis_flights()
+MASS::rlm(n ~ wday * poly(yday(date), 5), data = daily) %>% vis_flights()
```

+(Note the warning message that the model failed to converge - this typically suggests a model misspecification problem. We should be very cautious about interpreting this model. You'll explore the model more in the exercises.)

-One problem with polynomials is that they have bad tail behaviour - outside of the range of the data they will rapidly shoot towards either positive or negative infinity. One solution to this is splines.
+One problem with polynomials is that they have bad tail behaviour - outside of the range of the data they will rapidly shoot towards either positive or negative infinity. One solution to this is splines. Let's look at this with a simpler example:

-I'm not going to explain them in any detail here, but they're useful whenever you want to fit irregular patterns.
+```{r}
+df <- data_frame(
+ x = rep(1:10, each = 3),
+ y = 30 - (x - 5) ^ 2 + rnorm(30, sd = 6)
+)
+ggplot(df, aes(x, y)) + geom_point()
+```
+
+Then we fit the two models and compute predictions from both.
+
+```{r}
+library(splines)
+mod_poly <- lm(y ~ poly(x, 2), data = df)
+mod_ns <- lm(y ~ ns(x, 2), data = df)
+
+preds <- df %>%
+ expand(x = seq_range(x, 20)) %>%
+ add_predictions(poly = mod_poly, ns = mod_ns)
+```
+
+They look equally good in the range of the data:
+
+```{r}
+ggplot(df, aes(x, y)) +
+ geom_point() +
+ geom_line(data = preds, aes(y = poly, colour = "poly")) +
+ geom_line(data = preds, aes(y = ns, colour = "ns"))
+```
+
+But when we extend past the existing range of the data, we see:
+
+```{r}
+preds <- df %>%
+ expand(x = seq(-5, 15, length = 30)) %>%
+ add_predictions(poly = mod_poly, ns = mod_ns)
+
+ggplot(df, aes(x, y)) +
+ geom_point() +
+ geom_line(data = preds, aes(y = poly, colour = "poly")) +
+ geom_line(data = preds, aes(y = ns, colour = "ns"))
+```
+
+The natural splines are designed to only linearly interpolate outside the range of the data. In general, this leads to somewhat safer behaviour, but in either case you should be cautious about extrapolating. These are not generative models - they're just pragmatic fits to patterns. We shouldn't expect them to do a good job outside the data we've observed.
+
+In general, splines are a useful tool when you have patterns that aren't linear but that you don't have a good explicit model for. Interestingly, using a natural spline instead of poly also fixes our convergence problem, while yielding much the same results:

```{r}
library(splines)
MASS::rlm(n ~ wday * ns(date, 5), data = daily) %>% vis_flights()
```

+In particular, we see the strong pattern on Saturdays that we identified when coming in the opposite direction. It's always a good sign when you see the same signal from multiple approaches. (But note our previous model was explanatory - this one is just predictive.)
+
+How many degrees of freedom should you use? Either pick manually to extract the shape of the data, or use one of the model assessment techniques in the following chapter to pick algorithmically.
Here we're most interested in explanation, so picking by hand (with a little thought and plenty of scepticism) is typically fine.

Other useful arguments to `seq_range()`:

* `pretty = TRUE` will generate a "pretty" sequence, i.e. something that looks
- nice to the human eye:
+ nice to the human eye. This is useful if you want to produce tables of
+ output:

 ```{r}
 seq_range(c(0.0123, 0.923423), n = 5)
 seq_range(c(0.0123, 0.923423), n = 5, pretty = TRUE)
 ```
-
+
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
 variable has a long-tailed distribution and you want to focus on generating
 values near the center:

 ```{r}
 x <- rcauchy(100)
 seq_range(x, n = 5)
 seq_range(x, n = 5, trim = 0.10)
 seq_range(x, n = 5, trim = 0.25)
 seq_range(x, n = 5, trim = 0.50)
 ```

### Computed variables

+If you're experimenting with many models and many visualisations, it's a good idea to bundle the creation of variables up into a function so there's no chance of accidentally applying a different transformation in different places.
+
```{r}
-daily %>%
- expand(date) %>%
- mutate(
+compute_vars <- function(data) {
+ data %>% mutate(
 term = term(date),
 wday = wday(date, label = TRUE)
- ) %>%
- add_predictions(pred = mod2) %>%
- ggplot(aes(date, pred)) +
- geom_line()
+ )
+}
```

-If you're experimenting with many models and many visualisations, it's a good idea to bundle the creation of variables up into a function so there's no chance of accidentally applying a different transformation in different places.
-
Another option is to wrap it into the model formula:

```{r}
-term <- function(date) {
- cut(date,
- breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)),
- labels = c("spring", "summer", "fall")
- )
-}
wday2 <- function(x) wday(x, label = TRUE)

mod3 <- lm(n ~ wday2(date) * term(date), data = daily)
daily %>%
 expand(date) %>%
 add_predictions(pred = mod3)
```

-I think this is fine to do provided that you've carefully checked that the functions do what you think they do (i.e. with a visualisation). The main disadvantage is that if you're looking at the coefficients, their values are longer and harder to read. (But this is a general problem with the way that linear models report categorical coefficients in R, not a specific problem with this case.)
+I think this is fine to do provided that you've carefully checked that the functions do what you think they do (i.e. with a visualisation). There are two disadvantages:
+
+1. You may need to add the variables back in anyway if you want to use
+ them in a visualisation.
+
+1. When looking at the coefficients, their values are longer and harder to
+ read. (But this is a general problem with the way that linear models report
+ categorical coefficients in R, not a specific problem with this case.)

### Nested variables

### Interpolation vs extrapolation

One danger with prediction plots is that it's easy to make predictions that are far away from the original data. This is dangerous because it's quite possible that the model (which is a simplification of reality) will no longer apply far away from observed values.

To help avoid this problem, it's good practice to include "nearby" observed data points in any prediction plot. These help you see if you're interpolating, making predictions "in between" existing data points, or extrapolating, making predictions about previously unobserved slices of the data.

One way to do this is to use `condvis::visualweight()`.

+### Exercises
+
+1. In the use of `rlm` with `poly()`, the model didn't converge. Carefully
+ read the documentation for `rlm` and figure out which parameter controls
+ the number of iterations. Can you increase the iterations so that the
+ model converges? If so, how does the model differ from the model that
+ didn't converge?
+
## Case study: predicting flight delays

+We'll finish off with a somewhat more realistic case study where we combine the techniques of visualising predictions and residuals to attack the problem of predicting flight delays.
+
We can't predict delays for next year. Why not?
Instead we'll focus on predicting the amount that your flight will be delayed if it's leaving soon.

We'll start with some exploratory analysis, and then work on the model:

* time of day
* weather

```{r}
+
+
+delays <- flights %>%
+ mutate(date = make_datetime(year, month, day)) %>%
+ group_by(date) %>%
+ summarise(delay = mean(arr_delay, na.rm = TRUE), cancelled = mean(is.na(dep_time)), n = n())
+
+# delays %>%
+# ggplot(aes(wday(date, label = TRUE), delay)) +
+# geom_boxplot()
+
+delays %>%
+ ggplot(aes(n, delay)) +
+ geom_point() +
+ geom_smooth(se = F)
```

From 1ae26bdbe9e0ac0236fa58b0fef73c73bc17dc17 Mon Sep 17 00:00:00 2001 From: hadley Date: Tue, 17 May 2016 08:07:59 -0500 Subject: [PATCH 18/19] Drop extensions (This chapter is already going to be big) --- model-vis.Rmd | 21 --------------------- 1 file changed, 21 deletions(-) diff --git a/model-vis.Rmd b/model-vis.Rmd index 959f152..883d7de 100644 --- a/model-vis.Rmd +++ b/model-vis.Rmd @@ -542,24 +542,3 @@ delays %>% geom_smooth(se = F) ``` - - - -## Linear model extensions - -### Non-linearity with splines - -help() - -### Transformations to "stabilise" variance - -glm - -Predicting probability of cancellation - -### Robustness - -### Mixed effects models - -### Shrinkage - From d101d7e62a224dc4722f350aad5b8f03a36b6bd5 Mon Sep 17 00:00:00 2001 From: hadley Date: Tue, 17 May 2016 09:03:04 -0500 Subject: [PATCH 19/19] Tweaking intro order --- model-vis.Rmd | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/model-vis.Rmd b/model-vis.Rmd index 883d7de..759a7e8 100644 --- a/model-vis.Rmd +++ b/model-vis.Rmd @@ -16,19 +16,26 @@ In this chapter we will explore model visualisation from two different sides: 1. Use visualisation to understand what a model is telling us about our data.

-We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways.
+We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you understand what your model is telling you.

-What is a good model? We'll think about that more in the next chapter. For now, a good model captures the majority of the patterns that are generated by the underlying mechanism of interest, and captures few patterns that are not generated by that mechanism. Another way to frame that is that you want your model to be good at inference, not just description. Inference is one of the most important parts of a model - you want to not just make statements about the data you have observed, but data that you have not observed (like things that will happen in the future).
+It's centered around looking at residuals and looking at predictions. You'll see those here applied to linear models (and some minor variations), but it's a flexible technique since every model can generate predictions and residuals.

-Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you understand what your model is telling you.
-
-Focus on constructing models that help you better understand the data. This will generally lead to models that predict better. But you have to beware of overfitting the data - in the next section we'll discuss some formal methods. But a healthy dose of scepticism is also powerful: do you believe that a pattern you see in your sample is going to generalise to a wider population?
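That scepticism can be made concrete even before the formal methods. Here's a minimal sketch with simulated data (every name and number below is made up for illustration) comparing a model's error on the data it was fit to against its error on held-out data:

```{r}
# An overly flexible model looks great on the data it was trained on,
# but much worse on data it hasn't seen.
set.seed(1014)
sim <- data.frame(x = runif(100))
sim$y <- 2 * sim$x + rnorm(100)

train <- sim[1:50, ]
test <- sim[51:100, ]

mod <- lm(y ~ poly(x, 10), data = train)

rmse <- function(data) sqrt(mean((data$y - predict(mod, data))^2))
rmse(train) # flattering
rmse(test)  # closer to the truth
```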
+Attack the problem from two directions: building up from a simple model, and subtracting off the full dataset. Transition from implicit knowledge in your head and in data to explicit knowledge in the model. In other words, you want to take the implicit knowledge of your data and capture it explicitly in a model. This makes it easier to apply to new domains, and easier for others to use. But you must always remember that your knowledge is incomplete. Subtract patterns from the data, and add patterns to the model.
+
+
+
+What is a good model? We'll think about that more in the next chapter. For now, a good model captures the majority of the patterns that are generated by the underlying mechanism of interest, and captures few patterns that are not generated by that mechanism. Another way to frame that is that you want your model to be good at inference, not just description. Inference is one of the most important parts of a model - you want to not just make statements about the data you have observed, but data that you have not observed (like things that will happen in the future).
+
+Focus on constructing models that help you better understand the data. This will generally lead to models that predict better. But you have to beware of overfitting the data - in the next section we'll discuss some formal methods. But a healthy dose of scepticism is also powerful: do you believe that a pattern you see in your sample is going to generalise to a wider population?
+
+
+
+For very large and complex datasets this is going to be a lot of work. There are certainly alternative approaches - a more machine-learning-oriented approach is simply to focus on improving the predictive ability of the model, being careful to fairly assess it (i.e. not assessing the model on the data that was used to train it). These approaches tend to produce black boxes - i.e. the model does a really good job, but you don't know why. This is fine, but the main problem is that you can't apply your real-world knowledge to the model to think about whether or not it's likely to work in the long term, as fundamentals change. For most real models, I'd expect you to use some combination of this approach and an ML model-building approach.
If prediction is important, get to a good point, and then use visualisation to understand the most important parts of the model.

> A long time ago in art class, my teacher told me "An artist needs to know
> when a piece is done. You can't tweak something into perfection - wrap it up.
> If you don't like it, do it over again. Otherwise begin something new". Later
> in life, I heard "A poor seamstress makes many mistakes. A good seamstress
> works hard to correct those mistakes. A great seamstress isn't afraid to
> throw out the garment and start over."

-- Reddit user Broseidon241, https://www.reddit.com/r/datascience/comments/4irajq/mistakes_made_by_beginningaspiring_data_scientists/

-For very large and complex datasets this is going to be a lot of work. There are certainly alternative approaches - a more machine-learning-oriented approach is simply to focus on improving the predictive ability of the model, being careful to fairly assess it (i.e. not assessing the model on the data that was used to train it). These approaches tend to produce black boxes - i.e. the model does a really good job, but you don't know why. This is fine, but the main problem is that you can't apply your real-world knowledge to the model to think about whether or not it's likely to work in the long term, as fundamentals change. For most real models, I'd expect you to use some combination of this approach and an ML model-building approach. If prediction is important, get to a good point, and then use visualisation to understand the most important parts of the model.
-
-
 In the next chapter, you'll also learn about how to visualise the model-level summaries, and the model parameters. To do this we're going to use some helper functions from the modelr package. This package provides some wrappers around the traditional base R modelling functions that make them easier to use in data manipulation pipelines. Currently at but will need to be on CRAN before the book is published.
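To give a flavour of what those wrappers look like in a pipeline, here's a minimal sketch. It mirrors the named-column form of `add_predictions()` and `add_residuals()` used throughout this chapter (the `mtcars` example itself is made up here, and the CRAN release's signatures may end up differing):

```{r}
library(modelr)
library(dplyr)

mod <- lm(mpg ~ wt, data = mtcars)

# Each helper returns the data frame with one new column, named by the
# argument, so modelling output slots straight into a dplyr pipeline.
mtcars %>%
  add_predictions(pred = mod) %>%
  add_residuals(resid = mod) %>%
  select(mpg, wt, pred, resid) %>%
  head()
```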