More model-vis brainstorming

hadley 2016-05-16 08:11:16 -05:00
parent e8473755c0
commit 1ac7131bbc
1 changed file with 97 additions and 36 deletions


@@ -17,16 +17,31 @@ In this chapter we will explore model visualisation from two different sides:
We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways.
What is a good model? We'll think about that more in the next chapter. For now, a good model captures the majority of the patterns that are generated by the underlying mechanism of interest, and captures few patterns that are not generated by that mechanism. Another way to frame that is that you want your model to be good at inference, not just description. Inference is one of the most important parts of a model: you want to make statements not just about the data you have observed, but also about data that you have not observed (like things that will happen in the future).
Our approach is centred around looking at residuals and looking at predictions. You'll see these techniques applied here to linear models (and some minor variations), but they're flexible, because every model can generate predictions and residuals.
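As a minimal sketch of that workflow (assuming a hypothetical data frame `df` with an outcome `y` and a predictor `x`, and using the interface this draft relies on, where `add_predictions()` and `add_residuals()` take the new column name as the argument name):

```{r, eval = FALSE}
# Fit any model, then attach its predictions and residuals as ordinary columns
mod <- lm(y ~ x, data = df)

df %>%
  add_predictions(pred = mod) %>%
  add_residuals(resid = mod) %>%
  ggplot(aes(x, resid)) +
  geom_point()
```

Once the predictions and residuals are regular columns, every dplyr and ggplot2 tool you already know applies to them.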
Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you understand what your model is telling you.
Focus on constructing models that help you better understand the data. This will generally lead to models that predict better. But you have to beware of overfitting the data - in the next section we'll discuss some formal methods for guarding against that. A healthy dose of scepticism is also powerful: do you believe that a pattern you see in your sample is going to generalise to a wider population?
Transition from implicit knowledge in your head and in data to explicit knowledge in the model. In other words, you want to make explicit your knowledge of the data and capture it explicitly in a model. This makes it easier to apply to new domains, and easier for others to use. But you must always remember that your knowledge is incomplete. Subtract patterns from the data, and add patterns to the model.
When do you stop?
> A long time ago in art class, my teacher told me "An artist needs to know
> when a piece is done. You can't tweak something into perfection - wrap it up.
> If you don't like it, do it over again. Otherwise begin something new". Later
> in life, I heard "A poor seamstress makes many mistakes. A good seamstress
> works hard to correct those mistakes. A great seamstress isn't afraid to
> throw out the garment and start over."
> -- Reddit user Broseidon241, <https://www.reddit.com/r/datascience/comments/4irajq/mistakes_made_by_beginningaspiring_data_scientists/>
For very large and complex datasets this is going to be a lot of work. There are certainly alternative approaches - a more machine-learning-flavoured approach is simply to focus on improving the predictive ability of the model, being careful to fairly assess it (i.e. not assessing the model on the data that was used to train it). These approaches tend to produce black boxes: the model does a really good job, but you don't know why. This is fine, but the main problem is that you can't apply your real-world knowledge to the model to think about whether or not it's likely to keep working in the long term, as fundamentals change. For most real models, I'd expect you to use some combination of this approach and a more ML-style model-building approach. If prediction is important, get to a good point, and then use visualisation to understand the most important parts of the model.
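For example, here's a minimal sketch of that kind of fair assessment - a random train/test split with RMSE on the held-out rows (the split proportion is arbitrary, and `daily`, `n`, `wday`, and `term` are the flight-count data introduced later in this chapter):

```{r, eval = FALSE}
set.seed(1014)

# Hold out 20% of the days; fit on the rest
in_train <- sample(nrow(daily), size = floor(0.8 * nrow(daily)))
train <- daily[in_train, ]
test  <- daily[-in_train, ]

mod_train <- lm(n ~ wday * term, data = train)

# Assess on data the model never saw
sqrt(mean((test$n - predict(mod_train, newdata = test))^2))
```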
In the next chapter, you'll also learn how to visualise model-level summaries and model parameters.
To do this we're going to use some helper functions from the modelr package. This package provides some wrappers around the traditional base R modelling functions that make them easier to use in data manipulation pipelines. Currently at <https://github.com/hadley/modelr> but will need to be on CRAN before the book is published.
@@ -35,6 +50,7 @@ To do this we're going to use some helper functions from the modelr package. Thi
library(modelr)
```
In the course of modelling, you'll often discover data quality problems. Maybe a missing value is recorded as 999. Whenever you discover a problem like this, you'll need to review and update your import scripts. You'll often discover the problem in one variable, but you'll need to think about whether it affects the other variables too. This is often frustrating, but it's typical.
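For example, if you discover that 999 means missing, the fix belongs in your import script rather than scattered through your analysis (the file and column names here are hypothetical):

```{r, eval = FALSE}
# Treat 999 as missing for every column at import time
raw <- read_csv("my-data.csv", na = c("", "NA", "999"))

# Or recode a single known-bad column after import
raw <- raw %>%
  mutate(wind_speed = ifelse(wind_speed == 999, NA, wind_speed))
```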
## Residuals
@@ -138,11 +154,14 @@ daily %>%
So it looks like summer holidays are from early June to late August. That seems to line up fairly well with the [state's school terms](http://schools.nyc.gov/Calendar/2013-2014+School+Year+Calendars.htm): summer break is Jun 26 - Sep 9. So let's add a "term" variable to attempt to control for that.
```{r}
term <- function(date) {
  cut(date,
    breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)),
    labels = c("spring", "summer", "fall")
  )
}

daily <- daily %>% mutate(term = term(date))

daily %>%
  filter(wday == "Sat") %>%
@@ -152,7 +171,7 @@ daily %>%
  scale_x_datetime(NULL, date_breaks = "1 month", date_labels = "%b")
```
(I manually tweaked the dates to get nice breaks in the plot. Using a visualisation to help you understand what your function is doing is a really powerful and general technique.)
It's useful to see how this new variable affects the other days of the week:
@@ -195,17 +214,18 @@ middles %>%
We can reduce this problem by switching to a robust model fitted by `MASS::rlm()`. A robust model is a variation of the linear model which you can think of as fitting medians instead of means (it's a bit more complicated than that, but that's a reasonable intuition). This greatly reduces the impact of the outliers on our estimates, and gives a result that does a good job of removing the day-of-week pattern:
```{r, warning = FALSE}
mod3 <- MASS::rlm(n ~ wday * term, data = daily)
daily <- daily %>% add_residuals(n_resid3 = mod3)

ggplot(daily, aes(date, n_resid3)) +
  geom_hline(yintercept = 0, size = 2, colour = "white") +
  geom_line()
```
It's now much easier to see the long-term trend, and the positive and negative outliers.
It's very common to use residual plots when figuring out whether a model is OK, but it's easy to get the impression that there's just one type of residual plot you should make, when in fact there are infinitely many.
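As a sketch of that point, here are three residual plots of the same model that answer three different questions (reusing `daily` and `n_resid3` from above; the binwidth is an arbitrary choice):

```{r}
# Residuals by day of week: did the model capture the weekly pattern?
ggplot(daily, aes(wday, n_resid3)) +
  geom_boxplot()

# Residuals by school term: does the seasonal adjustment look right?
ggplot(daily, aes(term, n_resid3)) +
  geom_boxplot()

# Distribution of residuals: are there outliers or skew?
ggplot(daily, aes(n_resid3)) +
  geom_histogram(binwidth = 10)
```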
### Exercises
1. Use your google sleuthing skills to brainstorm why there were fewer than
@@ -275,7 +295,7 @@ grid <-
grid
```
And then we plot the predictions. Plotting predictions is usually the hardest bit, and you'll need a few tries before you find the most informative display. Depending on your model, it's quite possible that you'll need multiple plots to fully convey what the model is telling you about the data. Here's my attempt - it took me a few tries before I got something that I was happy with.
```{r}
grid %>%
@@ -292,10 +312,7 @@ grid %>%
daily %>%
  expand(date) %>%
  mutate(
    term = term(date),
    wday = wday(date, label = TRUE)
  ) %>%
  add_predictions(pred = mod2) %>%
@@ -305,6 +322,24 @@ daily %>%
If you're experimenting with many models and many visualisations, it's a good idea to bundle the creation of variables up into a function so there's no chance of accidentally applying a different transformation in different places.
Another option is to wrap it into the model formula:
```{r}
term <- function(date) {
  cut(date,
    breaks = as.POSIXct(ymd(20130101, 20130605, 20130825, 20140101)),
    labels = c("spring", "summer", "fall")
  )
}

mod3 <- lm(n ~ wday(date, label = TRUE) * term(date), data = daily)

daily %>%
  expand(date) %>%
  add_predictions(pred = mod3)
```
I think this is fine to do provided that you've carefully checked that the functions do what you think they do (i.e. with a visualisation).
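For example, one quick visual check that `term()` puts the breaks where you expect (a sketch; any plot that exposes the cut points would do):

```{r}
daily %>%
  ggplot(aes(date, n, colour = term(date))) +
  geom_point()
```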
### Nested variables
Another case that occasionally crops up is nested variables: you have an identifier that is locally unique, not globally unique. For example you might have this data about students in schools:
@@ -342,36 +377,62 @@ grid %>%
  geom_line(aes(group = Subject))
```
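Picking up the nested-variables idea: a locally unique identifier only means something in combination with its parent, so it's often worth constructing a globally unique key before joining or modelling. A minimal sketch, with hypothetical `school` and `student_id` columns:

```{r, eval = FALSE}
students <- tibble(
  school     = c("A", "A", "B", "B"),
  student_id = c(1, 2, 1, 2),  # student 1 in school A is not student 1 in school B
  score      = c(85, 91, 78, 88)
)

# A key that is unique across the whole dataset
students <- students %>%
  mutate(student = paste(school, student_id, sep = "-"))
```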
### Interpolation vs extrapolation
Also want to show "nearby data"
### Exercises
1. How does the plot of model coefficients compare to the plot of means
   and medians computed "by hand" in the previous chapter? Create a plot
   that highlights the differences and similarities.
## Case study: predicting flight delays
We can't predict delays for next year. Why not? Instead, we'll focus on predicting how much your flight will be delayed if it's leaving soon.
We'll start with some exploratory analysis, and then work on the model:
* time of day
* weather
```{r}
hourly <- flights %>%
  group_by(origin, time_hour) %>%
  summarise(
    delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  inner_join(weather, by = c("origin", "time_hour"))

# ggplot(hourly, aes(time_hour, delay)) +
#   geom_point()
#
# ggplot(hourly, aes(hour(time_hour), sign(delay) * sqrt(abs(delay)))) +
#   geom_boxplot(aes(group = hour(time_hour)))
#
# hourly %>%
#   filter(wind_speed < 999) %>%
#   ggplot(aes(temp, delay)) +
#   geom_point() +
#   geom_smooth()

delays <- flights %>%
  mutate(date = make_datetime(year, month, day)) %>%
  group_by(date) %>%
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    cancelled = mean(is.na(dep_time)),
    n = n()
  )

# delays %>%
#   ggplot(aes(wday(date, label = TRUE), delay)) +
#   geom_boxplot()

delays %>%
  ggplot(aes(n, delay)) +
  geom_point() +
  geom_smooth(se = FALSE)
```
## Learning more
<https://cran.rstudio.com/web/packages/condvis/>
## Linear model extensions
### Non-linearity with splines
help()
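One option that isn't fleshed out in this draft: a natural spline basis from the splines package lets the trend vary smoothly over the year instead of jumping at the hand-picked `term()` breaks. A sketch (the 5 degrees of freedom are an arbitrary choice, and `as.numeric()` just sidesteps any ambiguity about spline bases on date-times):

```{r}
library(splines)

mod_ns <- MASS::rlm(n ~ wday * ns(as.numeric(date), 5), data = daily)

daily %>%
  add_predictions(pred = mod_ns) %>%
  ggplot(aes(date, pred, colour = wday)) +
  geom_line()
```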
### Transformations to "stabilise" variance
glm
Predicting probability of cancellation
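A sketch of what that might look like: a logistic regression on a per-flight cancellation indicator (treating a missing `dep_time` as a cancellation, as above; the single day-of-week predictor is just an illustrative starting point):

```{r}
cancellations <- flights %>%
  mutate(
    cancelled = is.na(dep_time),
    wday = wday(make_datetime(year, month, day), label = TRUE)
  )

# Probability of cancellation as a function of day of week
mod_cancel <- glm(cancelled ~ wday, family = binomial, data = cancellations)

cancellations %>%
  mutate(pred = predict(mod_cancel, type = "response")) %>%
  group_by(wday) %>%
  summarise(prob_cancelled = mean(pred))
```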
### Robustness
### Mixed effects models
### Shrinkage