Rewrite modelling intro

This commit is contained in:
hadley 2016-07-18 13:08:49 -05:00
parent 11294f5d0c
commit 10fbd4ce9f
3 changed files with 43 additions and 50 deletions

View File

@ -62,23 +62,6 @@ If you're competing in competitions, like Kaggle, that are predominantly about c
There is a closely related family that uses a similar idea: model ensembles. However, instead of trying to find the best models, ensembles make use of all the models, acknowledging that even models that don't fit all the data particularly well can still model some subsets well. In general, you can think of model ensemble techniques as functions that take a list of models, and return a single model that attempts to take the best part of each.
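To make the idea concrete, here is a minimal sketch (the data frame `df` and the three candidate models are made up, and simple averaging is just one of many ways to combine predictions):

```{r}
library(purrr)

# A made-up dataset and three candidate models of increasing flexibility.
df <- data.frame(x = runif(100), y = runif(100))
models <- list(
  lm(y ~ x, data = df),
  lm(y ~ poly(x, 2), data = df),
  lm(y ~ poly(x, 3), data = df)
)

# The "ensemble" prediction is just the average of the individual
# models' predictions.
preds <- map(models, predict, newdata = df)
ensemble <- reduce(preds, `+`) / length(preds)
```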
### Confirmatory analysis
There is a split between exploratory and confirmatory analysis. The focus of this book is on using data to generate hypotheses and explore them.
Either is fine, but confirmatory analysis is much, much harder. If you want your confirmatory analysis to be correct, you need to take a stricter approach:
1. 60% of your data goes into a __training__ set. You're allowed to do
anything you like with this data: visualise it, fit tons of models to it,
cross-validate it.
1. 20% goes into a __query__ set. You can use this data
to compare models by hand, but you're not allowed to use it automatically.
1. 20% is held back for a __test__ set. You can only use this
data ONCE, to test your final model. If you use this data more than
once you're no longer doing confirmatory analysis, you're doing exploratory
analysis.
### Prerequisites

View File

@ -1,9 +1,10 @@
# Model
# Model basics
We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you understand what your model is telling you.
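For example, here's a small sketch of that mindset (the simulated data frame is made up; `add_predictions()` and `add_residuals()` come from the modelr package): predictions and residuals become ordinary columns that you can manipulate and plot like any other variable.

```{r}
library(dplyr)
library(ggplot2)
library(modelr)

# Made-up data: a linear signal plus noise.
sim <- tibble(x = 1:50, y = 5 + 2 * x + rnorm(50, sd = 10))
mod <- lm(y ~ x, data = sim)

# Predictions and residuals are just new columns in a regular data frame.
sim <- sim %>%
  add_predictions(mod) %>%
  add_residuals(mod)

# So you can visualise them with the same tools you use for any other data.
ggplot(sim, aes(x, resid)) +
  geom_ref_line(h = 0) +
  geom_point()
```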
The goal of a fitted model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
A model is a tool for making predictions. The goal of a model is to be simple and useful.
This is a hard problem because any fitted model is just the "best" (closest) model from a family of models. Just because it's the best model doesn't make it good. And it certainly doesn't imply that the model is true. But a model doesn't need to be true to be useful. You've probably heard George Box's famous aphorism:
> All models are wrong, but some are useful.
@ -720,6 +721,8 @@ There are other types of modeling algorithms; each provides a valid description
Which description will be best? Does the relationship have a known form? Does the data have a known structure? Are you going to attempt hypothesis testing that imposes its own constraints?
In the course of modelling, you'll often discover data quality problems. Maybe a missing value is recorded as 999. Whenever you discover a problem like this, you'll need to review and update your import scripts. You'll often discover a problem with one variable, but you'll need to think about it for all variables. This is often frustrating, but it's typical.
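For example, once you discover that 999 really means "missing", a fix like the sketch below (the column names are made up) belongs in your import script, applied to every variable rather than just the one where you noticed the problem:

```{r}
library(dplyr)

# Made-up raw data where 999 was used as a missing-value code.
raw <- tibble(height = c(172, 999, 165), weight = c(70, 82, 999))

# Recode 999 to NA across every column, not just the one you spotted.
clean <- raw %>%
  mutate(across(everything(), ~ na_if(., 999)))
```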

View File

@ -1,46 +1,53 @@
```{r include=FALSE, cache=FALSE}
set.seed(1014)
options(digits = 3)
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE,
  cache = TRUE
)
options(dplyr.print_min = 6, dplyr.print_max = 6)
```
# (PART) Model {-}
# Introduction
The scientific method guides data science, and data science solves known problems with the scientific method.
The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in). Here we only cover "predictive" models, which, as the name suggests, generate predictions. There is another type of model that we're not going to discuss: "data discovery" models. These models don't make predictions, but instead help you discover interesting relationships within your data.
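As a toy illustration (the generating process below is made up), the "signal" is the straight line and the "noise" is the random scatter around it; a fitted model should recover something close to the former while ignoring the latter:

```{r}
# Made-up generating process: the signal is y = 3 + 1.5 * x, the noise is rnorm().
x <- runif(200, 0, 10)
y <- 3 + 1.5 * x + rnorm(200, sd = 2)

mod <- lm(y ~ x)
# The fitted coefficients should land near 3 and 1.5 (the signal);
# the scatter left around the fitted line is the noise we don't model.
coef(mod)
```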
In this book, we'll focus on deep tools for rather simple models. We want to give you tools to help you build your intuition for what models do. There's little mathematical formalism, and only a relatively small overlap with how modelling is normally presented in statistics. Modelling is a huge topic and we can only scratch the surface here. You'll definitely need to consult other resources as the complexity of your models grows.
This book is not going to give you a deep understanding of the mathematical theory that underlies models. It will, however, build your intuition about how statistical models work, and give you a family of useful tools that allow you to use models to better understand your data:
We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data - you're going to want to manipulate it and visualise it in many different ways. Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you understand what your model is telling you.
* In [model basics], you'll learn how models work, focussing on the important
family of linear models. You'll learn general tools for gaining insight
into what a predictive model tells you about your data, focussing on simple
simulated datasets.
In the course of modelling, you'll often discover data quality problems. Maybe a missing value is recorded as 999. Whenever you discover a problem like this, you'll need to review and update your import scripts. You'll often discover a problem with one variable, but you'll need to think about it for all variables. This is often frustrating, but it's typical.
* In [model building], you'll learn how to use models to pull out known
patterns in real data. Once you have recognised an important pattern
it's useful to make it explicit in a model, because then you can
more easily see the subtler signals that remain.
<https://blog.engineyard.com/2014/pets-vs-cattle>.
<https://en.wikipedia.org/wiki/R/K_selection_theory>
* In [many models], you'll learn how to use many simple models to help
understand complex datasets. This is a powerful technique, but to access
it you'll need to combine modelling and programming tools.
## Fitted models vs. families of models
* In [model assessment], you'll learn a little bit about how you might
quantitatively assess whether a model is good or not. You'll learn two
powerful techniques, cross-validation and bootstrapping, that are built
on the idea of generating many random datasets which you fit many
models to.
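The "many random datasets" idea in that last bullet can be sketched with the modelr package (the data frame and model formula here are placeholders, not the chapter's actual examples):

```{r}
library(modelr)
library(purrr)

# A placeholder dataset.
df <- data.frame(x = runif(100), y = runif(100))

# Monte Carlo cross-validation: 50 random train/test splits,
# with a model fitted to each training piece.
cv <- crossv_mc(df, n = 50)
cv_models <- map(cv$train, ~ lm(y ~ x, data = .))

# Bootstrapping: 50 datasets resampled with replacement from df.
boots <- bootstrap(df, n = 50)
boot_models <- map(boots$strap, ~ lm(y ~ x, data = .))
```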
Family of models vs. fitted model: a set of possible models vs. one specific model. A fitted model = a family of models plus a dataset.
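A quick sketch of the distinction (the dataset is made up): the family here is every straight line y = a1 + a2 * x; you can write down arbitrarily many members of it, but fitting picks out the single member that is closest to one particular dataset.

```{r}
library(tibble)

# A made-up dataset.
sim <- tibble(x = 1:30, y = 4 + 2 * x + rnorm(30, sd = 5))

# The *family* of models: every straight line y = a1 + a2 * x.
# Here are 250 arbitrary members of that family.
models <- tibble(a1 = runif(250, -20, 40), a2 = runif(250, -5, 5))

# A *fitted* model: the one member of that family that lm() judges
# closest to this particular dataset.
sim_mod <- lm(y ~ x, data = sim)
coef(sim_mod)
```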
In this book, we are going to use models as a tool for exploration, completing the trifecta of EDA tools introduced in Part 1. This is not how models are usually taught, but they make for a particularly useful tool in this context. Every exploratory analysis will involve some transformation, modelling, and visualisation.
## Exploring vs. confirming
Models are more commonly taught as tools for doing inference, or for confirming that a hypothesis is true. Doing this correctly is not complicated, but it is hard. There is a pair of ideas that you must understand in order to do inference correctly:
In this book we are going to focus on models primarily as tools for description. This is rather non-standard because we're normally interested in models for their inferential power: their ability to make accurate predictions for observations that we haven't seen yet.
1. Each observation can either be used for exploration or confirmation,
not both.
In other words, in this book, we're typically going to think about a good model as a model that does a good job of capturing the patterns that we see in the data. For now, a good model captures the majority of the patterns that are generated by the underlying mechanism of interest, and captures few patterns that are not generated by that mechanism. When you go on from this book and learn other ways of thinking about models, this will stand you in good stead: if you can't capture patterns in the data that you can see, it's unlikely you'll be able to make good predictions about data that you haven't seen.
1. You can use an observation as many times as you like for exploration,
but you can only use it once for confirmation. As soon as you use an
observation twice, you've switched from confirmation to exploration.
This is necessary because to confirm a hypothesis you must use data that is independent of the data that you used to generate the hypothesis. Otherwise you will be over-optimistic. There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading. If you are serious about doing a confirmatory analysis, before you begin the analysis you should split your data up into three pieces (a code sketch of such a split follows below):
It's not possible to do both on the same dataset.
1. 60% of your data goes into a __training__ (or exploration) set. You're
allowed to do anything you like with this data: visualise it and fit tons
of models to it.
1. 20% goes into a __query__ set. You can use this data to compare models
or visualisations by hand, but you're not allowed to use it as part of
an automated process.
Doing correct inference is hard!
Generally, however, this will tend to make us over-optimistic about the quality of our model. In Chapter XXX you'll start to learn more about how we can judge the quality of a model on data that it wasn't fit to. But you have to beware of overfitting the data - in the next section we'll discuss some formal methods. A healthy dose of scepticism, however, can be as powerful as precise quantitative methods: do you believe that a pattern you see in your sample is going to generalise to a wider population?
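Here's a small made-up illustration of why fit to the training data can be over-optimistic: a very flexible model always beats a simpler nested one on the data it was fit to, but often does worse on data it hasn't seen.

```{r}
set.seed(123)

# Made-up data: a smooth signal plus noise, split in half.
df <- data.frame(x = runif(100, 0, 10))
df$y <- sin(df$x) + rnorm(100, sd = 0.3)
train <- df[1:50, ]
test  <- df[51:100, ]

simple  <- lm(y ~ poly(x, 3), data = train)
complex <- lm(y ~ poly(x, 15), data = train)

rmse <- function(mod, data) sqrt(mean((data$y - predict(mod, data))^2))

# The complex model always looks better on the data it was fit to...
c(rmse(simple, train), rmse(complex, train))
# ...but typically looks worse on the held-out data.
c(rmse(simple, test), rmse(complex, test))
```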
## Prediction vs. data discovery
PCA, clustering, ...
1. 20% is held back for a __test__ set. You can only use this data ONCE, to
test your final model.
This partitioning allows you to explore the training data, occasionally generating candidate hypotheses that you check with the query set. When you are confident you have the right model, you can check it once with the test data.
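As a concrete sketch of such a split (the data frame `df` is a placeholder for your real dataset, and the exact proportions are up to you):

```{r}
set.seed(1014)

# `df` stands in for your real dataset.
df <- data.frame(x = runif(1000), y = runif(1000))

# Randomly assign each row to training (60%), query (20%), or test (20%).
n <- nrow(df)
groups <- sample(rep(c("train", "query", "test"), times = n * c(0.6, 0.2, 0.2)))

train <- df[groups == "train", ]
query <- df[groups == "query", ]
test  <- df[groups == "test", ]   # use this once, right at the end
```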