r4ds/model.Rmd

# (PART) Model {-}

# Introduction {#model-intro}

Now that you are equipped with powerful programming tools, we can finally return to modelling. You'll use your new tools of data wrangling and programming to fit many models and understand how they work. The focus of this book is on exploration, not confirmation or formal inference. But you'll learn a few basic tools that help you understand the variation within your models.

```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-model.png")
```

The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in). Here we only cover "predictive" models, which, as the name suggests, generate predictions. There is another type of model that we're not going to discuss: "data discovery" models. These models don't make predictions, but instead help you discover interesting relationships within your data. (These two categories of models are sometimes called supervised and unsupervised, but I don't think that terminology is particularly illuminating.)
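
To make the signal/noise distinction concrete, here is a minimal sketch using simulated data. Everything in it is an assumption chosen for illustration: the "true" signal is the straight line `y = 1 + 2 * x`, the noise is normally distributed, and the seed and sample size are arbitrary. A linear model fit to the noisy data should recover coefficients close to, but not exactly equal to, the ones used to generate it.

```{r}
# Illustrative simulation only: the signal, noise level, and seed are assumptions.
set.seed(42)

x      <- runif(100, min = 0, max = 10)
signal <- 1 + 2 * x              # pattern generated by the "phenomenon of interest"
noise  <- rnorm(100, sd = 2)     # random variation we are not interested in
sim    <- data.frame(x = x, y = signal + noise)

# Fit a linear model; it summarises 100 points with just two numbers
mod <- lm(y ~ x, data = sim)
coef(mod)
```

The two estimated coefficients are the low-dimensional summary: they describe the signal, while the residuals hold whatever the model treated as noise.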

This book is not going to give you a deep understanding of the mathematical theory that underlies models. It will, however, build your intuition about how statistical models work, and give you a family of useful tools that allow you to use models to better understand your data:

* In [model basics], you'll learn how models work mechanistically, focussing on
  the important family of linear models. You'll learn general tools for gaining
  insight into what a predictive model tells you about your data, focussing on
  simple simulated datasets.

* In [model building], you'll learn how to use models to pull out known
  patterns in real data. Once you have recognised an important pattern
  it's useful to make it explicit in a model, because then you can
  more easily see the subtler signals that remain.

* In [many models], you'll learn how to use many simple models to help 
  understand complex datasets. This is a powerful technique, but to access
  it you'll need to combine modelling and programming tools.

* In [model assessment], you'll learn more about the statistical side of
  modelling. Ideally, you want a model that doesn't just work with the 
  data that you've observed, but also generalises to new situations. You'll 
  learn two powerful techniques, cross-validation and bootstrapping, built 
  on the powerful idea of random resamples. These will help you understand
  how your model will behave on new datasets.

## Hypothesis generation vs. hypothesis confirmation

In this book, we are going to use models as a tool for exploration, completing the trifecta of EDA tools introduced in Part 1. This is not how models are usually taught, but they make for a particularly useful tool in this context. Every exploratory analysis will involve some transformation, modelling, and visualisation. 

Models are more commonly taught as tools for doing inference, or for confirming that a hypothesis is true. Doing this correctly is not complicated, but it is hard. There is a pair of ideas that you must understand in order to do inference correctly:

1. Each observation can either be used for exploration or confirmation, 
   not both.

1. You can use an observation as many times as you like for exploration,
   but you can only use it once for confirmation. As soon as you use an 
   observation twice, you've switched from confirmation to exploration.
   
This is necessary because to confirm a hypothesis you must use data that is independent of the data that you used to generate the hypothesis. Otherwise you will be overoptimistic. There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading. If you are serious about doing a confirmatory analysis, before you begin the analysis you should split your data up into three pieces:

1.  60% of your data goes into a __training__ (or exploration) set. You're 
    allowed to do anything you like with this data: visualise it and fit tons 
    of models to it.
  
1.  20% goes into a __query__ set. You can use this data to compare models 
    or visualisations by hand, but you're not allowed to use it as part of
    an automated process.

1.  20% is held back for a __test__ set. You can only use this data ONCE, to 
    test your final model. 
    
This partitioning allows you to explore the training data, occasionally generating candidate hypotheses that you check with the query set. When you are confident you have the right model, you can check it once with the test data.
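
One way to make the 60/20/20 split is sketched below using only base R. This is an illustration under stated assumptions, not a prescribed method: `partition_data()` is a hypothetical helper written for this example, `mtcars` stands in for your own data, and the seed is arbitrary. The key point is that each row is assigned to exactly one piece, and that the assignment is random.

```{r}
# `partition_data()` is a hypothetical helper for illustration, not a package function.
partition_data <- function(df, seed = 1014) {
  set.seed(seed)                 # make the random assignment reproducible
  n   <- nrow(df)
  idx <- sample(n)               # shuffle the row indices

  n_train <- floor(0.6 * n)
  n_query <- floor(0.2 * n)
  n_test  <- n - n_train - n_query   # whatever remains, roughly 20%

  list(
    training = df[idx[seq_len(n_train)], , drop = FALSE],
    query    = df[idx[n_train + seq_len(n_query)], , drop = FALSE],
    test     = df[tail(idx, n_test), , drop = FALSE]
  )
}

parts <- partition_data(mtcars)
sapply(parts, nrow)              # number of rows in each piece
```

Do the split once, before you look at the data, and then set `parts$test` aside until you have settled on a final model.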

(Note that even when doing confirmatory modelling, you will still need to do EDA. If you don't do any EDA you will remain blind to the quality problems with your data.)