```{r include=FALSE, cache=FALSE}
set.seed(1014)
options(digits = 3)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
cache = TRUE
)
options(dplyr.print_min = 6, dplyr.print_max = 6)
```
# (PART) Model {-}
# Introduction
Data science is guided by the scientific method: it applies that method to solve known problems.
In this book, we'll focus on deep tools for rather simple models. We want to give you tools to help you build your intuition for what models do. There's little mathematical formalism, and only a relatively small overlap with how modelling is normally presented in statistics. Modelling is a huge topic and we can only scratch the surface here. You'll definitely need to consult other resources as the complexity of your models grows.
We're going to give you a basic strategy, and point you to places to learn more. The key is to think about data generated from your model as regular data: you're going to want to manipulate it and visualise it in many different ways. Being good at modelling is a mixture of having some good general principles and having a big toolbox of techniques. Here we'll focus on general techniques to help you understand what your model is telling you.
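
As a minimal sketch of treating model output as regular data, the chunk below fits a linear model to simulated data and stores its predictions and residuals as ordinary columns. The data frame `sim`, its coefficients, and the choice of `lm()` are assumptions made purely for illustration.

```{r}
set.seed(1014)

# Simulated data (hypothetical): a linear trend plus noise.
sim <- data.frame(x = rep(1:10, each = 3))
sim$y <- 4 + 2 * sim$x + rnorm(nrow(sim), sd = 2)

mod <- lm(y ~ x, data = sim)

# Store what the model generates alongside the original data.
sim$pred <- unname(predict(mod))
sim$resid <- unname(residuals(mod))

head(sim)

# Now the usual data tools apply, e.g. averaging the residuals at each x.
aggregate(resid ~ x, data = sim, FUN = mean)
```

Once predictions and residuals live in a data frame, they can be filtered, summarised, and plotted like any other variable.
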
In the course of modelling, you'll often discover data quality problems. Maybe a missing value is recorded as 999. Whenever you discover a problem like this, you'll need to review and update your import scripts. You'll often discover a problem with one variable, but you'll need to think about it for all variables. This is often frustrating, but it's typical.
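
To make the 999 example concrete, here is a small sketch in base R. The data frame `readings` and its columns are hypothetical; the point is that the sentinel is replaced in every numeric column, not only in the one where the problem was first noticed.

```{r}
# Hypothetical measurements in which 999 is a sentinel for "missing".
readings <- data.frame(
  id = 1:5,
  weight = c(61.2, 999, 58.4, 72.9, 999),
  height = c(1.70, 1.82, 999, 1.65, 1.77)
)

# Treating 999 as a real value badly distorts summaries.
mean(readings$weight)

# Check every numeric column, not just the one where the problem was spotted.
sentinel <- 999
numeric_cols <- vapply(readings, is.numeric, logical(1))
readings[numeric_cols] <- lapply(
  readings[numeric_cols],
  function(x) replace(x, x == sentinel, NA)
)

colSums(is.na(readings))
mean(readings$weight, na.rm = TRUE)
```

If you import with readr, the `na` argument of `read_csv()` is one place to record such a sentinel, so the fix lives in the import script rather than in the modelling code.
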
<https://blog.engineyard.com/2014/pets-vs-cattle>.
<https://en.wikipedia.org/wiki/R/K_selection_theory>
## Fitted models vs. families of models
A family of models is a set of possible models; a fitted model is one specific model from that set. A fitted model = a family of models plus a dataset.
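
The distinction can be sketched with `lm()`. Here the family is "all straight lines", expressed as the formula `y ~ x`, and two simulated datasets pick out two different members of it. The names `line_family`, `data_a`, and `data_b`, and the coefficients used to simulate them, are assumptions for illustration only.

```{r}
set.seed(1014)

# The family: all straight lines y = a_0 + a_1 * x, written as a formula.
line_family <- y ~ x

# Two simulated datasets (hypothetical; the true slopes are chosen arbitrarily).
x <- seq(0, 10, length.out = 50)
data_a <- data.frame(x = x, y = 2 + 0.5 * x + rnorm(50))
data_b <- data.frame(x = x, y = -1 + 3 * x + rnorm(50))

# Family + dataset = fitted model: one specific line from the family.
fit_a <- lm(line_family, data = data_a)
fit_b <- lm(line_family, data = data_b)

coef(fit_a)
coef(fit_b)
```

Both fits share the same family, but each dataset yields its own intercept and slope.
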
## Exploring vs. confirming
In this book we are going to focus on models primarily as tools for description. This is rather non-standard because we're normally interested in models for their inferential power: their ability to make accurate predictions for observations that we haven't seen yet.
In other words, in this book, we're typically going to think about a good model as one that does a good job of capturing the patterns that we see in the data. For now, a good model captures the majority of the patterns that are generated by the underlying mechanism of interest, and captures few patterns that are not generated by that mechanism. When you go on from this book and learn other ways of thinking about models, this will stand you in good stead: if you can't capture patterns in the data that you can see, it's unlikely you'll be able to make good predictions about data that you haven't seen.
It's not possible to both explore and confirm on the same dataset.
Doing correct inference is hard!
Generally, however, judging a model on the same data that was used to fit it will tend to make us over-optimistic about the quality of our model. In Chapter XXX you'll start to learn more about how we can judge the quality of a model on data that it wasn't fit to. You have to beware of overfitting the data; in the next section we'll discuss some formal methods. But a healthy dose of scepticism is also as powerful as precise quantitative methods: do you believe that a pattern you see in your sample is going to generalise to a wider population?
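
One way to see this over-optimism is to hold out some rows and compare errors. The sketch below is an illustration under assumptions, not a recommended workflow: the simulated data `obs`, the 70/30 split, the degree-10 polynomial, and the helper `rmse()` are all hypothetical choices.

```{r}
set.seed(1014)

# Simulated data (hypothetical): a straight-line signal plus noise.
n <- 100
obs <- data.frame(x = runif(n, 0, 10))
obs$y <- 1 + 0.8 * obs$x + rnorm(n, sd = 2)

# Hold out a random 30% of rows that the models never see while fitting.
test_rows <- sample(n, size = 30)
train <- obs[-test_rows, ]
test <- obs[test_rows, ]

simple <- lm(y ~ x, data = train)
flexible <- lm(y ~ poly(x, 10), data = train)

# Root-mean-squared error of a model's predictions on a given dataset.
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# On the training data the flexible model can never look worse,
# because it contains the straight line as a special case.
c(simple = rmse(simple, train), flexible = rmse(flexible, train))

# The held-out rows give a more honest comparison.
c(simple = rmse(simple, test), flexible = rmse(flexible, test))
```

The training error alone would always favour the more flexible model; only the held-out error says anything about how each model might generalise.
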
## Prediction vs. data discovery
PCA, clustering, ...