The goal of a model is to provide a simple low-dimensional summary of a dataset.
Ideally, the model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
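The signal/noise distinction can be made concrete with a tiny simulation (a hypothetical example, not from the text; it assumes NumPy, and every name in it is invented for illustration). We generate data whose true "signal" is a straight line, add random "noise", and check that a simple linear model recovers the signal while averaging the noise away:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
signal = 2 * x + 1                  # the pattern we care about
noise = rng.normal(0, 1, x.size)    # random variation we want to ignore
y = signal + noise

# Fit a one-degree polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))  # close to the true 2 and 1
```

The fitted coefficients land near the true values because the noise, being random, largely cancels out across observations, which is exactly the low-dimensional summary a model is meant to provide.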
Here we only cover "predictive" models, which, as the name suggests, generate predictions.
There is another type of model that we're not going to discuss: "data discovery" models.
These models don't make predictions, but instead help you discover interesting relationships within your data.
(These two categories of models are sometimes called supervised and unsupervised, but I don't think that terminology is particularly illuminating.)
This book is not going to give you a deep understanding of the mathematical theory that underlies models.
It will, however, build your intuition about how statistical models work, and give you a family of useful tools that allow you to use models to better understand your data:
- In [model building], you'll learn how to use models to pull out known patterns in real data.
  Once you have recognised an important pattern it's useful to make it explicit in a model, because then you can more easily see the subtler signals that remain.
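The "make the obvious pattern explicit, then look at what remains" idea can be sketched numerically (a hypothetical example, assuming NumPy; the data and names are invented for illustration). Here the data has a strong linear trend plus a subtle periodic wiggle; fitting and removing the trend leaves residuals in which the wiggle stands out:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 300)
# A strong linear trend hides a subtler sine-wave signal plus a little noise.
y = 3 * x + np.sin(x) + rng.normal(0, 0.1, x.size)

# Make the obvious pattern (the trend) explicit in a model...
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# ...and the subtler signal is easy to see in what remains.
corr = np.corrcoef(residuals, np.sin(x))[0, 1]
print(round(corr, 2))  # residuals correlate strongly with the hidden wave
```

In practice you would plot the residuals rather than compute a correlation, but the point is the same: subtracting the dominant pattern makes the next-largest pattern visible.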
You can use an observation as many times as you like for exploration, but you can only use it once for confirmation.
As soon as you use an observation twice, you've switched from confirmation to exploration.
This is necessary because to confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis.
Otherwise you will be overly optimistic.
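One common way to enforce this independence is to partition the data once, up front, before any exploration begins (a minimal sketch, assuming NumPy; the 60/40 split and all names are invented for illustration, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                          # pretend we have 100 observations
indices = rng.permutation(n)     # shuffle once, before looking at anything

explore_idx = indices[:60]       # browse these as often as you like
confirm_idx = indices[60:]       # touch these once, to test the final hypothesis

# The two sets share no observations, so a hypothesis generated on one
# can be confirmed on data that played no part in generating it.
assert set(explore_idx).isdisjoint(confirm_idx)
print(len(explore_idx), len(confirm_idx))  # 60 40
```

The discipline lives in the workflow, not the code: once an observation in the confirmation set has been used to test a hypothesis, using it again turns the analysis back into exploration.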
There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading.
(Note that even when doing confirmatory modelling, you will still need to do EDA. If you don't do any EDA you will remain blind to the quality problems with your data.)