Writing variation chapter.

This commit is contained in:
Garrett 2016-05-02 17:11:00 -04:00
parent 2e37431e7d
commit 1fe1ca2015
1 changed files with 72 additions and 13 deletions


@@ -1,35 +1,81 @@
# Variation
# Exploratory Data Analysis (EDA)
```{r, include = FALSE}
library(ggplot2)
```
If you are like most humans, your brain isn't built to process tables of raw data. Instead, you are more likely to make discoveries if you visualize or transform your data. This chapter will show you the best ways to work with your data to make discoveries, a process known as Exploratory Data Analysis (EDA).
If you measure any quantity twice, and precisely enough, you will get two different results. This is true even for quantities that should be constant, like the speed of light (below).
## The challenge of data
This phenomenon, called _variation_, is the beginning of data science. To understand anything you must decipher patterns of variation. But variation does more than just obscure; it is also an incredibly useful tool. Patterns of variation provide evidence of causal relationships.
The human working memory can only attend to a few values at a time. This makes it difficult to discover patterns in raw data because patterns involve many values. To discover even a simple pattern, you must consider many values _at the same time_, which is difficult to do. For example, a simple pattern exists between $X$ and $Y$ in the table below, but it is very difficult to spot.
The best way to study variation is to collect data, particularly rectangular data: data that is made up of variables, observations, and values.
```{r data, echo=FALSE}
x <- rep(seq(0.2, 1.8, length = 5), 2) + runif(10, -0.15, 0.15)
X <- c(0.02, x, 1.94)
Y <- sqrt(1 - (X - 1)^2)
Y[1:6] <- -1 * Y[1:6]
Y <- Y - 1
order <- sample(1:10)
knitr::kable(round(data.frame(X = X[order], Y = Y[order]), 2))
```
While your mind may stumble over raw data, you can easily process visual information. Within your mind is a visual processing system that has been fine-tuned by millions of years of evolution. As a result, the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values above fall on a circle.
```{r echo=FALSE, dependson="data"}
ggplot2::qplot(X, Y) + ggplot2::coord_fixed(ylim = c(-2.5, 2.5), xlim = c(-2.5, 2.5))
```
Visualization works because it bypasses the bottleneck in your working memory. Your brain processes visual information in a different (and much wider) channel than it processes symbolic information, like words and numbers. However, you can also comprehend data in a second way.
You can comprehend data if you reduce it to a small set of summary values that you can attend to with your working memory. This is why it feels natural to work with averages, e.g. how tall is the average basketball player? How educated is the average politician? An average is a single number that you can attend to. Although averages are quite popular, you can also compare data sets on other summary values, such as the maximum, minimum, median, and so on. Another way to summarize your data is to replace it with a model, a function that describes the relationship between two or more variables.
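As a minimal sketch of reducing data to a few summary values, the block below computes four summaries of Michelson's speed-of-light measurements, which ship with R as the `morley` data set. The choice of data set and of these four summaries is only an illustration, not a recommendation from the chapter.

```r
# Michelson's 1879 measurements are stored in `morley`; Speed is recorded
# in km/sec with 299,000 subtracted, so add it back.
speed <- morley$Speed + 299000

# Reduce 100 values to four numbers that fit in working memory.
summaries <- c(
  mean   = mean(speed),
  median = median(speed),
  min    = min(speed),
  max    = max(speed)
)
summaries
```

Each summary trades detail for simplicity: the four numbers are easy to hold in mind, but they no longer show how the individual measurements are spread between the minimum and maximum.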
These two tactics, visualizing and summarizing your data, are the main tools of Exploratory Data Analysis. Before we look at how to visualize and summarize your data, let's consider what types of information you can hope to find. Data carries two types of useful information: information about _variation_ and information about _covariation_.
Let's define some terms that will make these concepts easier to describe:
* A _variable_ is a quantity, quality, or property that you can measure.
* A _value_ is the state of a variable when you measure it. The value of a
variable may change from measurement to measurement.
* A _value_ is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
* An _observation_ is a set of measurements you make under similar conditions (you usually make all of the measurements at the same time on the same object). Observations contain values that you measure on different variables.
* An _observation_ is a set of measurements you make under similar conditions
(usually all at the same time or on the same object). Observations contain
values that you measure on different variables.
## Variation
Rectangular data provides a clear record of variation, but that doesn't mean it is easy to understand. The human mind isn't built to process tables of data. This section will show you the best ways to comprehend your own data, which is the most important challenge of data science.
Variation is the tendency for the values of a variable to change from measurement to measurement.
```{r, echo = FALSE}
Variation is easy to encounter in real life: if you measure any continuous quantity twice, and precisely enough, you will get two different results. Since every measurement includes a small amount of error, this is true even if you measure quantities that should be constant, like the speed of light (below).
Discrete and qualitative variables can also vary if you measure across different subjects (e.g. the eye colors of different people) or at different times (e.g. the energy levels of an electron).
```{r, variation, echo = FALSE}
mat <- as.data.frame(matrix(morley$Speed + 299000, ncol = 10))
knitr::kable(mat, caption = "*The speed of light is* the *universal constant, but variation obscures its value, here demonstrated by Albert Michelson in 1879. Michelson measured the speed of light 100 times and observed 30 different values (in km/sec).*", col.names = rep("", ncol(mat)))
knitr::kable(mat, caption = "*The speed of light is a universal constant, but variation obscures its value. In 1879, Albert Michelson measured the speed of light 100 times and observed 30 different values (in km/sec).*", col.names = rep("", ncol(mat)))
```
Variation is a source of uncertainty. Since values vary from measurement to measurement, you cannot assume that what you measure in one context will be true in another context.
Variation can also be a tool. Every variable exhibits a pattern of variation. If you comprehend the pattern, you can determine which values of the variable are likely to occur, which are unlikely to occur, and which are impossible.
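One way to look at a variable's pattern of variation is to count how often each value occurs and then plot those counts. The sketch below does this for Michelson's measurements in the built-in `morley` data set. The bin width of 50 km/sec is an arbitrary assumption chosen for illustration.

```r
library(ggplot2)

# Restore the 299,000 km/sec offset that `morley` subtracts from Speed.
speed <- morley$Speed + 299000

# How many distinct values appear among the measurements?
n_distinct <- length(unique(speed))

# Tabulate how often each value occurs, most frequent first.
counts <- sort(table(speed), decreasing = TRUE)
head(counts)

# A histogram shows which values are common and which are rare.
p <- ggplot(data.frame(speed = speed), aes(x = speed)) +
  geom_histogram(binwidth = 50)
p
```

Values in the tall bars are the ones most likely to turn up in a new measurement, while values far from any bar are unlikely.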
## Covariation
Covariation occurs when the values of two or more variables vary in systematic ways.
You can understand covariation by picturing the growth charts that doctors use with young children (below). The ages and heights of young children covary since a child is likely to be born small and then to grow taller. As a result, a large value of height is unlikely to occur without being associated with a large value of age (and vice versa). In fact, the covariation between age and height is so regular that a doctor can tell if something has gone wrong by comparing the two.
!["Height covaries with age in young children. Chart taken from http://www.cdc.gov/growthcharts"](images/growth-chart.png)
Webs of covariation can be quite complex. Multiple variables can covary together, as income, education, and home ownership do. Two variables can also covary in an inverse relationship, as unemployment and presidential approval ratings do: approval ratings tend to be low at times when unemployment is high, and vice versa.
If variation creates uncertainty, covariation dispels it. You can make an accurate guess about an unobserved variable if you observe the values of variables that it covaries with.
Covariation is also the first clue that a causal relationship may exist between two variables (or that a hidden causal variable may exist that affects the two).
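To make the idea of guessing one variable from another concrete, here is a small sketch. It does not use the growth-chart data. Instead it assumes the built-in `women` data set (average heights and weights of 15 American women) as a stand-in, and a straight-line model as the simplest possible summary of the covariation.

```r
# `women` is used only as a convenient example of two variables that covary.
# height is in inches, weight is in pounds.
r <- cor(women$height, women$weight)
r

# A simple linear model summarizes how weight changes with height.
fit <- lm(weight ~ height, data = women)

# Use the covariation to guess the weight that goes with a height of 66 in.
guess <- predict(fit, newdata = data.frame(height = 66))
guess
```

A strong correlation like this one says that the two variables move together. It does not say why: the relationship could be causal, or both variables could depend on something unobserved.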
## Understanding Variation
### Distributions describe variation
### Visualizing distributions
***
@@ -272,10 +318,14 @@ Useful arguments that apply to `geom_dotplot()`
In practice, I find that `geom_dotplot()` works best with small data sets and takes a lot of tweaking of the binwidth, dotsize, and stackratio arguments to fit the dots within the graph (the stack heights depend entirely on the organization of the dots, which renders the y axis ambiguous). That said, dotplots can be useful as a learning aid. They provide an intuitive representation of a histogram.
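The following is a minimal sketch of the kind of tweaking described above, using the small built-in `mtcars` data set. The specific values of `binwidth`, `dotsize`, and `stackratio` are assumptions picked by eye. Expect to adjust them for your own data.

```r
library(ggplot2)

# mtcars has only 32 rows, so each dot stays visible.
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_dotplot(
    binwidth   = 1.5,  # width of each bin, in units of mpg
    dotsize    = 0.8,  # dot diameter relative to binwidth
    stackratio = 1.2   # how closely dots are stacked (1 = touching)
  ) +
  # The y axis of a dotplot is not meaningful, so hide its breaks.
  scale_y_continuous(NULL, breaks = NULL)
p
```

Hiding the y axis sidesteps the ambiguity mentioned above: the reader counts dots instead of reading a scale.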
### Compare Distributions
### Summarizing distributions
## Understanding Covariation
### Visualizing covariation
### Visualize Covariation
### Compare Distributions
#### Visualize functions between two variables
Distributions provide useful information about variables, but the information is general. By itself, a distribution cannot tell you how the value of a variable in one set of circumstances will differ from the value of the same variable in a different set of circumstances.
@@ -605,3 +655,12 @@ There are two ways to add three (or more) variables to a two dimensional plot. Y
`ggplot2` provides three geoms that are designed to display three variables: `geom_raster()`, `geom_tile()` and `geom_contour()`. These geoms generalize `geom_bin2d()` and `geom_density2d()` to display a third variable instead of a count or a density.
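As a brief sketch of how these three geoms map a third variable, the block below uses `faithfuld`, a data set bundled with ggplot2 that holds a precomputed 2d density estimate on a regular grid of `waiting` and `eruptions` values. The data set is an assumption made for illustration. Any data frame with two position variables on a grid and a third variable would work the same way.

```r
library(ggplot2)

# Two position variables; the third variable is `density`.
base <- ggplot(faithfuld, aes(x = waiting, y = eruptions))

# geom_raster() and geom_tile() show the third variable as fill color.
p_raster <- base + geom_raster(aes(fill = density))
p_tile   <- base + geom_tile(aes(fill = density))

# geom_contour() shows the third variable as contour lines via the z aesthetic.
p_contour <- base + geom_contour(aes(z = density))

p_raster
p_contour
```

`geom_raster()` is a faster special case of `geom_tile()` for tiles that are all the same size, which is the case on a regular grid like this one.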
`geom_raster()` and `geom_tile()`
### Summarizing covariation: statistics
### Summarizing covariation: models
Models have one of the richest literatures on how to select and test them, so we've reserved them for their own section. Modelling brings together the various components of data science more than any other data science task, so we'll postpone its coverage until you can program and wrangle data, two skills that will help you select models.