Polishing exploration intro

This commit is contained in:
hadley 2016-07-11 15:48:58 -05:00
parent 081f0c1e39
commit 594f663e67
1 changed files with 16 additions and 14 deletions

View File

@ -2,33 +2,35 @@
# Introduction
```{r, include = FALSE}
```{r setup, include = FALSE}
library(ggplot2)
library(dplyr)
```
If you are like most humans, your brain is not designed to work with raw data. The working memory can only attend to a few values at a time, which makes it difficult to discover patterns in raw data. For example, can you spot the striking relationship between $X$ and $Y$ in the table below?
If you are like most humans, your brain is not designed to work with raw data. Your working memory can only pay attention to a few values at a time, which makes it difficult to discover patterns in raw data. For example, can you spot the striking relationship between $X$ and $Y$ in the table below?
```{r data, echo=FALSE}
x <- rep(seq(0.2, 1.8, length = 5), 2) + runif(10, -0.15, 0.15)
X <- c(0.02, x, 1.94)
Y <- sqrt(1 - (X - 1)^2)
Y[1:6] <- -1 * Y[1:6]
Y <- Y - 1
order <- sample(1:10)
knitr::kable(round(data.frame(X = X[order], Y = Y[order]), 2))
circle <- tibble(
pi = sort(rep(seq(0.2, 1.8 * pi, length = 5), 2) + runif(10, -0.15, 0.15)),
x = cos(pi) + 1,
y = sin(pi) - 1
)
circle %>%
select(-pi) %>%
knitr::kable(digits = 2)
```
While we may stumble over raw data, we can easily process visual information. Within your mind is a visual processing system that has been fine-tuned by thousands of years of evolution. As a result, the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values above fall on a circle.
While we may stumble over raw data, we can easily process visual information. Within your mind is a powerful visual processing system fine-tuned by millions of years of evolution. As a result, often the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values fall on a circle.
```{r echo=FALSE, dependson=data}
ggplot2::qplot(X, Y) + ggplot2::coord_fixed(ylim = c(-2.5, 2.5), xlim = c(-2.5, 2.5))
ggplot(circle, aes(x, y)) +
geom_point() +
coord_fixed()
```
Visualization works because your brain processes visual information in a different (and much wider) channel than it processes symbolic information, like words and numbers. However, visualization is not the only way to comprehend data.
You can also comprehend data by transforming it. You can easily attend to a small set of summary values, which lets you absorb important information about the data. This is why it feels natural to work with things like averages, maximums, minimums, medians, and so on.
Another way to summarize your data is to replace it with a model, a function that describes the relationships between two or more variables. You can attend to the important parts of a model more easily than you can attend to the raw values in your dataset.
The first problem in Data Science is a cognitive problem: how can you understand your own data? In this part of the book, you'll learn how to use R to discover and understand the information contained in your data.
Together, visualisation and transformation form a powerful set of tools known as exploratory data analysis, or EDA for short. In this part of the book, you'll learn R through EDA, mastering the minimal set of skills to start gaining insight from your data.