Edits the Part intro for explore

This commit is contained in:
Garrett 2016-05-19 17:06:22 -04:00
parent 7cfbd54eed
commit 725ae26de0
1 changed files with 8 additions and 14 deletions

@@ -7,13 +7,7 @@ library(ggplot2)
library(dplyr)
```
If you are like most humans, your brain isn't built to process tables of raw data. Instead, it is very good at analyzing images and comparing summary statistics. This part of the book will show you the best ways to visualize and transform your data to make discoveries, a process known as Exploratory Data Analysis (EDA).
## Why read this Part?
You may be surprised to learn that your mind can only attend to a few pieces of new information at a time. According to cognitive scientists, the human working memory can only handle four to seven novel values at once, which creates a bottleneck when you process information.
You probably do not notice this bottleneck in your day-to-day life, but it becomes a big deal when you work with data. The bottleneck makes it difficult to discover patterns in your raw data. To discover even a simple pattern, you must consider many values _at the same time_, which is difficult to do. For example, a simple pattern exists between $X$ and $Y$ in the table below, but it is very difficult to spot.
If you are like most humans, your brain is not designed to work with raw data. The working memory can only attend to a few values at a time, which makes it difficult to discover patterns in raw data. For example, can you spot the striking relationship between $X$ and $Y$ in the table below?
```{r data, echo=FALSE}
x <- rep(seq(0.2, 1.8, length = 5), 2) + runif(10, -0.15, 0.15)
@@ -25,16 +19,16 @@ order <- sample(1:10)
knitr::kable(round(data.frame(X = X[order], Y = Y[order]), 2))
```
While your mind may stumble over raw data, you can easily process visual information. Your brain processes visual information in a different (and much wider) channel than it processes symbolic information, like words and numbers. As a result, the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values above fall on a circle.
While your mind may stumble over raw data, you can easily process visual information. Your mind contains a visual processing system that has been fine-tuned by millions of years of evolution. As a result, the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values above fall on a circle.
```{r echo=FALSE}
ggplot2::qplot(X, Y) + ggplot2::coord_fixed(ylim = c(-2.5, 0.5), xlim = c(-0.5, 2.5))
```{r echo=FALSE, dependson="data"}
ggplot2::qplot(X, Y) + ggplot2::coord_fixed(ylim = c(-2.5, 2.5), xlim = c(-2.5, 2.5))
```
Visualization works because your mind contains a visual processing system that has been fine-tuned by millions of years of evolution. However, visualization is not the only way to comprehend data.
Visualization works because your brain processes visual information in a different (and much wider) channel than it processes symbolic information, like words and numbers. However, visualization is not the only way to comprehend data.
You can also comprehend your data if you transform it into a small set of summary values. You can easily attend to a few summary values, which lets you absorb important information about the data. This is why it feels natural to work with things like averages, e.g. how tall is the average basketball player? An average is a single number that you can attend to. Although averages are quite popular, you can also compare data sets on other summary values, such as maximums, minimums, medians, and so on.
You can also comprehend data by transforming it. You can easily attend to a small set of summary values, which lets you absorb important information about the data. This is why it feels natural to work with things like averages, maximums, minimums, medians, and so on.
Another way to summarize your data is to replace it with a model, a function that describes the relationships between two or more variables. You can comprehend the important parts of a model more easily than you can attend to the raw values in your data set.
Another way to summarize your data is to replace it with a model, a function that describes the relationships between two or more variables. You can attend to the important parts of a model more easily than you can attend to the raw values in your data set.
These three tactics (visualizing, transforming, and modeling your data) are the most important tools for exploring data. This part will show you how to visualize and transform your data with R, as well as how to apply these skills to discover insights in your data. You will also learn the basics of modeling in R, which will prepare you for Part 4 of the book.
The first problem in Data Science is a cognitive problem: how can you understand your own data? In this part of the book, you'll learn how to use R to discover and understand the information contained in your data.