Consistent part intros

This commit is contained in:
hadley 2016-07-19 09:39:00 -05:00
parent 58ac405a48
commit 464a643bef
12 changed files with 60 additions and 99 deletions

View File

@ -1,9 +1,9 @@
```{r include=FALSE}
knitr::opts_chunk$set(fig.height = 2)
```
# Exploratory Data Analysis (EDA)
## Introduction
This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call Exploratory Data Analysis, or EDA for short. EDA is an iterative cycle that involves:
@ -552,3 +552,5 @@ ggplot(data = diamonds2, mapping = aes(x = carat, y = resid)) +
ggplot(data = diamonds2, mapping = aes(x = cut, y = resid)) +
geom_boxplot()
```
Modelling is important because once you have recognised a pattern, a model allows you to make that pattern quantitative and precise, and partition it out from what remains. That supports a powerful iterative approach where you identify a pattern with visualisation, then subtract it out with a model, allowing you to see the subtler trends that remain. I deliberately chose not to teach modelling yet, because understanding what models are and how they work is easiest once you have some other tools in hand: data wrangling and programming.
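In code, that cycle looks roughly like this (a sketch, not the chapter's exact workflow; it assumes ggplot2, which supplies the `diamonds` data, is loaded as elsewhere in the chapter):

```{r}
# Identify a pattern visually, capture it with a model, then subtract it
# to see what remains. (A sketch, not the chapter's own code.)
mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds_resid <- diamonds
diamonds_resid$resid <- residuals(mod)

# With the strong carat-price relationship removed, subtler structure
# (like the effect of cut) becomes visible:
ggplot(data = diamonds_resid, mapping = aes(x = cut, y = resid)) +
  geom_boxplot()
```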

View File

@ -2,6 +2,12 @@
# Introduction
By the successful completion of a data science project you will have built up a good understanding of what is going on with the data. It doesn't matter how brilliant your understanding is unless you can communicate it to others. You will need to share your work in a way that your audience can understand. Your audience might be fellow scientists who will want to reproduce the work, non-scientists who will want to understand your findings in plain terms, or yourself (in the future) who will be thankful if you make your work easy to re-learn and recreate. __Part 5__ discusses communication, and how you can use R Markdown to generate reproducible artefacts that combine prose and code.
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-communicate.png")
```
Reproducible, literate code is the data science equivalent of the Scientific Report (i.e., Intro, Methods and Materials, Results, Discussion).
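For a flavour of what that looks like, here is a minimal R Markdown source file (an illustrative sketch, not an example from the book):

````
---
title: "My analysis"
output: html_document
---

Prose describing the question and the findings goes here.

```{r}
summary(ggplot2::diamonds$carat)
```
````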
Recommendations for learning more about communication:

View File

@ -2,6 +2,12 @@
# Introduction
The goal of the first part of this book is to get you up to speed with the basic tools of data exploration as quickly as possible:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-explore.png")
```
```{r setup, include = FALSE}
library(ggplot2)
library(dplyr)
@ -21,7 +27,7 @@ circle %>%
knitr::kable(digits = 2)
```
While we may stumble over raw data, we can easily process visual information. Visualization works because your brain processes visual information in a different (and much wider) channel than it processes symbolic information, like words and numbers. Within your brain is a powerful visual processing system fine-tuned by millions of years of evolution. As a result, often the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values fall on a circle.
```{r echo=FALSE, dependson = data, fig.asp = 1, out.width = "30%", fig.width = 3}
ggplot(circle, aes(x, y)) +
@ -29,8 +35,17 @@ ggplot(circle, aes(x, y)) +
coord_fixed()
```
However, visualization is not the only way to comprehend data. You can also comprehend data by transforming it. You can easily attend to a small set of summary values, which lets you absorb important information about the data. This is why it feels natural to work with things like averages, maximums, minimums, medians, and so on.
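For example, a quick sketch with dplyr (assuming the `circle` data from the chunk above):

```{r}
# Collapse many raw values into a few summaries you can attend to at once:
circle %>%
  summarise(
    n      = n(),
    mean_x = mean(x),
    min_y  = min(y),
    max_y  = max(y)
  )
```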
Together, visualisation and transformation form a powerful set of tools known as exploratory data analysis, or EDA for short. In this part of the book, you'll learn R through EDA, mastering the minimal set of skills to start gaining insight from your data. In the following chapters you will:

* Dive into ggplot2 in [data visualisation], learning powerful
  and general techniques for turning raw data into visual insights.

* Visualisation alone is typically not enough, so in [data transformation]
  you'll learn the key verbs that allow you to select important variables,
  filter out key observations, and create new variables and summaries.

* In [exploratory data analysis], you'll combine visualisation and
  transformation with your curiosity and scepticism to ask and answer
  interesting questions about data.
Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet. We'll come back to modelling in [model], once you're better equipped with more data wrangling and programming tools.

View File

@ -2,6 +2,12 @@
# Introduction
Now that you are equipped with powerful programming tools, we can finally return to modelling. You'll use your new tools of data wrangling and programming to fit many models and understand how they work. The focus of this book is on exploration, not confirmation or formal inference. But you'll learn a few basic tools that help you understand the variation within your models.
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-model.png")
```
The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in). Here we only cover "predictive" models, which, as the name suggests, generate predictions. There is another type of model that we're not going to discuss: "data discovery" models. These models don't make predictions, but instead help you discover interesting relationships within your data.
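For instance, a minimal sketch with base R's built-in `mtcars` data (not the book's example): a linear model reduces a whole dataset to just two numbers.

```{r}
# 32 cars summarised by an intercept and a slope, capturing the "signal"
# that heavier cars get fewer miles per gallon:
mod <- lm(mpg ~ wt, data = mtcars)
coef(mod)
```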
This book is not going to give you a deep understanding of the mathematical theory that underlies models. It will, however, build your intuition about how statistical models work, and give you a family of useful tools that allow you to use models to better understand your data:

View File

@ -2,9 +2,15 @@
# Introduction
In this part of the book, you'll enrich your programming skills. Programming is a cross-cutting skill needed for all data science work. You must use a computer; you cannot do it in your head, nor with paper and pencil. And to work efficiently, you will need to know how to program in a computer language, such as R.
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-program.png")
```
Code is a tool of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you. You want to write clear code so that future-you doesn't curse present-you when you look at a project again after several months have passed.
Improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've struggled to solve in the past.
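A flavour of what removing duplication buys you (a toy sketch, hypothetical code rather than the book's):

```{r}
df <- data.frame(a = rnorm(10), b = rnorm(10))

# Duplicated: the reader must diff two near-identical lines to spot
# that only the column changes (and typos hide easily here).
df$a <- (df$a - min(df$a)) / (max(df$a) - min(df$a))
df$b <- (df$b - min(df$b)) / (max(df$b) - min(df$b))

# Deduplicated: the operation gets a name, and only the input varies.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
```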
In the following chapters, you'll learn important programming skills:

View File

@ -2,103 +2,29 @@
# Introduction
In this part of the book, you'll learn about data wrangling, the art of getting your data into R in a useful form. Data wrangling encompasses three main pieces:

```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-wrangle.png")
```

* In [data import], you'll learn the art of data import: how to get your data
  off of disk and into R.

* In [tidy data], you'll learn about tidy data, a consistent way of storing your
  data that makes transformation, visualisation, and modelling easier.

* You've already learned the basics of data transformation. In this part of the
  book we'll dive deeper into tools useful for specific types of data:

    * [Dates and times] will give you the key tools for working with
      dates, and date-times.

    * [Strings] will introduce regular expressions, a powerful tool for
      manipulating strings.

    * [Relational data] will give you tools for working with multiple
      interrelated datasets.

Before we get to those chapters, we'll take a brief detour to discuss "tibbles" in more detail, in [tibbles].

Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier. R is an old language, and some things that were true 10 or 20 years ago no longer apply. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the tibble package, which provides opinionated data frames that make working in the tidyverse a little easier. You can learn more about tibbles in the accompanying vignette: `vignette("tibble")`.

```{r setup}
library(tibble)
```

## Creating tibbles {#tibbles}

The majority of the functions that you'll use in this book already produce tibbles. If you're working with functions from other packages, you might need to coerce a regular data frame to a tibble. You can do that with `as_tibble()`:

```{r}
as_tibble(iris)
```

`as_tibble()` knows how to convert data frames, lists (provided the elements are equal-length vectors), matrices, and tables.

You can create a new tibble from individual vectors with `tibble()`:
```{r}
tibble(x = 1:5, y = 1, z = x ^ 2 + y)
```
`tibble()` automatically recycles inputs of length 1, and you can refer to variables that you just created. Compared to `data.frame()`, `tibble()` does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates `row.names()`.
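A quick sketch of that contrast (note: the factor conversion reflects `data.frame()`'s `stringsAsFactors = TRUE` default in this era of R):

```{r}
df <- data.frame(`my var` = c("a", "b"))
names(df)          # name quietly changed to "my.var"
class(df$my.var)   # character vector converted to a factor

tb <- tibble(`my var` = c("a", "b"))
names(tb)          # name preserved as "my var"
class(tb$`my var`) # still a character vector
```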
Another way to create a tibble is with `frame_data()`, which is customised for data entry in R code. Column headings are defined by formulas (`~`), and entries are separated by commas:
```{r}
frame_data(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
)
```
## Tibbles vs. data frames
There are two main differences in the usage of a data frame vs. a tibble: printing and subsetting.
### Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from `str()`:
```{r}
tibble(
a = lubridate::now() + runif(1e3) * 60,
b = lubridate::today() + runif(1e3),
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
```
You can control the default appearance with options:
* `options(tibble.print_max = n, tibble.print_min = m)`: if there are more
  than `n` rows, print only `m` rows. Use `options(tibble.print_min = Inf)`
  to always show all rows.
* `options(tibble.width = Inf)` will always print all columns, regardless
of the width of the screen.
You can see a complete list of options by looking at the package help: `package?tibble`.
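For example (a sketch; the option names are those listed above):

```{r, eval = FALSE}
# Preview up to 15 rows before falling back to the compact view:
options(tibble.print_max = 15, tibble.print_min = 10)

# Never truncate columns to fit the screen width:
options(tibble.width = Inf)
```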
### Subsetting
Tibbles are stricter about subsetting. If you try to access a variable that does not exist, you'll get a warning. Unlike data frames, tibbles do not use partial matching on column names:
```{r}
df <- data.frame(
abc = 1:10,
def = runif(10),
xyz = sample(letters, 10)
)
tb <- as_tibble(df)
df$a
tb$a
```
Tibbles clearly delineate `[` and `[[`: `[` always returns another tibble, `[[` always returns a vector.
```{r}
# With data frames, [ sometimes returns a data frame, and sometimes returns
# a vector
df[, 1]
# With tibbles, [ always returns another tibble
tb[, 1]
# To extract a single element, you should always use [[
tb[[1]]
```
## Interacting with legacy code
Some older functions don't work with tibbles because they expect `df[, 1]` to return a vector, not a data frame. If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a data frame:
```{r}
class(as.data.frame(tb))
```