Consistent chapter intro layout

hadley 2016-07-19 08:01:50 -05:00
parent 1a36e04a84
commit 8faebedf9f
12 changed files with 113 additions and 111 deletions

@ -1,14 +1,15 @@
# Vectors
```{r setup, include = FALSE}
library(purrr)
library(dplyr)
```
## Introduction
So far this book has focussed on data frames and packages that work with them. But as you start to write your own functions, and dig deeper into R, you need to learn about vectors, the objects that underpin data frames. If you've learned R in a more traditional way, you're probably already familiar with vectors, as most R resources start with vectors and work their way up to data frames. I think it's better to start with data frames because they're immediately useful, and then work your way down to the underlying components.
Vectors are particularly important as it's easier to learn to write functions that work with vectors, rather than data frames. The technology that lets ggplot2, tidyr, dplyr, etc. work with data frames is considerably more complex and not currently standardised. While I'm currently working on a new standard that will make life much easier, it's unlikely to be ready in time for this book.
### Prerequisites
The focus of this chapter is on base R data structures, so you don't need any extra packages to be loaded.
## Vector overview
There are two types of vectors:
@ -141,7 +142,7 @@ Here I wanted to mention one important feature of the underlying string implemen
x <- "This is a reasonably long string."
pryr::object_size(x)
y <- rep(x, 1000)
y <- rep(x, 1000)
pryr::object_size(y)
```
@ -286,7 +287,7 @@ c(x = 1, y = 2, z = 4)
Or after the fact with `purrr::set_names()`:
```{r}
1:3 %>% set_names(c("a", "b", "c"))
purrr::set_names(1:3, c("a", "b", "c"))
```
Named vectors are most useful for subsetting, described next.
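As a minimal base R sketch of why names help with subsetting (the vector `x` and its values are made up for illustration, and `setNames()` from the stats package stands in here for `purrr::set_names()`):

```r
# Hypothetical example vector; setNames() is a base R analogue of purrr::set_names()
x <- setNames(1:3, c("a", "b", "c"))

# Subset by name rather than by position
picked <- x[c("a", "c")]
picked

# The names travel with the selected elements
names(picked)
```

Subsetting by name keeps working if the elements are later reordered, which is not true of positional indices.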

@ -1,15 +1,19 @@
# Dates and times
## Introduction
This chapter will show you how to work with dates and times in R. Dates and times follow their own rules, which can make working with them difficult. For example, dates and times are ordered, like numbers, but the timeline is not as orderly as the number line. The timeline repeats itself, and has noticeable gaps due to Daylight Saving Time, leap years, and leap seconds. Datetimes also rely on ambiguous units: How long is a month? How long is a year? Time zones give you another headache when you work with dates and times. The same instant of time will have different "names" in different time zones.
This chapter will focus on R's __lubridate__ package, which makes it much easier to work with dates and times in R. You'll learn the basic date time structures in R and the lubridate functions that make working with them easy. We will also rely on some of the packages that you already know how to use, so load this entire set of packages to begin:
### Prerequisites
This chapter will focus on R's __lubridate__ package, which makes it much easier to work with dates and times in R. You'll learn the basic date time structures in R and the lubridate functions that make working with them easy. We will use `nycflights13` for practice data, and use some packages for EDA.
```{r message = FALSE}
library(lubridate)
```{r message = FALSE, warning = FALSE}
library(nycflights13)
library(dplyr)
library(stringr)
library(ggplot2)
library(lubridate)
```
## Parsing times
@ -33,12 +37,19 @@ With a little work, we can also create arrival times for each flight in flights.
```{r}
(datetimes <- datetimes %>%
mutate(arrival = make_datetime(year = year, month = month, day = day,
hour = str_sub(arr_time, end = -3),
min = str_sub(arr_time, start = -2))) %>%
mutate(arrival = make_datetime(
year = year,
month = month,
day = day,
hour = arr_time %/% 100,
min = arr_time %% 100
)) %>%
filter(!is.na(departure), !is.na(arrival)) %>%
select(departure, arrival, dep_delay, arr_delay, carrier, tailnum,
flight, origin, dest, air_time, distance))
select(
departure, arrival, dep_delay, arr_delay, carrier, tailnum,
flight, origin, dest, air_time, distance
)
)
```
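The `%/%` and `%%` arithmetic used in the newer version of this chunk can be tried on its own. The sketch below assumes times are stored as integers in HHMM form; the `arr_time` values are invented for illustration and are not taken from `flights`:

```r
# Hypothetical arrival times stored as integers in HHMM form
arr_time <- c(830L, 1504L, 2359L, 5L)

hour   <- arr_time %/% 100  # integer division keeps the hours
minute <- arr_time %% 100   # the remainder keeps the minutes

data.frame(arr_time, hour, minute)
```

Unlike the `str_sub()` version, the arithmetic yields numbers directly rather than character strings, so the results can be passed to `make_datetime()` without conversion.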
To parse character strings as dates, identify the order in which the year, month, and day appear in your dates. Now arrange "y", "m", and "d" in the same order. This is the name of the function in lubridate that will parse your dates. For example,

@ -1,9 +1,7 @@
```{r setup, include = FALSE}
library(stringr)
```
# Functions
## Introduction
One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks. Writing a function has three big advantages over using copy-and-paste:
1. You drastically reduce the chances of making incidental mistakes when
@ -19,6 +17,10 @@ Writing good functions is a lifetime journey. Even after using R for many years
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.
### Prerequisites
The focus of this chapter is on writing functions in base R, so you won't need any extra packages.
## When should you write a function?
You should consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). For example, take a look at this code. What does it do?
@ -576,7 +578,7 @@ Many functions in R take an arbitrary number of inputs:
```{r}
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
str_c("a", "b", "c", "d", "e", "f")
stringr::str_c("a", "b", "c", "d", "e", "f")
```
How do these functions work? They rely on a special argument: `...` (pronounced dot-dot-dot). This special argument captures any number of arguments that aren't otherwise matched.
@ -584,13 +586,13 @@ How do these functions work? They rely on a special argument: `...` (pronounced
It's useful because you can then send those `...` on to another function. This is a useful catch-all if your function primarily wraps another function. For example, I commonly create these helper functions that wrap around `paste()`:
```{r}
commas <- function(...) str_c(..., collapse = ", ")
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
rule <- function(..., pad = "-") {
title <- paste0(...)
width <- getOption("width") - nchar(title) - 5
cat(title, " ", str_dup(pad, width), "\n", sep = "")
cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
```
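For readers without stringr installed, the same forwarding pattern can be sketched in base R alone. The helper names `commas_base()` and `count_args()` are hypothetical and are not part of the chapter:

```r
# Hypothetical helper: forwards everything captured by ... to paste()
commas_base <- function(...) paste(..., collapse = ", ")
commas_base(letters[1:3])

# list(...) materialises the captured arguments so they can be inspected
count_args <- function(...) length(list(...))
count_args(1, "b", TRUE)
```

Wrapping `...` in `list()` is a common way to look at what was passed before handing it on to another function.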

@ -1,19 +1,6 @@
# Handling hierarchy {#hierarchy}
```{r setup, include=FALSE}
library(purrr)
```
<!--
## Warm ups
* What does this for loop do?
* How is a data frame like a list?
* What does `mean()` mean? What does `mean` mean?
* How do you get help about the $ function? How do you normally write
`[[`(mtcars, 1) ?
* Argument order
-->
## Introduction
The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:
@ -24,6 +11,14 @@ The map functions apply a function to every element in a list. They are the most
* You can flip levels of the hierarchy with the transpose function.
### Prerequisites
This chapter focusses mostly on purrr. As well as the tools for iteration that you've already learned about, purrr also provides a number of tools specifically designed to manipulate hierarchical data.
```{r setup}
library(purrr)
```
## Extracting deeply nested elements
Sometimes you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:

@ -1,10 +1,5 @@
# Iteration
```{r setup, include=FALSE}
library(purrr)
library(stringr)
```
In [functions], we talked about how important it is to reduce duplication in your code. Reducing code duplication has three main benefits:
1. It's easier to see the intent of your code, because your eyes are
@ -39,6 +34,14 @@ The goal of using purrr functions instead of for loops is to allow you break com
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
### Prerequisites
Once you've mastered the for loops provided by base R, you'll learn some of the powerful programming tools provided by purrr.
```{r setup}
library(purrr)
```
## For loops
Imagine we have this simple data frame:
@ -126,7 +129,7 @@ That's all there is to the for loop! Now is a good time to practice creating som
```{r}
out <- ""
for (x in letters) {
out <- str_c(out, x)
out <- stringr::str_c(out, x)
}
x <- sample(100)
@ -842,7 +845,7 @@ library(ggplot2)
plots <- mtcars %>%
split(.$cyl) %>%
map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- str_c(names(plots), ".pdf")
paths <- stringr::str_c(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
```

@ -1,12 +1,5 @@
# Model assessment
```{r setup-model, include=FALSE}
library(purrr)
library(tibble)
set.seed(1014)
options(digits = 3)
```
In this chapter, you'll turn the tools of multiple models towards model assessment: learning how the model performs when given new data. So far we've focussed on models as tools for description, using models to help us understand the patterns in the data we have collected so far. But ideally a model will do more than just describe what we have seen so far - it will also help predict what will come next.
In other words, we want a model that doesn't just perform well on the sample, but also accurately summarises the underlying population.

@ -1,13 +1,3 @@
```{r setup, include = FALSE}
library(broom)
library(ggplot2)
library(dplyr)
library(lubridate)
library(tidyr)
library(nycflights13)
library(modelr)
```
# Model building
In the previous chapter you learned how some basic models worked, and learned some basic tools for understanding what a model is telling you about your data. In this chapter, we're going to talk more about the model building process: how you start from nothing, and end up with a good model.
@ -43,6 +33,16 @@ For very large and complex datasets this is going to be a lot of work. There are
### Prerequisites
```{r setup, include = FALSE}
library(broom)
library(ggplot2)
library(dplyr)
library(lubridate)
library(tidyr)
library(nycflights13)
library(modelr)
```
```{r}
library(modelr)

@ -1,16 +1,20 @@
# Pipes
```{r, include = FALSE}
library(dplyr)
diamonds <- ggplot2::diamonds
```
Pipes let you transform the way you call deeply nested functions. Using a pipe doesn't affect what the code does; behind the scenes it is run in (almost) the exact same way. What the pipe does is change how _you_ write, and read, code.
You've been using the pipe for a while now, so you already understand the basics. The point of this chapter is to explore the pipe in more detail. You'll learn the alternatives that the pipe replaces, and the pros and cons of the pipe. Importantly, you'll also learn situations in which you should avoid the pipe.
The pipe, `%>%`, comes from the __magrittr__ package by Stefan Milton Bache. This package provides a handful of other helpful tools if you explicitly load it. We'll explore some of those tools to close out the chapter.
### Prerequisites
This chapter focusses on `%>%` which is normally loaded for you by packages in the tidyverse. Here we'll focus on it alone, so we'll make it available directly from magrittr. We'll also extract the `diamonds` dataset out of ggplot2 to use in some examples.
```{r setup}
library(magrittr)
diamonds <- ggplot2::diamonds
```
## Piping alternatives
The point of the pipe is to help you write code in a way that is easier to read and understand. To see why the pipe is so useful, we're going to explore a number of ways of writing the same code. Let's use code to tell a story about a little bunny named Foo Foo:
@ -50,12 +54,11 @@ The main downside of this form is that it forces you to name each intermediate e
You may worry that this form creates many intermediate copies of your data and takes up a lot of memory, but that's not necessary. First, worrying about memory is not a useful way to spend your time: worry about it when it becomes a problem (i.e. you run out of memory), not before. Second, R isn't stupid: if you're working with data frames, R will share columns where possible. Let's take a look at an actual data manipulation pipeline where we add a new column to the `diamonds` dataset from ggplot2:
```{r}
diamonds2 <- mutate(diamonds, price_per_carat = price / carat)
diamonds2 <- dplyr::mutate(diamonds, price_per_carat = price / carat)
library(pryr)
object_size(diamonds)
object_size(diamonds2)
object_size(diamonds, diamonds2)
pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)
```
`pryr::object_size()` gives the memory occupied by all of its arguments. The results seem counterintuitive at first:
@ -70,9 +73,9 @@ In the following example, we modify a single value in `diamonds$carat`. That mea
```{r}
diamonds$carat[1] <- NA
object_size(diamonds)
object_size(diamonds2)
object_size(diamonds, diamonds2)
pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)
```
(Note that we use `pryr::object_size()` here, not the built-in `object.size()`. `object.size()` isn't quite smart enough to recognise that the columns are shared across multiple data frames.)
@ -202,10 +205,6 @@ The pipe is a powerful tool, but it's not the only tool at your disposal, and it
The pipe is provided by the magrittr package, by Stefan Milton Bache. Most of the packages you work with in this book will automatically provide `%>%` for you. You might want to load magrittr yourself if you're using another package, or you want to access some of the other pipe variants that magrittr provides.
```{r}
library(magrittr)
```
* When working with more complex pipes, it's sometimes useful to call a
function for its side-effects. Maybe you want to print out the current
object, or plot it, or save it to disk. Many times, such functions don't

@ -1,11 +1,6 @@
# Relational data
```{r setup-relation, include = FALSE}
library(dplyr)
library(nycflights13)
library(ggplot2)
library(stringr)
```
## Introduction
It's rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you're interested in. Collectively, multiple tables of data are called __relational data__ because it is the relations, not just the individual datasets, that are particularly important.
@ -23,6 +18,15 @@ To work with relational data you need verbs that work with pairs of tables. Ther
The most common place to find relational data is in a _relational_ database management system, a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to data analysis: it makes common data analysis operations easier, at the expense of making it difficult to do other things.
### Prerequisites
We will explore relational data from `nycflights13` using the two-table verbs from dplyr.
```{r setup-relation}
library(nycflights13)
library(dplyr)
```
## nycflights13 {#nycflights13-relational}
You can use the nycflights13 package to learn about relational data. nycflights13 contains four data frames that are related to the `flights` table that you used in Data Transformation:
@ -262,8 +266,8 @@ So far all the diagrams have assumed that the keys are unique. But that's not al
and a foreign key in `x`.
```{r}
x <- tibble(key = c(1, 2, 2, 1), val_x = str_c("x", 1:4))
y <- tibble(key = 1:2, val_y = str_c("y", 1:2))
x <- tibble(key = c(1, 2, 2, 1), val_x = stringr::str_c("x", 1:4))
y <- tibble(key = 1:2, val_y = stringr::str_c("y", 1:2))
left_join(x, y, by = "key")
```
@ -276,8 +280,8 @@ So far all the diagrams have assumed that the keys are unique. But that's not al
```
```{r}
x <- tibble(key = c(1, 2, 2, 3), val_x = str_c("x", 1:4))
y <- tibble(key = c(1, 2, 2, 3), val_y = str_c("y", 1:4))
x <- tibble(key = c(1, 2, 2, 3), val_x = stringr::str_c("x", 1:4))
y <- tibble(key = c(1, 2, 2, 3), val_y = stringr::str_c("y", 1:4))
left_join(x, y, by = "key")
```
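The effect of duplicate keys in both tables can also be checked without dplyr. This sketch uses base R's `merge()` with `all.x = TRUE` as a stand-in for `left_join()`; that substitution is an assumption made for illustration, and `merge()` sorts its result by key, so the row order may differ from the dplyr output:

```r
# Same keys as the chunk above, built with base R data frames
x <- data.frame(key = c(1, 2, 2, 3), val_x = paste0("x", 1:4))
y <- data.frame(key = c(1, 2, 2, 3), val_y = paste0("y", 1:4))

# all.x = TRUE keeps every row of x, like a left join
joined <- merge(x, y, by = "key", all.x = TRUE)

# Key 2 appears twice in each table, so it contributes 2 * 2 rows
nrow(joined)
table(joined$key)
```

Each duplicated key produces every combination of its matching rows, which is why the result has more rows than either input.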
@ -327,7 +331,7 @@ So far, the pairs of tables have always been joined by a single variable, and th
data frame so you can show the spatial distribution of delays. Here's an
easy way to draw a map of the United States:
```{r, include = FALSE}
```{r, eval = FALSE, include = FALSE}
airports %>%
semi_join(flights, c("faa" = "dest")) %>%
ggplot(aes(lon, lat)) +

@ -1,19 +1,26 @@
# Strings
```{r setup-strings, include = FALSE, cache = FALSE}
library(stringr)
common <- rcorpora::corpora("words/common")$commonWords
fruit <- rcorpora::corpora("foods/fruits")$fruits
sentences <- readr::read_lines("harvard-sentences.txt")
```
<!-- look at http://d-rug.github.io/blog/2015/regex.fick/, http://qntm.org/files/re/re.html -->
## Introduction
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically come as unstructured or semi-structured data. When this happens, you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead I'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter will focus on the __stringr__ package. This package provides a consistent set of functions that all work the same way and are easier to learn than the base R equivalents. We'll also take a brief look at the __stringi__ package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
### Prerequisites
In this chapter you'll use the stringr package to manipulate strings.
```{r setup, cache = FALSE}
library(stringr)
# To be moved into stringr
common <- rcorpora::corpora("words/common")$commonWords
fruit <- rcorpora::corpora("foods/fruits")$fruits
sentences <- readr::read_lines("harvard-sentences.txt")
```
## String basics
In R, strings are stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.

@ -9,21 +9,10 @@ In this chapter, you will learn the best way to organize your data for R, a task
Note that this chapter explains how to change the format, or layout, of tabular data. You will learn how to use different file formats with R in the next chapter, Import Data.
## Outline
### Prerequisites
In *Section 4.1*, you will learn how the features of R determine the best way to lay out your data. This section introduces "tidy data," a way to organize your data that works particularly well with R.
*Section 4.2* teaches the basic method for making untidy data tidy. In this section, you will learn how to reorganize the values in your dataset with the `spread()` and `gather()` functions of the `tidyr` package.
*Section 4.3* explains how to split apart and combine values in your dataset to make them easier to access with R.
*Section 4.4* concludes the chapter, combining everything you've learned about `tidyr` to tidy a real dataset on tuberculosis epidemiology collected by the *World Health Organization*.
## Prerequisites
```{r message=FALSE}
```{r}
library(tidyr)
library(dplyr)
```
## Tidy data

@ -23,8 +23,6 @@ library(ggplot2)
You only need to install a package once, but you need to reload it every time you start a new session.
TODO: mention missing values.
## A graphing template
Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?