Merge branch 'master' of github.com:hadley/r4ds

hadley 2016-04-18 08:45:38 -05:00
commit b2f4766376
4 changed files with 35 additions and 37 deletions

@ -5,11 +5,11 @@ library(dplyr)
diamonds <- ggplot2::diamonds
```
Pipes let you transform the way you call deeply nested functions. Using a pipe doesn't affect what the code does; behind the scenes it is run in (almost) exactly the same way. What the pipe does is change how you write, and read, code.
Pipes let you transform the way you call deeply nested functions. Using a pipe doesn't affect what the code does; behind the scenes it is run in (almost) the exact same way. What the pipe does is change how _you_ write, and read, code.
You've been using the pipe for a while now, so you already understand the basics. The point of this chapter is to explore the pipe in more detail. You'll learn the alternatives that the pipe replaces, and the pros and cons of the pipe. Importantly, you'll also learn situations in which you should avoid the pipe.
The pipe, `%>%`, comes from the magrittr package by Stefan Milton Bache. This package provides a handful of other helpful tools if you explicitly load it. We'll explore some of those tools to close out the chapter.
The pipe, `%>%`, comes from the __magrittr__ package by Stefan Milton Bache. This package provides a handful of other helpful tools if you explicitly load it. We'll explore some of those tools to close out the chapter.
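As a minimal sketch of the claim that the pipe changes how code reads rather than what it does, the two forms below compute the same value. It assumes magrittr is installed; the vector `x` is made up for illustration.

```r
library(magrittr)

# Made-up input for illustration
x <- c(1, 4, 9, 16)

# Nested call: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: read from left to right; each result becomes the
# first argument of the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Both versions take the square roots, average them, and round the result; only the order in which you read the steps differs.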
## Piping alternatives
@ -33,7 +33,7 @@ And we'll use a function for each key verb: `hop()`, `scoop()`, and `bop()`. Usi
1. Compose functions.
1. Use the pipe.
We'll work through each approach, showing you the code and talking about the advantages and disadvantages.
We'll work through each approach, showing you the code and talking about the advantages and disadvantages. Note that these are made-up functions; don't expect this code to do anything.
### Intermediate steps
@ -45,9 +45,9 @@ foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
```
The main downside of this form is that it forces you to name each intermediate element. If there are natural names, this form feels natural, and you should use it. But in this example, there aren't natural names, and we're adding numeric suffixes just to make the names unique. That leads to two problems: the code is cluttered with unimportant names, and you have to be carefully increment the suffix on each line. Whenever I write code like this, I invariably use the wrong number on one line and then spend 10 minutes scratching my head and trying to figure out what went wrong with my code.
The main downside of this form is that it forces you to name each intermediate element. If there are natural names, this form feels natural, and you should use it. But in this example, there aren't natural names, and we're adding numeric suffixes just to make the names unique. That leads to two problems: the code is cluttered with unimportant names, and you have to carefully increment the suffix on each line. Whenever I write code like this, I invariably use the wrong number on one line and then spend 10 minutes scratching my head and trying to figure out what went wrong with my code.
You may worry that this form creates many intermediate copies of your data and takes up a lot of memory. First, worrying about memory is not a useful way to spend your time: worry about it when it becomes a problem (i.e. you run out of memory), not before. Second, R isn't stupid: if you're working with data frames, R will share columns where possible. Let's take a look at an actual data manipulation pipeline where we add a new column to the `diamonds` dataset from ggplot2:
You may worry that this form creates many intermediate copies of your data and takes up a lot of memory, but there's no need. First, worrying about memory is not a useful way to spend your time: worry about it when it becomes a problem (i.e. you run out of memory), not before. Second, R isn't stupid: if you're working with data frames, R will share columns where possible. Let's take a look at an actual data manipulation pipeline where we add a new column to the `diamonds` dataset from ggplot2:
```{r}
diamonds2 <- mutate(diamonds, price_per_carat = price / carat)
@ -110,7 +110,7 @@ bop(
```
Here the disadvantage is that you have to read from inside-out, from right-to-left, and that the arguments end up spread far apart (evocatively called the
[dagwood sandwhich](https://en.wikipedia.org/wiki/Dagwood_sandwich) problem).
[Dagwood sandwich](https://en.wikipedia.org/wiki/Dagwood_sandwich) problem). In short, this code is hard for a human to consume.
### Use the pipe
@ -161,10 +161,9 @@ This means that the pipe won't work for two classes of functions:
Other functions with this problem include `get()` and `load()`.
1. Functions that make use lazy evaluation. In R, function arguments
1. Functions that use lazy evaluation. In R, function arguments
are only computed when the function uses them, not prior to calling the
function. This means that the function can affect the global environment in
various ways. The pipe computed each element in turn, so you can't
function. The pipe computes each element in turn, so you can't
rely on this behaviour.
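Lazy evaluation itself can be seen in a small base R sketch. The helpers `record()` and `lazy_demo()` and the `events` vector are hypothetical names invented for this illustration; the point is only that the argument is computed when the function body first uses it, not before the call.

```r
# `events`, `record()` and `lazy_demo()` are made up for this sketch
events <- character()
record <- function(msg) events <<- c(events, msg)

lazy_demo <- function(x) {
  record("function body started")
  x  # the argument is evaluated here, on first use
  record("function body finished")
  invisible(NULL)
}

lazy_demo(record("argument evaluated"))
events
```

The recorded order shows the function body starting before the argument is evaluated, which is the behaviour that functions such as `tryCatch()` depend on.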
One place that this is a problem is `tryCatch()`, which lets you capture
@ -201,13 +200,13 @@ The pipe is a powerful tool, but it's not the only tool at your disposal, and it
## Other tools from magrittr
The pipe is provided by the magrittr package, by Stefan Milton Bache. Most of packages you work in this book automatically provide `%>%` for you. You might want to load magrittr yourself if you're using another package, or you want to access some of the other pipe variants that magrittr provides.
The pipe is provided by the magrittr package, by Stefan Milton Bache. Most of the packages you work with in this book will automatically provide `%>%` for you. You might want to load magrittr yourself if you're using another package, or you want to access some of the other pipe variants that magrittr provides.
```{r}
library(magrittr)
```
* When working with more complex pipes, it's some times useful to call a
* When working with more complex pipes, it's sometimes useful to call a
function for its side-effects. Maybe you want to print out the current
object, or plot it, or save it to disk. Many times, such functions don't
return anything, effectively terminating the pipe.
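One of the variants magrittr provides for this situation is the "tee" pipe, `%T>%`, which calls the right-hand side for its side effect and then passes the left-hand side along. A minimal sketch, assuming magrittr is installed and using a made-up vector `x`:

```r
library(magrittr)

# Made-up input for illustration
x <- c(1, 4, 9)

# cat() prints its input but returns NULL, so the data is lost
# before it reaches sum()
lost <- x %>% sqrt() %>% cat("\n") %>% sum()

# %T>% calls cat() for its side effect and passes the left-hand
# side (the square roots) on to sum()
kept <- x %>% sqrt() %T>% cat("\n") %>% sum()

lost
kept
```

Here `lost` is the sum of `NULL`, while `kept` is the sum of the square roots.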

@ -2,21 +2,21 @@
Code is a tool of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you. You want to write clear code so that future-you doesn't curse present-you when you look at a project again after several months have passed.
To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've struggled to solve in the past.
In the following chapters, you'll learn important programming skills:
1. We'll start by diving deep into the __pipe__, `%>%`, talking more about how
it works, what the alternatives are, and when not to use the pipe.
1. Copy-and-paste is powerful tool, but you should avoid doing it more than
1. Copy-and-paste is a powerful tool, but you should avoid doing it more than
twice. Repeating yourself in code is dangerous because it can easily lead
to errors and inconsistencies. Instead, write __functions__ which let
you extract out repeated code so that it can be easily reused.
1. Functions extract out repeated code, but you often need to repeat the
same actions on multiple inputs. You need tools for __iteration__ that
let you do similar things again again. These tools include for loops
let you do similar things again and again. These tools include for loops
and functional programming.
1. As you start to write more powerful functions, you'll need a solid
@ -24,16 +24,16 @@ In the following chapters, you'll learn important programming skills:
vectors, the three important S3 classes built on top of them, and
understand the mysteries of the list and data frame.
1. One of the partiuclarly important data structures in R is the list.
Lists are important because a list can contain other lists, so is
1. One of the particularly important data structures in R is the list.
Lists are important because a list can contain other lists, so it is
__hierarchical__. Two common scenarios where hierarchical structures
arise are JSON, and fitting many models. You'll need to learn some new
tools from the purrr package to make handling these cases as easy as
tools from the purrr package to handle these cases as easily as
possible.
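To make the items on functions and iteration above concrete, here is a rough base R sketch. The helper `rescale01()` and the data frame `df` are made up for illustration and are not taken from this diff.

```r
# Hypothetical helper: extract repeated code into a function once
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up data for illustration
df <- data.frame(a = c(1, 5, 10), b = c(2, 4, 6))

# Iteration with a for loop
out_loop <- df
for (nm in names(df)) {
  out_loop[[nm]] <- rescale01(df[[nm]])
}

# The same iteration written with a functional
out_fun <- as.data.frame(lapply(df, rescale01))

all.equal(out_loop, out_fun)
```

Writing the rescaling once and then iterating over the columns avoids copying and pasting the same expression for each column.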
The goal of these chapters is to teach you the minimum about programming that a practising data scientist must know. It turns out this is a reasonable amount, and I think it's worth investing in your programming skills. It's an investment that won't pay off immediately, but over time it will allow you to solve new problems more quickly, and reuse your insights from previous problems in new scenarios.
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely you'll first attempt will be clear.)
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is the key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions, the more likely your first attempt will be clear.)
## Learning more
@ -41,7 +41,7 @@ As you become a better R programmer, you'll learn more techniques for reducing v
To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:
* [Hands on programming with R](http://shop.oreilly.com/product/0636920028574.do),
* [Hands on Programming with R](http://shop.oreilly.com/product/0636920028574.do),
by Garrett Grolemund. This is an introduction to R as a programming language
and is a great place to start if R is your first programming language.

@ -20,11 +20,11 @@ To work with relational data you need verbs that work with pairs of tables. Ther
* __Set operations__, which treat observations like they were set elements.
The most common place to find relational data is in a _relational_ database management system, a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because it's specialised to data analysis: it makes common data analysis operations easier, at the expense of making it difficult to do other things.
The most common place to find relational data is in a _relational_ database management system, a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to data analysis: it makes common data analysis operations easier, at the expense of making it difficult to do other things.
## nycflights13 {#nycflights13-relational}
You'll learn about relational data with other datasets from the nycflights13 package. As well as the `flights` table that you've worked with so far, nycflights13 contains four other related data frames:
You can use the nycflights13 package to learn about relational data. nycflights13 contains four data frames that are related to the `flights` table that you used in Data Transformation:
* `airlines` lets you look up the full carrier name from its abbreviated
code:
@ -66,7 +66,7 @@ For nycflights13:
connects to `airlines` with the `carrier` variable.
* `flights` connects to `airports` in two ways: via the `origin` or the
`dest`.
`dest` variables.
* `flights` connects to `weather` via `origin` (the location), and
`year`, `month`, `day` and `hour` (the time).
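A connection through several variables, like the one between `flights` and `weather`, can be sketched with toy data. The tables `flights_demo` and `weather_demo` below are invented stand-ins with only two key columns, and the sketch uses base R's `merge()` rather than the dplyr verbs this chapter covers.

```r
# Made-up stand-ins for the flights and weather tables
flights_demo <- data.frame(
  origin = c("JFK", "JFK", "LGA"),
  hour   = c(6, 7, 6),
  flight = c(101, 102, 103)
)
weather_demo <- data.frame(
  origin = c("JFK", "JFK", "LGA"),
  hour   = c(6, 7, 6),
  temp   = c(39, 41, 38)
)

# Rows are matched only when both `origin` and `hour` agree
joined <- merge(flights_demo, weather_demo, by = c("origin", "hour"))
joined
```

Each flight picks up the temperature recorded at its origin in the same hour.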
@ -101,11 +101,10 @@ There are two types of keys:
* A __primary key__ uniquely identifies an observation in its own table.
For example, `planes$tailnum` is a primary key because it uniquely identifies
each plane.
each plane in the `planes` table.
* A __foreign key__ uniquely identifies an observation in another table.
For example, the `flights$tailnum` is a foreign key because it matches each
flight to a unique plane.
For example, the `flights$tailnum` is a foreign key because it appears in the `flights` table where it matches each flight to a unique plane.
A variable can be both part of a primary key _and_ a foreign key. For example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airports` table.
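Both kinds of key can be checked in code. The sketch below uses base R and two invented tables, `planes_demo` and `flights_demo`, so it runs without nycflights13; the same idea applies to the real `planes` and `flights` tables.

```r
# Made-up tables for illustration
planes_demo <- data.frame(
  tailnum = c("N101", "N102", "N103"),
  seats   = c(150, 180, 150)
)
flights_demo <- data.frame(
  flight  = c(1, 2, 3, 4),
  tailnum = c("N101", "N101", "N103", "N999")
)

# Primary key check: every value identifies exactly one row
is_primary_key <- anyDuplicated(planes_demo$tailnum) == 0

# Foreign key check: flights whose tailnum has no match in planes_demo
orphans <- flights_demo[!flights_demo$tailnum %in% planes_demo$tailnum, ]

is_primary_key
orphans
```

A foreign key may repeat in its own table (`N101` appears twice in `flights_demo`), and some values may have no match in the table they point to.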

@ -10,7 +10,7 @@ sentences <- readr::read_lines("harvard-sentences.txt")
<!-- look at http://d-rug.github.io/blog/2015/regex.fick/, http://qntm.org/files/re/re.html -->
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically unstructured or semi-structured data so you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically come as unstructured or semi-structured data. When this happens, you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead I'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter will focus on the __stringr__ package. This package provides a consistent set of functions that all work the same way and are easier to learn than the base R equivalents. We'll also take a brief look at the __stringi__ package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
@ -30,9 +30,9 @@ double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```
That means if you want to include a literal `\`, you'll need to double it up: `"\\"`.
That means if you want to include a literal backslash, you'll need to double it up: `"\\"`.
Beware that the printed representation of the string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
```{r}
x <- c("\"", "\\")
@ -83,7 +83,7 @@ Use the `sep` argument to control how they're separated:
str_c("x", "y", sep = ", ")
```
Like most other functions in R, missing values are infectious. If you want them to print as `NA`, use `str_replace_na()`:
Like most other functions in R, missing values are contagious. If you want them to print as `"NA"`, use `str_replace_na()`:
```{r}
x <- c("abc", NA)
@ -118,7 +118,7 @@ str_c(c("x", "y", "z"), collapse = ", ")
### Subsetting strings
You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` argument which give the (inclusive) position of the substring:
You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:
```{r}
x <- c("Apple", "Banana", "Pear")
@ -186,7 +186,7 @@ str_sort(x, locale = "haw") # Hawaiian
Regular expressions, regexps for short, are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
To learn regular expressions, we'll use `str_show()` and `str_show_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
### Basic matches
@ -203,7 +203,7 @@ The next step up in complexity is `.`, which matches any character (except a new
str_view(x, ".a.")
```
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. In other words, you need the regular expression `\.`, but this creates a problem. We use strings to represent regular expressions, and `\` is also the escape symbol in strings. So R tries to interpret the string `"\."` as the special character `\.`. There is no such special character, so the string causes an error; by contrast, `"\n"` reduces to a newline, `"\t"` reduces to a tab, and `"\\"` reduces to a literal `\`, which provides a way forward. To create a string that reduces to a literal backslash followed by a period, you need to escape the backslash. So to match a literal "`.`" you need to use the string `"\\."`, which reduces to the regular expression `\.`.
```{r, cache = FALSE}
# To create the regular expression, we need \\
@ -216,7 +216,7 @@ writeLines(dot)
str_view(c("abc", "a.c", "bef"), "a\\.c")
```
If `\` is used an escape character, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!
If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!
```{r, cache = FALSE}
x <- "a\\b"
@ -372,7 +372,7 @@ str_view(fruit, "(..)\\1", match = TRUE)
(You'll also see how they're useful in conjunction with `str_match()` in a few pages.)
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. They are called non-capturing parentheses.
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use them for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. `(?:)` are called non-capturing parentheses.
For example:
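One way to see the difference is with `str_match()`, which returns one column for the whole match plus one column per capturing group. This is a sketch with a made-up pattern, assuming stringr is installed:

```r
library(stringr)

# Capturing parentheses: the alternation is also returned as a group
capturing <- str_match("grey", "gr(e|a)y")

# Non-capturing parentheses: same match, but no extra group
non_capturing <- str_match("grey", "gr(?:e|a)y")

ncol(capturing)
ncol(non_capturing)
```

The first result has a second column holding the captured "e"; the second has only the column for the complete match.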
@ -401,7 +401,7 @@ Now that you've learned the basics of regular expressions, it's time to learn ho
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
* How can you split a string based on a match.
* Split a string based on a match.
Because regular expressions are so powerful, it's easy to try to solve every problem with a single regular expression. But since you're in a programming language, it's often easy to break the problem down into smaller pieces. If you find yourself getting stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.
@ -459,7 +459,7 @@ str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches.
Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches. The second function will have the suffix `_all`.
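The pairing can be sketched with `str_extract()` and `str_extract_all()`, assuming stringr is installed; the input string is made up for illustration.

```r
library(stringr)

# Made-up input with three runs of digits
x <- "a1b22c333"

# Single-match version: returns only the first match
first_match <- str_extract(x, "[0-9]+")

# `_all` version: returns a list with every match for each input string
all_matches <- str_extract_all(x, "[0-9]+")[[1]]

first_match
all_matches
```

The `_all` variant returns a list because each input string can have a different number of matches.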
### Exercises
@ -633,7 +633,7 @@ sentences %>%
You can also request a maximum number of pieces:
```{r}
fields <- c("Name: Hadley", "County: NZ", "Age: 35")
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```