Merge branch 'master' of github.com:hadley/r4ds

hadley 2016-08-15 07:38:53 -05:00
commit c660934214
7 changed files with 40 additions and 40 deletions

View File

@ -2,7 +2,7 @@
## Introduction
This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don't seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. To warm yp, trying these three seemingly simple questions:
This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don't seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. To warm up, try these three seemingly simple questions:
* Does every year have 365 days?
* Does every day have 24 hours?
@ -10,7 +10,7 @@ This chapter will show you how to work with dates and times in R. At first glanc
I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year? (It has three parts.) You might have remembered that many parts of the world use daylight savings time (DST), so that some days have 23 hours, and others have 25. You probably didn't know that some minutes have 61 seconds: every now and then, leap seconds are added because the Earth's rotation is gradually slowing down.
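For instance, lubridate encodes that three-part leap year rule in `leap_year()`; a quick sketch (the years here are arbitrary):

```{r}
library(lubridate)
# Leap if divisible by 4, except centuries, unless also divisible by 400
leap_year(c(1900, 2000, 2016, 2017))
```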
Dates and times are hard because they have to reconcile two physical phenomenon (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenonmeon including months, time zones, and DST. This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena, including months, time zones, and DST. This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
### Prerequisites
@ -69,7 +69,7 @@ mdy("January 31st, 2017")
dmy("31-Jan-2017")
```
These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data. `ymd()` is short and ununambiguous:
These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data. `ymd()` is short and unambiguous:
```{r}
ymd(20170131)
@ -149,7 +149,7 @@ Note the two tricks I needed to create these plots:
means 1 day.
1. R doesn't like to compare date-times with dates, so you can force
`ymd()` to geneate a date-time by supplying a `tz` argument.
`ymd()` to generate a date-time by supplying a `tz` argument.
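For instance, a minimal sketch of that second trick (the date is arbitrary):

```{r}
# Supplying tz makes ymd() return a date-time (POSIXct) instead of a date
ymd(20170131, tz = "UTC")
```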
### From other types
@ -322,7 +322,7 @@ Setting larger components of a date to a constant is a powerful technique that a
1. What makes the distribution of `diamonds$carat` and
`flights$sched_dep_time` similar?
1. Confirm my hypthosis that the early departures of flights in minutes
1. Confirm my hypothesis that the early departures of flights in minutes
20-30 and 50-60 are caused by scheduled flights that leave early.
Hint: create a binary variable that tells you whether or not a flight
was delayed.

View File

@ -2,7 +2,7 @@
## Introduction
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you'll learn how to read plain-text rectangular files into R. Here, we'll only scratch surface of data import, but many of the principles will translate to the other forms of data. We'll finish with a few pointers to packages that useful for other types of data.
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you'll learn how to read plain-text rectangular files into R. Here, we'll only scratch the surface of data import, but many of the principles will translate to the other forms of data. We'll finish with a few pointers to packages that are useful for other types of data.
### Prerequisites
@ -30,7 +30,7 @@ Most of readr's functions are concerned with turning flat files into data frames
[webreadr](https://github.com/Ironholds/webreadr) which is built on top
of `read_log()` and provides many more helpful tools.)
These functions all have similar syntax: once you've mastered one, you can use the others with ease. For the rest of this chapter we'll focus on `read_csv()`. Not onl are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
These functions all have similar syntax: once you've mastered one, you can use the others with ease. For the rest of this chapter we'll focus on `read_csv()`. Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
The first argument to `read_csv()` is the most important: it's the path to the file to read.
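For example, a minimal call might look like this (`heights.csv` is a placeholder path, not a file supplied with the book):

```{r, eval = FALSE}
heights <- read_csv("heights.csv")  # placeholder path
```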
@ -250,7 +250,7 @@ charToRaw("Hadley")
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it's the __American__ Standard Code for Information Interchange.
Things get more complicated for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you need to know both the values and the encoding. For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages). In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
Things get more complicated for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding. For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages). In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing. This is a good default, but will fail for data produced by older systems that don't understand UTF-8. If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you'll get complete gibberish. For example:
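(A stand-in sketch; the string below is an assumed Latin1 input, not the chapter's own example.)

```{r}
x <- "El Ni\xf1o was particularly bad this year"  # assumed Latin1 bytes
x
# Telling readr the true encoding repairs the string:
parse_character(x, locale = locale(encoding = "Latin1"))
```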
@ -340,7 +340,7 @@ Time
: `%M` minutes.
: `%S` integer seconds.
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`). Beware abbreviations:
: `%Z` Time zone (as name, e.g. `America/Chicago`). Beware of abbreviations:
if you're American, note that "EST" is a Canadian time zone that does not
have daylight savings time. It is _not_ Eastern Standard Time! We'll
come back to this in [time zones].
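A hedged parsing sketch (the format strings and values are my own, not the chapter's):

```{r}
parse_datetime("2017-01-31 18:30", "%Y-%m-%d %H:%M")
# %Z takes a full zone name per the spec above (an assumption worth testing)
parse_datetime("2017-01-31 18:30 America/Chicago", "%Y-%m-%d %H:%M %Z")
```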
@ -628,6 +628,6 @@ To get other types of data into R, we recommend starting with the tidyverse pack
__RSQLite__, __RPostgreSQL__ etc) allows you to run SQL queries against a
database and return a data frame.
For hierarchical data: use __jsonlite__ (by Jeroen Ooms) for json, and __xml2__ for XML. whichYou will need to convert them to data frames using the tools on [handling hierarchy].
For hierarchical data: use __jsonlite__ (by Jeroen Ooms) for json, and __xml2__ for XML. You will need to convert them to data frames using the tools on [handling hierarchy].
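As a flavour of the json side, a minimal sketch (the JSON string is made up):

```{r}
jsonlite::fromJSON('{"name": "Hadley", "uses": ["jsonlite", "xml2"]}')
```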
For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [__rio__](https://github.com/leeper/rio) package.

View File

@ -16,7 +16,7 @@ To work with relational data you need verbs that work with pairs of tables. Ther
* __Set operations__, which treat observations as if they were set elements.
The most common place to find relational data is in a _relational_ database management system (or RDBMS), a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that don't commonly need for data analysis.
The most common place to find relational data is in a _relational_ database management system (or RDBMS), a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren't commonly needed for data analysis.
### Prerequisites
@ -176,7 +176,7 @@ flights2 <- flights %>%
flights2
```
(Remember, when you're in RStudio, you can also use `View()` to avoid this problem).
(Remember, when you're in RStudio, you can also use `View()` to avoid this problem.)
Imagine you want to add the full airline name to the `flights2` data. You can combine the `airlines` and `flights2` data frames with `left_join()`:
@ -186,7 +186,7 @@ flights2 %>%
left_join(airlines, by = "carrier")
```
The result of joining airlines to flights is an additional variable: `name`. This is why I call this type of join a mutating join. In this case, you could have got to the same place using `mutate()` and R's base subsetting:
The result of joining airlines to flights2 is an additional variable: `name`. This is why I call this type of join a mutating join. In this case, you could have got to the same place using `mutate()` and R's base subsetting:
```{r}
flights2 %>%
@ -472,7 +472,7 @@ The inverse of a semi-join is an anti-join. An anti-join keeps the rows that _do
knitr::include_graphics("diagrams/join-anti.png")
```
Anti-joins are are useful for diagnosing join mismatches. For example, when connecting `flights` and `planes`, you might be interested to know that there are many `flights` that don't have a match in `planes`:
Anti-joins are useful for diagnosing join mismatches. For example, when connecting `flights` and `planes`, you might be interested to know that there are many `flights` that don't have a match in `planes`:
```{r}
flights %>%

View File

@ -279,7 +279,7 @@ You can also match the boundary between words with `\b`. I don't often use this
### Character classes and alternatives
There are number of special patterns that match more than one character. You've already seen `.`, which matches any character apart from a newline. There are four other useful tools:
There are a number of special patterns that match more than one character. You've already seen `.`, which matches any character apart from a newline. There are four other useful tools:
* `\d`: matches any digit.
* `\s`: matches any whitespace (e.g. space, tab, newline).
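A quick sketch of the first two (the example strings are my own):

```{r}
str_view(c("abc123", "a b c"), "\\d+")  # \d matches the run of digits
str_view(c("abc123", "a b c"), "\\s")   # \s matches the first space
```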
@ -366,7 +366,7 @@ str_view(x, 'C[LX]+?')
1. `"\\{.+\\}"`
1. `\d{4}-\d{2}-\d{2}`
1. `"\\\\{4}"`
1. Create regular expressions to find all words that:
1. Start with three consonants.
@ -378,7 +378,7 @@ str_view(x, 'C[LX]+?')
### Grouping and backreferences
Earlier, you learned about parentheses as a way to disambiguate complex expressions. They also definie "groups" that you can refer to with _backreferences_, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.
Earlier, you learned about parentheses as a way to disambiguate complex expressions. They also define "groups" that you can refer to with _backreferences_, like `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a repeated pair of letters.
```{r}
str_view(fruit, "(..)\\1", match = TRUE)
@ -401,7 +401,7 @@ str_view(fruit, "(..)\\1", match = TRUE)
1. Start and end with the same character.
1. Contain a repeated pair of letters
(e.g. "church" contains "ch" repeated twice)
(e.g. "church" contains "ch" repeated twice.)
1. Contain one letter repeated in at least three places
(e.g. "eleven" contains three "e"s.)

View File

@ -16,7 +16,7 @@ library(tibble)
## Creating tibbles {#tibbles}
The almost all of the functions that you'll use in this book produce tibbles as using tibbles is one of the common features of packages in the tidyverse. Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with `as_tibble()`:
Almost all of the functions that you'll use in this book produce tibbles, as using tibbles is one of the common features of packages in the tidyverse. Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with `as_tibble()`:
```{r}
as_tibble(iris)
@ -67,7 +67,7 @@ I often add a comment (the line starting with `#`), to make it really clear wher
## Tibbles vs. data frames
There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.
There are two main differences in the usage of a data frame vs a tibble: printing and subsetting.
### Printing
@ -83,7 +83,7 @@ tibble(
)
```
Tibbles are designed so that you don't accidentally overwhelm your console when you print large dataframes. But sometimes you need more output than the default display. There are a few options that can help.
Tibbles are designed so that you don't accidentally overwhelm your console when you print large data frames. But sometimes you need more output than the default display. There are a few options that can help.
First, you can explicitly `print()` the data frame and control the number of rows (`n`) and the `width` of the display. `width = Inf` will display all columns:
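For example, something along these lines:

```{r}
nycflights13::flights %>%
  print(n = 10, width = Inf)
```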
@ -112,7 +112,7 @@ nycflights13::flights %>%
### Subsetting
So far all the tools you've learned have worked with complete dataframes. If you want to pull out a single variable, you need some new tools, `$` and `[[`. `[[` can extract by name or position; `$` only extracts by name but is a little less typing.
So far all the tools you've learned have worked with complete data frames. If you want to pull out a single variable, you need some new tools, `$` and `[[`. `[[` can extract by name or position; `$` only extracts by name but is a little less typing.
```{r}
df <- tibble(
@ -147,7 +147,7 @@ Some older functions don't work with tibbles. If you encounter one of these func
class(as.data.frame(tb))
```
The main reason that some older functions don't work with tibble is the `[` function. We don't use `[` much in this book much because `dplyr::filter()` and `dplyr::select()` allow you to solve the same problems with clearer code (but you will learn a little about it in [vector subsetting](#vector-subsetting). With base R data frames, `[` sometimes returns a data frame, and sometimes returns a vector. With tibbles, `[` always returns a nother tibble.
The main reason that some older functions don't work with tibbles is the `[` function. We don't use `[` much in this book because `dplyr::filter()` and `dplyr::select()` allow you to solve the same problems with clearer code (but you will learn a little about it in [vector subsetting](#vector-subsetting)). With base R data frames, `[` sometimes returns a data frame, and sometimes returns a vector. With tibbles, `[` always returns another tibble.
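A small made-up illustration of the difference:

```{r}
df <- data.frame(x = 1:3, y = letters[1:3])
tb <- as_tibble(df)
class(df[, "x"])  # base data frame: [ drops a single column to a bare vector
class(tb[, "x"])  # tibble: [ returns another tibble
```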
## Exercises

View File

@ -8,7 +8,7 @@
> "Tidy datasets are all alike, but every messy dataset is messy in its
> own way." -- Hadley Wickham
In this chapter, you will learn a consistent way to organise your data in R, a organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long-term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
In this chapter, you will learn a consistent way to organise your data in R, an organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long-term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
This chapter will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package. If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.
@ -41,7 +41,7 @@ There are three interrelated rules which make a dataset tidy:
1. Each variable must have its own column.
1. Each observation must have its own row.
1. Each value much have its own cell.
1. Each value must have its own cell.
Figure \@ref(fig:tidy-structure) shows the rules visually.
@ -49,7 +49,7 @@ Figure \@ref(fig:tidy-structure) shows the rules visually.
knitr::include_graphics("images/tidy-1.png")
```
These three rules are interrelated because it's impossible to only satisfy two of the three. That interrelationship leads to even simpler set of practical instructions:
These three rules are interrelated because it's impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:
1. Put each dataset in a tibble.
1. Put each variable in a column.
@ -119,7 +119,7 @@ The second step is to resolve one of two common problems:
1. One variable might be spread across multiple columns.
1. One observation might be scattered across mutliple rows.
1. One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems; it'll only suffer from both if you're really unlucky! To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`.
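As a preview, a hedged sketch using tidyr's built-in `table4a` (the arguments follow the `gather()` interface described below):

```{r}
table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
```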
@ -185,10 +185,10 @@ To tidy this up, we first analyse the representation in similar way to `gather()
* The column that contains variable names, the `key` column. Here, it's
`type`.
* The column that contains values froms multiple variables, the `value`
* The column that contains values from multiple variables, the `value`
column. Here it's `count`.
Once we've figured that out, we can use `spread()`, as shown progammatically below, and visually in Figure \@ref(fig:tidy-spread).
Once we've figured that out, we can use `spread()`, as shown programmatically below, and visually in Figure \@ref(fig:tidy-spread).
```{r}
spread(table2, key = type, value = count)
@ -295,7 +295,7 @@ table3 %>%
You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings. When using integers to separate strings, the length of `sep` should be one less than the number of names in `into`.
You can use this arrangement to separate the last two digits of each year. This make this data lesss tidy, but is useful in other cases, as you'll see in a little bit.
You can use this arrangement to separate the last two digits of each year. This makes this data less tidy, but is useful in other cases, as you'll see in a little bit.
```{r}
table3 %>%
@ -317,7 +317,7 @@ table5 %>%
unite(new, century, year)
```
In this case we also need to use the `sep` arguent. The default will place an underscore (`_`) between the values from different columns. Here we don't want any separator so we use `""`:
In this case we also need to use the `sep` argument. The default will place an underscore (`_`) between the values from different columns. Here we don't want any separator so we use `""`:
```{r}
table5 %>%
@ -345,7 +345,7 @@ table5 %>%
## Missing values
Changing the representation of a dataset brings up an important subtlety of missing values. Suprisingly, a value can be missing in one of two possible ways:
Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:
* __Explicitly__, i.e. flagged with `NA`.
* __Implicitly__, i.e. simply not present in the data.
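A tiny made-up example (the tibble is mine, not the chapter's):

```{r}
stocks <- tibble(
  year   = c(2015, 2015, 2016),
  qtr    = c(1, 2, 2),
  return = c(1.88, 0.59, 0.92)
)
# The return for 2016 Q1 is implicitly missing; complete() makes it an
# explicit NA by filling in all year/qtr combinations:
stocks %>% complete(year, qtr)
```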
@ -440,9 +440,9 @@ The best place to start is almost always to gathering together the columns that
* We don't know what all the other columns are yet, but given the structure
in the variable names (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`)
these are likely to be values, not variable.
these are likely to be values, not variables.
So we need to gather together all the columns from `new_sp_m3544` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells repesent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.
So we need to gather together all the columns from `new_sp_m3544` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells represent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.
```{r}
who1 <- who %>%
@ -550,7 +550,7 @@ who %>%
## Non-tidy data
Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the perjorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well founded data structures that are not tidy data. There are two mains reasons to use other data structures:
Before we continue on to other topics, it's worth talking briefly about non-tidy data. Earlier in the chapter, I used the pejorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well-founded data structures that are not tidy data. There are two main reasons to use other data structures:
* Alternative representations may have substantial performance or space
advantages.

View File

@ -23,13 +23,13 @@ This part of the book proceeds as follows:
You'll learn the underlying principles, and how to get your data into a
tidy form.
Data wrangling also encompasses data transformation, which you've already learn a little about. Now we'll focus new skills for three specific types of data you will frequently encounter in practice:
Data wrangling also encompasses data transformation, which you've already learned a little about. Now we'll focus on new skills for three specific types of data you will frequently encounter in practice:
* [Dates and times] will give you the key tools for working with
dates and date-times.
* [Relational data] will give you tools for working with multiple
interrelated datasets.
* [Strings] will introduce regular expressions, a powerful tool for
manipulating strings.
* [Relational data] will give you tools for working with multiple
interrelated datasets.
* [Dates and times] will give you the key tools for working with
dates and date-times.