Small edits to transform.Rmd (typos and mistakes)

This commit is contained in:
Garrett 2016-04-05 20:56:21 -04:00
parent 0bf6fb4da4
commit 6803e565fd
1 changed files with 57 additions and 38 deletions

@ -58,7 +58,7 @@ class(flights)
This is called a `tbl_df` (pronounced "tibble diff") or a `data_frame` (pronounced "data underscore frame"; cf. `data dot frame`). Generally, however, we won't worry about this relatively minor difference and will refer to everything as data frames.
You'll learn more about how that works in data structures. If you want to convert your own data frames to this special case, use `as.data_frame()`. I recommend it for large data frames as it makes interactive exploration much less painful.
You'll learn more about how `data_frame` works in data structures. If you want to convert your own data frames to this special case, use `as.data_frame()`. I recommend it for large data frames as it makes interactive exploration much less painful.
To create your own new tbl\_df from individual vectors, use `data_frame()`:
@ -114,7 +114,7 @@ There are five dplyr functions that you will use to do the vast majority of data
* create new variables with functions of existing variables (`mutate()`), or
* collapse many values down to a single summary (`summarise()`).
These can all be used in conjunction with `group_by()` which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions the provide the verbs for a language of data manipulation.
These can all be used in conjunction with `group_by()` which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
All verbs work similarly:
@ -200,7 +200,7 @@ Multiple arguments to `filter()` are combined with "and". To get more complicate
filter(flights, month == 11 | month == 12)
```
Note the order isn't like English. The following expression doesn't find on months that equal 11 or 12. Instead it finds all months that equal `11 | 12`, which is `TRUE`. In a numeric context (like here), `TRUE` becomes one, so this finds all flights in January, not November or December.
Note the order isn't like English. The following expression doesn't find months that equal 11 or 12. Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`. In a numeric context (like here), `TRUE` becomes one, so this finds all flights in January, not November or December (it is the equivalent of `filter(flights, month == 1)`).
```{r, eval = FALSE}
filter(flights, month == 11 | 12)
@ -225,7 +225,7 @@ filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
```
Note that R has both `&` and `|` and `&&` and `||`. `&` and `|` are vectorised: you give them two vectors of logical values and they return a vector of logical values. `&&` and `||` are scalar operators: you give them individual `TRUE`s or `FALSE`s. They're used in `if` statements when programming. You'll learn about that later on.
Note that R has both `&` and `|` and `&&` and `||`. `&` and `|` are vectorised: you give them two vectors of logical values and they return a vector of logical values. `&&` and `||` are scalar operators: you give them individual `TRUE`s or `FALSE`s. They're used in `if` statements when programming. You'll learn about that later on in Chapter ?.
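The difference between the two families of operators can be sketched in base R. The vectors `x` and `y` and the value `n` below are made up for illustration; the last two lines revisit the `11 | 12` pitfall described earlier.

```{r}
# Made-up logical vectors, for illustration only.
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

x & y   # element-wise "and": one result per pair of elements
x | y   # element-wise "or"

# && and || expect a single TRUE or FALSE on each side,
# which is why they suit `if` conditions.
n <- 5
if (n > 0 && n < 10) "single digit" else "something else"

# `11 | 12` is a single TRUE, and TRUE compares equal to 1.
11 | 12
(11 | 12) == 1
```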
Sometimes you want to find all rows after the first `TRUE`, or all rows until the first `FALSE`. The cumulative functions `cumany()` and `cumall()` allow you to find these values:
@ -239,11 +239,11 @@ filter(df, cumany(x)) # all rows after first TRUE
filter(df, cumall(y)) # all rows until first FALSE
```
Whenever you start using multipart expressions in your `filter()`, it's typically a good idea to make them explicit variables with `mutate()` so that you can more easily check your work. You'll learn about `mutate()` in the next section.
Whenever you start using multipart expressions in your `filter()`, it's typically a good idea to make the expressions explicit variables with `mutate()` so that you can more easily check your work. You'll learn about `mutate()` in the next section.
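As a rough sketch of that advice, the example below names each part of a compound condition before filtering on it. The data frame `delays_toy` and the column names `late_arr`, `late_dep`, and `on_time` are hypothetical stand-ins, not part of the `flights` data.

```{r}
library(dplyr)

# Hypothetical toy data standing in for `flights`.
delays_toy <- data.frame(
  arr_delay = c(10, 150, 30, 200),
  dep_delay = c(5, 20, 130, 180)
)

# Give each piece of the condition its own column.
checked <- mutate(delays_toy,
  late_arr = arr_delay > 120,
  late_dep = dep_delay > 120,
  on_time  = !(late_arr | late_dep)
)

checked                    # inspect the intermediate logical columns
filter(checked, on_time)   # then filter on the named condition
```

Printing `checked` before filtering makes it easy to spot a condition that does not behave as intended.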
### Missing values
One important feature of R that can make comparison tricky is the missing value, `NA`. `NA` represents an unknown value so missing values are "infectious": any operation involving an unknown value will also be unknown.
One important feature of R that can make comparison tricky is the missing value, `NA`. `NA` represents an unknown value so missing values are "contagious": any operation involving an unknown value will also be unknown.
```{r}
NA > 5
@ -305,7 +305,7 @@ filter(df, is.na(x) | x > 1)
arrange(flights, year, month, day)
```
Use `desc()` to order a column in descending order:
Use `desc()` to re-order by a column in descending order:
```{r}
arrange(flights, desc(arr_delay))
@ -358,7 +358,7 @@ There are a number of helper functions you can use within `select()`:
* `ends_with("xyz")`: matches names that end with "xyz".
* `contains("ijk")`: matches name that contain "ijk".
* `contains("ijk")`: matches names that contain "ijk".
* `matches("(.)\\1")`: selects variables that match a regular expression.
This one matches any variables that contain repeated characters. You'll
@ -382,7 +382,7 @@ rename(flights, tail_num = tailnum)
--------------------------------------------------------------------------------
The `select()` function works similarly to the `select` argument in `base::subset()`. Because the dplyr philosophy is to have small functions that do one thing well, it is its own function in dplyr.
The `select()` function works similarly to the `select` argument in `base::subset()`. `select()` is its own function in dplyr because the dplyr philosophy is to have small functions that each do one thing well.
--------------------------------------------------------------------------------
@ -395,7 +395,7 @@ The `select()` function works similarly to the `select` argument in `base::subse
Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of `mutate()`.
`mutate()` always adds new columns at the end so we'll start by creating a narrower dataset so we can see the new variables. Remember that when you're in RStudio, the easiest way to see all the columns is `View()`
`mutate()` always adds new columns at the end of your dataset so we'll start by creating a narrower dataset so we can see the new variables. Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
```{r}
flights_sml <- select(flights,
@ -410,7 +410,7 @@ mutate(flights_sml,
)
```
Note that you can refer to columns that you've just created:
Note that you can refer to columns in `mutate()` that you've just created:
```{r}
mutate(flights_sml,
@ -432,13 +432,13 @@ transmute(flights,
--------------------------------------------------------------------------------
`mutate()` is similar to `transform()` in base R, but in `mutate()` you can refer to variables you've just created; in `transform()` you can not.
`mutate()` is similar to `transform()` in base R, but in `mutate()` you can refer to variables you've just created; in `transform()` you cannot.
--------------------------------------------------------------------------------
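A small sketch of that difference, using a made-up data frame `df` and hypothetical column names `doubled` and `plus_one`. The `transform()` call is wrapped in `try()` because it is expected to fail when no object called `doubled` exists outside the data frame.

```{r}
library(dplyr)

df <- data.frame(x = 1:3)

# mutate() can use `doubled` in the same call that creates it.
mutate(df, doubled = x * 2, plus_one = doubled + 1)

# transform() evaluates every argument against the original columns,
# so `doubled` is not found (assuming no such object exists elsewhere).
res <- try(transform(df, doubled = x * 2, plus_one = doubled + 1), silent = TRUE)
inherits(res, "try-error")
```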
### Useful functions
There are many functions for creating new variables. The key property is that the function must be vectorised: it needs to return the same number of outputs as inputs. There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
There are many functions for creating new variables that you can use with `mutate()`. The key property is that the function must be vectorised: it needs to return the same number of outputs as inputs. There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
* Arithmetic operators: `+`, `-`, `*`, `/`, `^`. These are all vectorised, so
you can work with multiple columns. These operations use "recycling rules"
@ -477,12 +477,25 @@ There are many functions for creating new variables. The key property is that th
values. This allows you to compute running differences (e.g. `x - lag(x)`)
or find when values change (`x != lag(x)`). They are most useful in
conjunction with `group_by()`, which you'll learn about shortly.
```{r}
x <- 1:10
x
lag(x)
lead(x)
```
* Cumulative and rolling aggregates: R provides functions for running sums,
products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`.
dplyr provides `cummean()` for cumulative means. If you need rolling
aggregates (i.e. a sum computed over a rolling window), try the RcppRoll
package.
```{r}
x
cumsum(x)
cummean(x)
```
* Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, which you learned about
earlier. If you're doing a complex sequence of logical operations it's
@ -495,13 +508,13 @@ There are many functions for creating new variables. The key property is that th
ranks; use `desc(x)` to give the largest values the smallest ranks.
```{r}
x <- c(1, 2, 2, NA, 3, 4)
y <- c(1, 2, 2, NA, 3, 4)
data_frame(
row_number(x),
min_rank(x),
dense_rank(x),
percent_rank(x),
cume_dist(x)
row_number(y),
min_rank(y),
dense_rank(y),
percent_rank(y),
cume_dist(y)
) %>% knitr::kable()
```
@ -548,18 +561,18 @@ The last verb is `summarise()`. It collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```
That's not terribly useful unless we pair it with `group_by()`. This changes the unit of analysis from the complete dataset to individual groups. When you the dplyr verbs on a grouped data frame they'll be automatically applied "by group". For example, if we applied exactly the same code to a data frame grouped by day, we get the average delay per day:
That's not terribly useful unless we pair it with `group_by()`. This changes the unit of analysis from the complete dataset to individual groups. When you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group". For example, if we apply exactly the same code to a data frame grouped by date, we get the average delay per date:
```{r}
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
```
Together `group_by()` and `summarise()` provide one of tools that you'll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
Together `group_by()` and `summarise()` provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
### Combining multiple operations with the pipe
Imagine we want to explore the relationship between the distance and average delay for each location. Using what you already know about dplyr, you might write code like this:
Imagine that we want to explore the relationship between the distance and average delay for each location. Using what you already know about dplyr, you might write code like this:
```{r, fig.width = 6}
by_dest <- group_by(flights, dest)
@ -577,13 +590,13 @@ ggplot(delay, aes(dist, delay)) +
geom_smooth(se = FALSE)
```
There are three steps:
There are three steps to prepare this data:
* Group flights by destination
1. Group flights by destination
* Summarise to compute distance, average delay, and number of flights.
2. Summarise to compute distance, average delay, and number of flights.
* Filter to remove noisy points and Honolulu airport which is almost
3. Filter to remove noisy points and Honolulu airport, which is almost
twice as far away as the next closest airport.
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things well is hard, so this slows us down.
@ -603,9 +616,9 @@ delays <- flights %>%
This focuses on the transformations, not what's being transformed, which makes the code easier to read. You can read it as a series of imperative statements: group, then summarise, then filter. As suggested by this reading, a good way to pronounce `%>%` when reading code is "then".
Behind the scenes, `x %>% f(y)` turns into `f(x, y)` so you can use it to rewrite multiple operations that you can read left-to-right, top-to-bottom. We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter XYZ.
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter XYZ.
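The rewriting rule can be checked on a small example. This is only a sketch: the vector `x` is made up, and `round()` and `sum()` stand in for `f()` and `g()`. Loading dplyr makes `%>%` available.

```{r}
library(dplyr)  # makes %>% available

x <- c(1.234, 5.678, 9.1011)

# Piped form: round, then sum.
piped <- x %>% round(1) %>% sum()

# The nested call that the pipe expands to.
nested <- sum(round(x, 1))

identical(piped, nested)
```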
Most of the packages you'll learn through this book have been designed to work with the pipe (tidyr, dplyr, stringr, purrr, ...). The only exception is ggplot2: it was developed considerably before the discovery of the pipe. Unfortunately the next iteration of ggplot2, ggvis, which does use the pipe, isn't ready for prime time yet.
Most of the packages you'll learn through this book have been designed to work with the pipe (tidyr, dplyr, stringr, purrr, ...). The only exception is ggplot2: it was developed considerably before the pipe was discovered. Unfortunately the next iteration of ggplot2, ggvis, which does use the pipe, isn't ready for prime time yet.
### Missing values
@ -617,7 +630,7 @@ flights %>%
summarise(mean = mean(dep_delay))
```
We get a lot of missing values! That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value. Fortunately, all aggregation functions have an `na.rm` argument which removes the missing values prior to computation:
We get a lot of missing values! That's because aggregation functions obey the usual rule of missing values: if there's any missing value in the input, the output will be a missing value. You'll learn more about aggregation functions in Section 5.7.4. Fortunately, all aggregation functions have an `na.rm` argument which removes the missing values prior to computation:
```{r}
flights %>%
@ -637,7 +650,7 @@ not_cancelled %>%
### Counts
Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`). That way you can check that you're not drawing conclusions based on very small amounts of data amount of non-missing data.
Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`). That way you can check that you're not drawing conclusions based on very small amounts of non-missing data.
For example, let's look at the planes (identified by their tail number) that have the highest average delays:
@ -670,7 +683,7 @@ ggplot(delays, aes(n, delay)) +
Not surprisingly, there is much more variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or many other summaries) vs number of observations, you'll see that the variation decreases as the sample size increases.
When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This what the following code does, and also shows you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from `%>%` to `+`, but once you get the hang of it, it's quite convenient.
When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, and also shows you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from `%>%` to `+`, but once you get the hang of it, it's quite convenient.
```{r}
delays %>%
@ -723,7 +736,7 @@ You can find a good explanation of this problem at <http://varianceexplained.org
Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
* Measure of location: we've used `mean(x)`, but `median(x)` is also
useful.The mean is the sum divided by the length; the median is a value
useful. The mean is the sum divided by the length; the median is a value
where 50% of `x` is above, and 50% is below.
It's sometimes useful to combine aggregation with logical subsetting:
@ -733,7 +746,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0])
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
```
@ -743,14 +756,14 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
are robust equivalents that may be more useful if you have outliers.
```{r}
# Why is distance to some destinations more variable than others?
# Why is distance to some destinations more variable than to others?
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
```
* By rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
* Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
```{r}
# When do the first and last flights leave each day?
@ -762,8 +775,8 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
)
```
* By position: `first(x)`, `nth(x, 2)`, `last(x)`. These work similarly to
`x[1]`, x[n], and `x[length(x)]` but let you set a default value if that
* Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`. These work similarly to
`x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that
position does not exist (i.e. you're trying to get the 3rd element from a
group that only has two elements).
@ -810,7 +823,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
count(tailnum, wt = distance)
```
* Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`
* Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.
When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
This makes `sum()` and `mean()` particularly useful: `sum(x)` gives the
number of `TRUE`s in `x`, and `mean(x)` gives the proportion.
@ -845,6 +858,12 @@ Be careful when progressively rolling up summaries: it's OK for sums and counts,
If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
```{r}
daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flights
```
### Exercises
1. Brainstorm at least 5 different ways to assess the typical delay