Break up data-transform content
Commit 861e27026e (parent 40d7bcb5d0)
|
@ -117,115 +117,6 @@ When this happens you'll get an informative error:
|
|||
filter(flights, month = 1)
|
||||
```
|
||||
|
||||
There's another common problem you might encounter when using `==`: floating point numbers.
|
||||
These results might surprise you!
|
||||
|
||||
```{r}
|
||||
(sqrt(2) ^ 2) == 2
|
||||
(1 / 49 * 49) == 1
|
||||
```
|
||||
|
||||
Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.
|
||||
Instead of relying on `==`, use `near()`:
|
||||
|
||||
```{r}
|
||||
near(sqrt(2) ^ 2, 2)
|
||||
near(1 / 49 * 49, 1)
|
||||
```
|
||||
|
||||
### Logical operators
|
||||
|
||||
Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output.
|
||||
For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not".
|
||||
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
|
||||
|
||||
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of Boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded regions show which parts each operator selects.", fig.alt = "Seven Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x, xor(x, y) is everything except the intersection of x and y, y is all of y, and x | y is everything in x or y."}
|
||||
knitr::include_graphics("diagrams/transform-logical.png")
|
||||
```
|
||||
|
||||
The following code finds all flights that departed in November or December:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
filter(flights, month == 11 | month == 12)
|
||||
```
|
||||
|
||||
The order of operations doesn't work like English.
|
||||
You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "find all flights that departed in November or December".
|
||||
Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
|
||||
In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
|
||||
This is quite confusing!
|
||||
|
||||
A useful short-hand for this problem is `x %in% y`.
|
||||
This will select every row where `x` is one of the values in `y`.
|
||||
We could use it to rewrite the code above:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
nov_dec <- filter(flights, month %in% c(11, 12))
|
||||
```
|
||||
|
||||
Sometimes you can simplify complicated subsetting by remembering De Morgan's laws: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
|
||||
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
filter(flights, !(arr_delay > 120 | dep_delay > 120))
|
||||
filter(flights, arr_delay <= 120, dep_delay <= 120)
|
||||
```
|
||||
|
||||
As well as `&` and `|`, R also has `&&` and `||`.
|
||||
Don't use them here!
|
||||
You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
|
||||
|
||||
Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead.
|
||||
That makes it much easier to check your work.
|
||||
You'll learn how to create new variables shortly.
|
||||
|
||||
### Missing values {#missing-values-filter}
|
||||
|
||||
One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
|
||||
`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
|
||||
|
||||
```{r}
|
||||
NA > 5
|
||||
10 == NA
|
||||
NA + 10
|
||||
NA / 2
|
||||
```
|
||||
|
||||
The most confusing result is this one:
|
||||
|
||||
```{r}
|
||||
NA == NA
|
||||
```
|
||||
|
||||
It's easiest to understand why this is the case with a bit more context:
|
||||
|
||||
```{r}
|
||||
# Let x be Mary's age. We don't know how old she is.
|
||||
x <- NA
|
||||
|
||||
# Let y be John's age. We don't know how old he is.
|
||||
y <- NA
|
||||
|
||||
# Are John and Mary the same age?
|
||||
x == y
|
||||
# We don't know!
|
||||
```
|
||||
|
||||
If you want to determine if a value is missing, use `is.na()`:
|
||||
|
||||
```{r}
|
||||
is.na(x)
|
||||
```
|
||||
|
||||
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
|
||||
If you want to preserve missing values, ask for them explicitly:
|
||||
|
||||
```{r}
|
||||
df <- tibble(x = c(1, NA, 3))
|
||||
filter(df, x > 1)
|
||||
filter(df, is.na(x) | x > 1)
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Find all flights that
|
||||
|
@ -238,20 +129,10 @@ filter(df, is.na(x) | x > 1)
|
|||
f. Were delayed by at least an hour, but made up over 30 minutes in flight
|
||||
g. Departed between midnight and 6am (inclusive)
|
||||
|
||||
2. Another useful dplyr filtering helper is `between()`.
|
||||
What does it do?
|
||||
Can you use it to simplify the code needed to answer the previous challenges?
|
||||
|
||||
3. How many flights have a missing `dep_time`?
|
||||
2. How many flights have a missing `dep_time`?
|
||||
What other variables are missing?
|
||||
What might these rows represent?
|
||||
|
||||
4. Why is `NA ^ 0` not missing?
|
||||
Why is `NA | TRUE` not missing?
|
||||
Why is `FALSE & NA` not missing?
|
||||
Can you figure out the general rule?
|
||||
(`NA * 0` is a tricky counterexample!)
|
||||
|
||||
## Arrange rows with `arrange()`
|
||||
|
||||
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
|
||||
|
@ -268,14 +149,6 @@ Use `desc()` to re-order by a column in descending order:
|
|||
arrange(flights, desc(dep_delay))
|
||||
```
|
||||
|
||||
Missing values are always sorted at the end:
|
||||
|
||||
```{r}
|
||||
df <- tibble(x = c(5, 2, NA))
|
||||
arrange(df, x)
|
||||
arrange(df, desc(x))
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Sort `flights` to find the flights with longest departure delays.
|
||||
|
@ -286,9 +159,6 @@ arrange(df, desc(x))
|
|||
3. Which flights travelled the farthest?
|
||||
Which travelled the shortest?
|
||||
|
||||
4. How could you use `arrange()` to sort all missing values to the start?
|
||||
(Hint: use `!is.na()`).
|
||||
|
||||
## Select columns with `select()` {#select}
|
||||
|
||||
It's not uncommon to get datasets with hundreds or even thousands of variables.
|
||||
|
@ -396,80 +266,6 @@ transmute(flights,
|
|||
)
|
||||
```
|
||||
|
||||
### Useful creation functions {#mutate-funs}
|
||||
|
||||
There are many functions for creating new variables that you can use with `mutate()`.
|
||||
The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output.
|
||||
There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
|
||||
|
||||
- Arithmetic operators: `+`, `-`, `*`, `/`, `^`.
|
||||
These are all vectorised, using the so-called "recycling rules".
|
||||
If one argument is shorter than the other, it will be automatically extended to be the same length.
|
||||
This is most useful when one of the arguments is a single number: `air_time / 60`, `hour * 60 + minute`, etc.
|
||||
|
||||
Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later.
|
||||
For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean.
|
||||
|
||||
- Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`.
|
||||
Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
|
||||
For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with:
|
||||
|
||||
```{r}
|
||||
transmute(flights,
|
||||
dep_time,
|
||||
hour = dep_time %/% 100,
|
||||
minute = dep_time %% 100
|
||||
)
|
||||
```
|
||||
|
||||
- Logs: `log()`, `log2()`, `log10()`.
|
||||
Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
|
||||
They also convert multiplicative relationships to additive.
|
||||
|
||||
All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
|
||||
|
||||
- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
|
||||
This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
|
||||
They are most useful in conjunction with `group_by()`, which you'll learn about shortly.
|
||||
|
||||
```{r}
|
||||
(x <- 1:10)
|
||||
lag(x)
|
||||
lead(x)
|
||||
```
|
||||
|
||||
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
|
||||
If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
|
||||
|
||||
```{r}
|
||||
x
|
||||
cumsum(x)
|
||||
cummean(x)
|
||||
```
|
||||
|
||||
- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
|
||||
If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
|
||||
|
||||
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
|
||||
It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
|
||||
The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.
|
||||
|
||||
```{r}
|
||||
y <- c(1, 2, 2, NA, 3, 4)
|
||||
min_rank(y)
|
||||
min_rank(desc(y))
|
||||
```
|
||||
|
||||
If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
|
||||
See their help pages for more details.
|
||||
|
||||
```{r}
|
||||
row_number(y)
|
||||
dense_rank(y)
|
||||
percent_rank(y)
|
||||
cume_dist(y)
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
```{r, eval = FALSE, echo = FALSE}
|
||||
|
@ -588,7 +384,7 @@ Working with the pipe is one of the key criteria for belonging to the tidyverse.
|
|||
The only exception is ggplot2: it was written before the pipe was discovered.
|
||||
Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.
|
||||
|
||||
### Missing values {#missing-values-summarise}
|
||||
## Missing values {#missing-values-summarise}
|
||||
|
||||
You may have wondered about the `na.rm` argument we used above.
|
||||
What happens if we don't set it?
|
||||
|
@ -621,7 +417,7 @@ not_cancelled %>%
|
|||
summarise(mean = mean(dep_delay))
|
||||
```
|
||||
|
||||
### Grouping by multiple variables
|
||||
## Grouping by multiple variables
|
||||
|
||||
You can group a data frame by multiple variables as well.
|
||||
Note that the grouping information is printed on top of the output.
|
||||
|
@ -770,134 +566,6 @@ batters %>%
|
|||
|
||||
You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
|
||||
|
||||
### Useful summary functions {#summarise-funs}
|
||||
|
||||
Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
|
||||
|
||||
- Measures of location: we've used `mean(x)`, but `median(x)` is also useful.
|
||||
The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
group_by(month) %>%
|
||||
summarise(
|
||||
med_arr_delay = median(arr_delay),
|
||||
med_dep_delay = median(dep_delay)
|
||||
)
|
||||
```
|
||||
|
||||
It's sometimes useful to combine aggregation with logical subsetting.
|
||||
We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting).
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(
|
||||
avg_delay1 = mean(arr_delay),
|
||||
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
|
||||
)
|
||||
```
|
||||
|
||||
- Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.
|
||||
The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread.
|
||||
The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
|
||||
|
||||
```{r}
|
||||
# Why is distance to some destinations more variable than to others?
|
||||
not_cancelled %>%
|
||||
group_by(dest) %>%
|
||||
summarise(distance_sd = sd(distance)) %>%
|
||||
arrange(desc(distance_sd))
|
||||
```
|
||||
|
||||
- Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
|
||||
Quantiles are a generalisation of the median.
|
||||
For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
|
||||
|
||||
```{r}
|
||||
# When do the first and last flights leave each day?
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(
|
||||
first = min(dep_time),
|
||||
last = max(dep_time)
|
||||
)
|
||||
```
|
||||
|
||||
- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
|
||||
These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
|
||||
For example, we can find the first and last departure for each day:
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(
|
||||
first_dep = first(dep_time),
|
||||
last_dep = last(dep_time)
|
||||
)
|
||||
```
|
||||
|
||||
These functions are complementary to filtering on ranks.
|
||||
Filtering gives you all variables, with each observation in a separate row:
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
mutate(r = min_rank(desc(dep_time))) %>%
|
||||
filter(r %in% range(r))
|
||||
```
|
||||
|
||||
- Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
|
||||
To count the number of non-missing values, use `sum(!is.na(x))`.
|
||||
To count the number of distinct (unique) values, use `n_distinct(x)`.
|
||||
|
||||
```{r}
|
||||
# Which destinations have the most carriers?
|
||||
not_cancelled %>%
|
||||
group_by(dest) %>%
|
||||
summarise(carriers = n_distinct(carrier)) %>%
|
||||
arrange(desc(carriers))
|
||||
```
|
||||
|
||||
Counts are so useful that dplyr provides a simple helper if all you want is a count:
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
count(dest)
|
||||
```
|
||||
|
||||
Just like with `group_by()`, you can also provide multiple variables to `count()`.
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
count(carrier, dest)
|
||||
```
|
||||
|
||||
You can optionally provide a weight variable.
|
||||
For example, you could use this to "count" (sum) the total number of miles a plane flew:
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
count(tailnum, wt = distance)
|
||||
```
|
||||
|
||||
- Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.
|
||||
When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
|
||||
This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion.
|
||||
|
||||
```{r}
|
||||
# How many flights left before 5am? (these usually indicate delayed
|
||||
# flights from the previous day)
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(n_early = sum(dep_time < 500))
|
||||
|
||||
# What proportion of flights are delayed by more than an hour?
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(hour_prop = mean(arr_delay > 60))
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
|
||||
|
|
|
@ -1,3 +1,191 @@
|
|||
# Logicals and numbers {#logicals-numbers}
|
||||
|
||||
## Introduction
|
||||
|
||||
`between()`
|
||||
|
||||
## Logical operators
|
||||
|
||||
Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output.
|
||||
For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not".
|
||||
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
|
||||
|
||||
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of Boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded regions show which parts each operator selects.", fig.alt = "Seven Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x, xor(x, y) is everything except the intersection of x and y, y is all of y, and x | y is everything in x or y."}
|
||||
knitr::include_graphics("diagrams/transform-logical.png")
|
||||
```
|
||||
|
||||
The following code finds all flights that departed in November or December:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
filter(flights, month == 11 | month == 12)
|
||||
```
|
||||
|
||||
The order of operations doesn't work like English.
|
||||
You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "find all flights that departed in November or December".
|
||||
Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
|
||||
In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
|
||||
This is quite confusing!
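One way to see what goes wrong is to evaluate the pieces on a small vector. The sketch below uses a made-up `month` vector as a stand-in for the column in `flights`:

```{r}
# A hypothetical vector of months, standing in for flights$month
month <- c(1, 11, 12)

# `|` treats any non-zero number as TRUE, so this is just TRUE
11 | 12

# TRUE is coerced to 1 in the comparison, so only January matches
month == (11 | 12)

# Spelling out both comparisons gives the intended result
month == 11 | month == 12
```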
|
||||
|
||||
A useful short-hand for this problem is `x %in% y`.
|
||||
This will select every row where `x` is one of the values in `y`.
|
||||
We could use it to rewrite the code above:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
nov_dec <- filter(flights, month %in% c(11, 12))
|
||||
```
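To see what `%in%` returns on its own, you can try it outside of `filter()`. This is a small sketch with a made-up `month` vector that contains no missing values; it returns one `TRUE` or `FALSE` per element of `x`:

```{r}
# A hypothetical vector of months
month <- c(1, 11, 12, 6)

# One logical value per element of `month`
month %in% c(11, 12)

# For this vector, it matches the longer form that chains `==` with `|`
identical(month %in% c(11, 12), month == 11 | month == 12)
```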
|
||||
|
||||
Sometimes you can simplify complicated subsetting by remembering De Morgan's laws: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
|
||||
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
filter(flights, !(arr_delay > 120 | dep_delay > 120))
|
||||
filter(flights, arr_delay <= 120, dep_delay <= 120)
|
||||
```
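If you'd like to convince yourself that the two forms agree, you can check them on every combination of `TRUE` and `FALSE`. The delay values in the second half are made up for illustration and contain no missing values:

```{r}
# All four combinations of two logical values
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

identical(!(x & y), !x | !y)
identical(!(x | y), !x & !y)

# The same idea applied to hypothetical delays (in minutes)
arr_delay <- c(10, 130, 50, 200)
dep_delay <- c(5, 20, 150, 180)

not_either <- !(arr_delay > 120 | dep_delay > 120)
both_small <- arr_delay <= 120 & dep_delay <= 120
identical(not_either, both_small)
```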
|
||||
|
||||
As well as `&` and `|`, R also has `&&` and `||`.
|
||||
Don't use them here!
|
||||
You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
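In short, `&` and `|` work element by element, which is what `filter()` needs, while `&&` and `||` are meant for single `TRUE` or `FALSE` values. A minimal sketch of the difference:

```{r}
# `&` compares element by element and returns a vector
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
elementwise

# `&&` and `||` combine single values and return a single value
TRUE && FALSE
FALSE || TRUE
```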
|
||||
|
||||
Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead.
|
||||
That makes it much easier to check your work.
|
||||
You'll learn how to create new variables shortly.
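As a preview, here is a rough sketch of that idea using `mutate()` on a small made-up tibble (the `delays` data and the `long_delay` column are assumptions for illustration, not part of `flights`). Storing the condition as a column lets you inspect it before filtering:

```{r}
library(dplyr)

# Hypothetical delays in minutes; one arrival delay is missing
delays <- tibble(
  arr_delay = c(5, 150, NA, 30),
  dep_delay = c(0, 10, 200, 130)
)

# Make the condition an explicit column so it can be checked
flagged <- delays %>%
  mutate(long_delay = arr_delay > 120 | dep_delay > 120)
flagged

# Then filter on the stored result
on_time <- flagged %>% filter(!long_delay)
on_time
```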
|
||||
|
||||
## Summaries
|
||||
|
||||
- Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.
|
||||
When used with numeric functions, `TRUE` is converted to 1 and `FALSE` to 0.
|
||||
This makes `sum()` and `mean()` very useful: `sum(x)` gives the number of `TRUE`s in `x`, and `mean(x)` gives the proportion.
|
||||
|
||||
```{r}
|
||||
# How many flights left before 5am? (these usually indicate delayed
|
||||
# flights from the previous day)
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(n_early = sum(dep_time < 500))
|
||||
|
||||
# What proportion of flights are delayed by more than an hour?
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(hour_prop = mean(arr_delay > 60))
|
||||
```
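The same trick works on any logical vector, so it can help to try it on a few made-up numbers first (the `delay` values below are hypothetical):

```{r}
# Hypothetical delays in minutes
delay <- c(3, 12, 25, 7)

# The comparison gives a logical vector
delay > 10

# sum() counts the TRUEs; mean() gives the proportion of TRUEs
sum(delay > 10)
mean(delay > 10)
```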
|
||||
|
||||
## Basic math
|
||||
|
||||
There are many functions for creating new variables that you can use with `mutate()`.
|
||||
The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output.
|
||||
There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
|
||||
|
||||
- Arithmetic operators: `+`, `-`, `*`, `/`, `^`.
|
||||
These are all vectorised, using the so-called "recycling rules".
|
||||
If one argument is shorter than the other, it will be automatically extended to be the same length.
|
||||
This is most useful when one of the arguments is a single number: `air_time / 60`, `hour * 60 + minute`, etc.
|
||||
|
||||
Arithmetic operators are also useful in conjunction with the aggregate functions you'll learn about later.
|
||||
For example, `x / sum(x)` calculates the proportion of a total, and `y - mean(y)` computes the difference from the mean.
|
||||
|
||||
- Modular arithmetic: `%/%` (integer division) and `%%` (remainder), where `x == y * (x %/% y) + (x %% y)`.
|
||||
Modular arithmetic is a handy tool because it allows you to break integers up into pieces.
|
||||
For example, in the flights dataset, you can compute `hour` and `minute` from `dep_time` with:
|
||||
|
||||
```{r}
|
||||
transmute(flights,
|
||||
dep_time,
|
||||
hour = dep_time %/% 100,
|
||||
minute = dep_time %% 100
|
||||
)
|
||||
```
|
||||
|
||||
- Logs: `log()`, `log2()`, `log10()`.
|
||||
Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
|
||||
They also convert multiplicative relationships to additive.
|
||||
|
||||
All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
|
||||
|
||||
- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
|
||||
If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
|
||||
|
||||
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
|
||||
If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.
|
||||
|
||||
```{r}
|
||||
x
|
||||
cumsum(x)
|
||||
cummean(x)
|
||||
```
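The arithmetic items in the list above can be tried out on small vectors before using them inside `mutate()`. This is only a sketch; the values of `x` and `dep_time` are made up for illustration:

```{r}
x <- c(10, 20, 30)

# Recycling: the single number 10 is extended to match the length of x
x / 10

# Combining arithmetic with aggregates
prop <- x / sum(x)    # proportion of the total
x - mean(x)           # difference from the mean

# Modular arithmetic: split hypothetical HHMM times into hour and minute
dep_time <- c(517, 1545)
hour <- dep_time %/% 100
minute <- dep_time %% 100
hour * 100 + minute == dep_time

# On the log2 scale, each doubling adds 1
log2(c(1, 2, 4, 8))
```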
|
||||
|
||||
## Summaries
|
||||
|
||||
Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:
|
||||
|
||||
- Measures of location: we've used `mean(x)`, but `median(x)` is also useful.
|
||||
The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
group_by(month) %>%
|
||||
summarise(
|
||||
med_arr_delay = median(arr_delay),
|
||||
med_dep_delay = median(dep_delay)
|
||||
)
|
||||
```
|
||||
|
||||
It's sometimes useful to combine aggregation with logical subsetting.
|
||||
We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting).
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(
|
||||
avg_delay1 = mean(arr_delay),
|
||||
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
|
||||
)
|
||||
```
|
||||
|
||||
- Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.
|
||||
The root mean squared deviation, or standard deviation `sd(x)`, is the standard measure of spread.
|
||||
The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
|
||||
|
||||
```{r}
|
||||
# Why is distance to some destinations more variable than to others?
|
||||
not_cancelled %>%
|
||||
group_by(dest) %>%
|
||||
summarise(distance_sd = sd(distance)) %>%
|
||||
arrange(desc(distance_sd))
|
||||
```
|
||||
|
||||
- Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
|
||||
Quantiles are a generalisation of the median.
|
||||
For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
|
||||
|
||||
```{r}
|
||||
# When do the first and last flights leave each day?
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(
|
||||
first = min(dep_time),
|
||||
last = max(dep_time)
|
||||
)
|
||||
```
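To get a feel for how the summaries in the list above differ, it can help to compare them on a tiny vector with one outlier. The values below are made up for illustration:

```{r}
# Hypothetical data with one large outlier
x <- c(1, 2, 3, 4, 100)

# Location: the outlier pulls the mean, but not the median
mean(x)
median(x)

# Spread: sd() is sensitive to the outlier; IQR() and mad() are more robust
sd(x)
IQR(x)
mad(x)

# Rank: quantiles generalise the median
quantile(x, 0.25)

# Aggregation combined with logical subsetting: the average positive value
d <- c(-5, 10, 20, -15)
mean(d[d > 0])
```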
|
||||
|
||||
## Floating point
|
||||
|
||||
There's another common problem you might encounter when using `==`: floating point numbers.
|
||||
These results might surprise you!
|
||||
|
||||
```{r}
|
||||
(sqrt(2) ^ 2) == 2
|
||||
(1 / 49 * 49) == 1
|
||||
```
|
||||
|
||||
Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.
|
||||
Instead of relying on `==`, use `near()`:
|
||||
|
||||
```{r}
|
||||
near(sqrt(2) ^ 2, 2)
|
||||
near(1 / 49 * 49, 1)
|
||||
```
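To see why `==` fails here, you can look at the size of the difference directly. The sketch below uses only base R; the tolerance `tol` is an assumption chosen for illustration, and the idea of comparing within a small tolerance is the same one that `near()` relies on:

```{r}
# The two sides differ by a tiny amount, so `==` returns FALSE
sqrt(2)^2 - 2

# Compare within a small tolerance instead of exactly
tol <- .Machine$double.eps^0.5
abs(sqrt(2)^2 - 2) < tol
abs(1 / 49 * 49 - 1) < tol

# Base R's all.equal() also compares with a tolerance
isTRUE(all.equal(sqrt(2)^2, 2))
```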
|
||||
|
||||
## Exercises
|
||||
|
||||
1. How could you use `arrange()` to sort all missing values to the start?
|
||||
(Hint: use `!is.na()`).
|
||||
|
|
|
@ -1,3 +1,70 @@
|
|||
# Missing values {#missing-values}
|
||||
|
||||
## Introduction
|
||||
|
||||
## Basics
|
||||
|
||||
### Missing values {#missing-values-filter}
|
||||
|
||||
One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
|
||||
`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
|
||||
|
||||
```{r}
|
||||
NA > 5
|
||||
10 == NA
|
||||
NA + 10
|
||||
NA / 2
|
||||
```
|
||||
|
||||
The most confusing result is this one:
|
||||
|
||||
```{r}
|
||||
NA == NA
|
||||
```
|
||||
|
||||
It's easiest to understand why this is the case with a bit more context:
|
||||
|
||||
```{r}
|
||||
# Let x be Mary's age. We don't know how old she is.
|
||||
x <- NA
|
||||
|
||||
# Let y be John's age. We don't know how old he is.
|
||||
y <- NA
|
||||
|
||||
# Are John and Mary the same age?
|
||||
x == y
|
||||
# We don't know!
|
||||
```
|
||||
|
||||
If you want to determine if a value is missing, use `is.na()`:
|
||||
|
||||
```{r}
|
||||
is.na(x)
|
||||
```
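`is.na()` is vectorised, so it also works on a whole vector at once. A small sketch with a made-up vector shows why `== NA` is not a useful test and how `is.na()` helps you count or skip missing values:

```{r}
# A hypothetical vector with one missing value
x <- c(1, NA, 3)

# Comparing with NA is always unknown
x == NA

# is.na() tells you which values are missing
is.na(x)

# Count the missing values
sum(is.na(x))

# Summaries are contagious too, unless missing values are removed first
mean(x)
mean(x, na.rm = TRUE)
```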
|
||||
|
||||
## dplyr verbs
|
||||
|
||||
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values.
|
||||
If you want to preserve missing values, ask for them explicitly:
|
||||
|
||||
```{r}
|
||||
df <- tibble(x = c(1, NA, 3))
|
||||
filter(df, x > 1)
|
||||
filter(df, is.na(x) | x > 1)
|
||||
```
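It can help to look at the logical vectors that `filter()` receives in each case. This sketch evaluates the two conditions on a plain vector with the same values as `df$x`:

```{r}
x <- c(1, NA, 3)

# The missing value gives NA, which filter() treats like FALSE
x > 1

# `TRUE | NA` is TRUE, so the missing row is now kept
is.na(x) | x > 1
```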
|
||||
|
||||
Missing values are always sorted at the end:
|
||||
|
||||
```{r}
|
||||
df <- tibble(x = c(5, 2, NA))
|
||||
arrange(df, x)
|
||||
arrange(df, desc(x))
|
||||
```
|
||||
|
||||
## Exercises
|
||||
|
||||
1. Why is `NA ^ 0` not missing?
|
||||
Why is `NA | TRUE` not missing?
|
||||
Why is `FALSE & NA` not missing?
|
||||
Can you figure out the general rule?
|
||||
(`NA * 0` is a tricky counterexample!)
|
||||
|
|
|
@ -1,3 +1,96 @@
|
|||
# Vector tools
|
||||
|
||||
## Introduction
|
||||
|
||||
`%in%`
|
||||
|
||||
## Counts
|
||||
|
||||
- Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
|
||||
To count the number of non-missing values, use `sum(!is.na(x))`.
|
||||
To count the number of distinct (unique) values, use `n_distinct(x)`.
|
||||
|
||||
```{r}
|
||||
# Which destinations have the most carriers?
|
||||
not_cancelled %>%
|
||||
group_by(dest) %>%
|
||||
summarise(carriers = n_distinct(carrier)) %>%
|
||||
arrange(desc(carriers))
|
||||
```
|
||||
|
||||
Counts are so useful that dplyr provides a simple helper if all you want is a count:
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
count(dest)
|
||||
```
|
||||
|
||||
Just like with `group_by()`, you can also provide multiple variables to `count()`.
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
count(carrier, dest)
|
||||
```
|
||||
|
||||
You can optionally provide a weight variable.
|
||||
For example, you could use this to "count" (sum) the total number of miles a plane flew:
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
count(tailnum, wt = distance)
|
||||
```
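The counting helpers above can be compared side by side on a tiny data frame. This is a sketch on made-up data (the `trips` tibble and its columns are assumptions for illustration); it shows that `count()` gives the same numbers as grouping and summarising with `n()`, and that `wt` sums a variable instead of counting rows:

```{r}
library(dplyr)

# Hypothetical trips: two planes, two destinations
trips <- tibble(
  plane    = c("A", "A", "B"),
  dest     = c("X", "Y", "X"),
  distance = c(100, 250, 100)
)

# count() is shorthand for group_by() + summarise(n = n())
by_count <- trips %>% count(plane)
by_summarise <- trips %>% group_by(plane) %>% summarise(n = n())
by_count

# With a weight, count() sums the weight variable within each group
miles <- trips %>% count(plane, wt = distance)
miles

# Distinct and non-missing counts
totals <- trips %>%
  summarise(
    n_dest = n_distinct(dest),
    n_known_distance = sum(!is.na(distance))
  )
totals
```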
|
||||
|
||||
## Window functions
|
||||
|
||||
- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
|
||||
This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
|
||||
They are most useful in conjunction with `group_by()`, which you'll learn about shortly.
|
||||
|
||||
```{r}
|
||||
(x <- 1:10)
|
||||
lag(x)
|
||||
lead(x)
|
||||
```
|
||||
|
||||
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
|
||||
It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
|
||||
The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.
|
||||
|
||||
```{r}
|
||||
y <- c(1, 2, 2, NA, 3, 4)
|
||||
min_rank(y)
|
||||
min_rank(desc(y))
|
||||
```
|
||||
|
||||
If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
|
||||
See their help pages for more details.
|
||||
|
||||
```{r}
|
||||
row_number(y)
|
||||
dense_rank(y)
|
||||
percent_rank(y)
|
||||
cume_dist(y)
|
||||
```
|
||||
|
||||
- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
|
||||
These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
|
||||
For example, we can find the first and last departure for each day:
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
summarise(
|
||||
first_dep = first(dep_time),
|
||||
last_dep = last(dep_time)
|
||||
)
|
||||
```
|
||||
|
||||
These functions are complementary to filtering on ranks.
|
||||
Filtering gives you all variables, with each observation in a separate row:
|
||||
|
||||
```{r}
|
||||
not_cancelled %>%
|
||||
group_by(year, month, day) %>%
|
||||
mutate(r = min_rank(desc(dep_time))) %>%
|
||||
filter(r %in% range(r))
|
||||
```
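The window functions in the list above can also be explored on a plain vector before using them with grouped data. The values of `x` below are made up for illustration, and `dplyr::lag()` is written out in full to avoid confusion with the base R function of the same name:

```{r}
library(dplyr)

x <- c(3, 3, 5, 8, 8)

# Offsets: running differences and change points
# (the first element has no previous value, so the result starts with NA)
x - dplyr::lag(x)
x != dplyr::lag(x)

# Ranking: ties share the smallest rank; dense_rank() leaves no gaps
min_rank(x)
dense_rank(x)

# Position: first(), last(), and nth() with a fallback for a missing position
first(x)
last(x)
nth(x, 10, default = 0)
```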
|
||||
|
|