From a73755838f461201bb495ca45a7a8a856d8e3100 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Thu, 17 Mar 2022 14:15:24 -0500 Subject: [PATCH] More on logical + numbers --- logicals.Rmd | 251 +++++++++++++++++++++++++++++++++------------------ numbers.Rmd | 7 +- 2 files changed, 167 insertions(+), 91 deletions(-) diff --git a/logicals.Rmd b/logicals.Rmd index 564c4a8..3134e9d 100644 --- a/logicals.Rmd +++ b/logicals.Rmd @@ -1,4 +1,4 @@ -# Logicals and numbers {#logicals-numbers} +# Logicals and numbers {#logicals} ```{r, results = "asis", echo = FALSE} status("drafting") @@ -7,7 +7,8 @@ status("drafting") ## Introduction In this chapter, you'll learn useful tools for working with logical vectors. -The elements in a logical vector can have one of three possible values: `TRUE`, `FALSE`, and `NA`. +Logical vectors are the simplest type of vector because each element can only be one of three possible values: `TRUE`, `FALSE`, and `NA`. +Despite that simplicity, they're an extremely powerful tool. ### Prerequisites @@ -18,44 +19,93 @@ library(nycflights13) ## Comparisons -Some times you'll get data that already includes logical vectors but in most cases you'll create them by using a comparison. +Some times you'll get data that already includes logical vectors but in most cases you'll create them by using a comparison, like `<`, `<=`, `>`, `>=`, `!=`, and `==`. -`<`, `<=`, `>`, `>=`, `!=`, and `==`. -If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected. +### In `mutate()` -A useful shortcut is `between(x, low, high)` which is a bit less typing than `x >= low & x <= high)`. -If you want an exclusive between or left-open right-closed etc, you'll need to write by hand. 
+So far, you've mostly created these new variables implicitly within `filter()`: + +```{r} +flights |> + filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20) +``` + +But it's useful to know that this is a shortcut and you can explicitly perform these operations inside a `mutate()`: + +```{r} +flights |> + mutate( + daytime = dep_time > 600 & dep_time < 2000, + approx_ontime = abs(arr_delay) < 20, + .keep = "used" + ) +``` + +So the filter above could also be written as: + +```{r} +flights |> + mutate( + daytime = dep_time > 600 & dep_time < 2000, + approx_ontime = abs(arr_delay) < 20, + ) |> + filter(daytime & approx_ontime) +``` + +This is an important technique when you're doing complicated subsetting because it allows you to double-check the intermediate steps. + +### Floating point comparison Beware when using `==` with numbers as results might surprise you! +You might think that the following two computations yield 1 and 2: + +```{r} +(1 / 49 * 49) +sqrt(2) ^ 2 +``` + +But if you test them for equality, you'll discover that they're not what you expect! ```{r} -(sqrt(2) ^ 2) == 2 (1 / 49 * 49) == 1 +(sqrt(2) ^ 2) == 2 ``` -Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation. +That's because computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so in most cases, the number you see is actually an approximation. +R usually rounds these numbers to avoid displaying a bunch of usually unimportant digits. 
+You can use the `digits` argument to `format()` to force R to display more: ```{r} -(sqrt(2) ^ 2) - 2 -(1 / 49 * 49) - 1 +format(1 / 49 * 49, digits = 20) +format(sqrt(2) ^ 2, digits = 20) ``` -So instead of relying on `==`, use `near()`, which does the comparison with a small amount of tolerance: +Instead of relying on `==`, you can use `dplyr::near()`, which does the comparison with a small amount of tolerance: ```{r} near(sqrt(2) ^ 2, 2) near(1 / 49 * 49, 1) ``` -Alternatively, you might want to use `round()` to trim off extra digits. +### `is.na()` + +Another common way to create a logical vector is with `is.na()`. +This is particularly important in conjunction with `filter()` because `filter()` only selects rows where the value is `TRUE`; rows where the value is `FALSE` or `NA` are automatically dropped. + +```{r} +flights |> filter(is.na(dep_delay) | is.na(arr_delay)) +flights |> filter(is.na(dep_delay) != is.na(arr_delay)) +``` ## Boolean algebra -For other types of combinations, you'll need to use Boolean operators yourself: `|` is "or" and `!` is "not". -Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations. +Once you have multiple logical vectors, you can combine them together using Boolean algebra: `&` is "and", `|` is "or", and `!` is "not". +`xor()` provides one final useful operation: exclusive or. +Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations and how they work. ```{r bool-ops} #| echo: false +#| out.width: NULL #| fig.cap: > #| Complete set of boolean operations. `x` is the left-hand #| circle, `y` is the right-hand circle, and the shaded region show @@ -70,71 +120,122 @@ Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations. knitr::include_graphics("diagrams/transform-logical.png") ``` +As well as `&` and `|`, R also has `&&` and `||`. +Don't use them in dplyr functions! +These are called short-circuiting operators and only ever return a single `TRUE` or `FALSE`. 
+They're important for programming so you'll learn more about them in Section \@ref(conditional-execution). + The following code finds all flights that departed in November or December: ```{r, eval = FALSE} -flights |> filter(month == 11 | month == 12) +flights |> + filter(month == 11 | month == 12) ``` Note that the order of operations doesn't work like English. -You can't write `filter(flights, month == 11 | 12)`, which you might read as "find all flights that departed in November or December". -Instead it does something rather confusing. -First it evaluates `11 | 12` which is equivalent to `TRUE | TRUE`, which returns `TRUE`. +You can't think "find all flights that departed in November or December" and write `flights |> filter(month == 11 | 12)`. +This code will not error, but it will do something rather confusing. +First R evaluates `11 | 12` which is equivalent to `TRUE | TRUE`, which returns `TRUE`. Then it evaluates `month == TRUE`. -Since month is numeric, this is equivalent to `month == 1`, so that expression finds all flights in January! +Since month is numeric, this is equivalent to `month == 1`, so `flights |> filter(month == 11 | 12)` returns all flights in January! -An easy way to solve this problem is to use `%in%`. +### `%in%` + +An easy way to avoid this issue is to use `%in%`. `x %in% y` returns a logical vector the same length as `x` that is `TRUE` whenever a value in `x` is anywhere in `y` . -So we could use it to rewrite the code above: +So we could instead write: ```{r, eval = FALSE} -nov_dec <- flights |> filter(month %in% c(11, 12)) +flights |> + filter(month %in% c(11, 12)) ``` Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`. 
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters: ```{r, eval = FALSE} -flights |> filter(!(arr_delay > 120 | dep_delay > 120)) -flights |> filter(arr_delay <= 120, dep_delay <= 120) -``` - -As well as `&` and `|`, R also has `&&` and `||`. -Don't use them in dplyr functions! -These are called short-circuiting operators and you'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution. - -## Missing values {#logical-missing} - -`filter()` only selects rows where the logical expression is `TRUE`; it doesn't select rows where it's missing or `FALSE`. -If you want to find rows containing missing values, you'll need to convert missingness into a logical vector using `is.na()`. - -```{r} -flights |> filter(is.na(dep_delay) | is.na(arr_delay)) -flights |> filter(is.na(dep_delay) != is.na(arr_delay)) -``` - -## In mutate() - -Whenever you start using complicated, multi-part expressions in `filter()`, consider making them explicit variables instead. -That makes it much easier to check your work.When checking your work, a particularly useful `mutate()` argument is `.keep = "used"`: this will just show you the variables you've used, along with the variables that you created. -This makes it easy to see the variables involved side-by-side. 
- ```{r} flights |> - mutate(is_cancelled = is.na(dep_delay) | is.na(arr_delay), .keep = "used") |> - filter(is_cancelled) + filter(!(arr_delay > 120 | dep_delay > 120)) +flights |> + filter(arr_delay <= 120 & dep_delay <= 120) ``` -## Cumulative functions +### Missing values {#logical-missing} + +The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance: + +```{r} +NA & c(TRUE, FALSE, NA) +NA | c(TRUE, FALSE, NA) +``` + + + +To understand what's going on you need to think about `x | TRUE`, because regardless of whether `x` is `TRUE` or `FALSE` the result is still `TRUE`. +That means even if you don't know what `x` is (i.e. it's missing), the result must still be `TRUE`. + +## Summaries + +There are four particularly useful summary functions for logical vectors: they all take a vector of logical values and return a single value, making them a good fit for use in `summarise()`. + +`any()` and `all()` --- `any()` will return `TRUE` if there's at least one `TRUE`, `all()` will return `TRUE` if all values are `TRUE`. +Like all summary functions, they can return `NA` if there are any missing values present, and as usual you can make the missing values go away with `na.rm = TRUE`. +We could use this to see if there were any days where every flight was delayed: + +```{r} +not_cancelled <- flights |> filter(!is.na(dep_delay), !is.na(arr_delay)) + +not_cancelled |> + group_by(year, month, day) |> + filter(all(arr_delay >= 0)) +``` + +`sum()` and `mean()` are particularly useful with logical vectors because when you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0. +That means that `sum(x)` gives the number of `TRUE`s in `x` and `mean(x)` gives the proportion of `TRUE`s. 
+That lets us find the day with the highest proportion of delayed flights: + +```{r} +not_cancelled |> + group_by(year, month, day) |> + summarise(prop_delayed = mean(arr_delay > 0)) |> + arrange(desc(prop_delayed)) + +``` + +Or we could ask how many flights left before 5am, which usually are flights that were delayed from the previous day: + +```{r} +not_cancelled |> + group_by(year, month, day) |> + summarise(n_early = sum(dep_time < 500)) |> + arrange(desc(n_early)) +``` + +### Exercises + +1. For each plane, count the number of flights before the first delay of greater than 1 hour. +2. What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to? + +## Transformations + +### Cumulative functions Another useful pair of functions are cumulative any, `cumany()`, and cumulative all, `cumall()`. `cumany()` will be `TRUE` after it encounters the first `TRUE`, and `cumall()` will be `FALSE` after it encounters its first `FALSE`. -These are particularly useful in conjunction with `filter()` because they allow you to select: -- `cumall(x)`: all cases until the first `FALSE`. -- `cumall(!x)`: all cases until the first `TRUE`. -- `cumany(x)`: all cases after the first `TRUE`. -- `cumany(!x)`: all cases after the first `FALSE`. +```{r} +cumany(c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)) +cumall(c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)) +``` + +These are particularly useful in conjunction with `filter()` because they allow you to select rows: + +- Before the first `FALSE` with `cumall(x)`. +- Before the first `TRUE` with `cumall(!x)`. +- After the first `TRUE` with `cumany(x)`. +- After the first `FALSE` with `cumany(!x)`. 
+ +If you imagine some data about a bank balance, then these functions allow you t ```{r} df <- data.frame( @@ -147,11 +248,11 @@ df |> filter(cumany(balance < 0)) df |> filter(cumall(!(balance < 0))) ``` -## Conditional outputs +### Conditional outputs -If you want to use one value when a condition is true and another value when it's `FALSE`, you can use `if_else()`[^logicals-numbers-1]. +If you want to use one value when a condition is true and another value when it's `FALSE`, you can use `if_else()`[^logicals-1]. -[^logicals-numbers-1]: This is equivalent to the base R function `ifelse`. +[^logicals-1]: This is equivalent to the base R function `ifelse`. There are two main advantages of `if_else()`over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable. ```{r} @@ -206,36 +307,6 @@ case_when( ) ``` -## Summaries - -When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0, and when you use a numeric vector in a logical context, 0 becomes `FALSE` and everything else becomes `TRUE`. - -There are four particularly useful summary functions for logical vectors: they all take a vector of logical values and return a single value, making them a good fit for use in `summarise()`. - -`any()` and `all()` --- `any()` will return if there's at least one `TRUE`, `all()` will return `TRUE` if all values are `TRUE`. -Like all summary functions, they'll return `NA` if there are any missing values present, and like usual you can make the missing values go away with `na.rm = TRUE`. - -`sum()` and `mean()` are particularly useful with logical vectors because `TRUE` is converted to 1 and `FALSE` to 0. 
-This means that `sum(x)` gives the number of `TRUE`s in `x` and `mean(x)` gives the proportion of `TRUE`s: - -```{r} -not_cancelled <- flights |> filter(!is.na(dep_delay), !is.na(arr_delay)) - -# How many flights left before 5am? (these usually indicate delayed -# flights from the previous day) -not_cancelled |> - group_by(year, month, day) |> - summarise(n_early = sum(dep_time < 500)) - -# What proportion of flights are delayed by more than an hour? -not_cancelled |> - group_by(year, month, day) |> - summarise(hour_prop = mean(arr_delay > 60)) -``` - -### Exercises - -1. For each plane, count the number of flights before the first delay of greater than 1 hour. -2. What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to? +## ## diff --git a/numbers.Rmd b/numbers.Rmd index 7d8a913..92c3581 100644 --- a/numbers.Rmd +++ b/numbers.Rmd @@ -1,4 +1,4 @@ -# Numbers {#logicals-numbers} +# Numbers {#numbers} ```{r, results = "asis", echo = FALSE} status("drafting") @@ -19,6 +19,11 @@ library(nycflights13) Doesn't quite belong here, but it's really important (and it makes numbers) so I wanted to discuss it first. +```{r} +not_cancelled <- flights |> + filter(!is.na(dep_time)) +``` + - Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use `sum(!is.na(x))`. To count the number of distinct (unique) values, use `n_distinct(x)`.