# Logicals and numbers {#logicals}
```{r, results = "asis", echo = FALSE}
status("drafting")
```
## Introduction
In this chapter, you'll learn useful tools for working with logical vectors.
Logical vectors are the simplest type of vector because each element can only be one of three possible values: `TRUE`, `FALSE`, and `NA`.
Despite that simplicity, they're an extremely powerful tool.
### Prerequisites
```{r, message = FALSE}
library(tidyverse)
library(nycflights13)
```
## Comparisons
Sometimes you'll get data that already includes logical vectors, but in most cases you'll create them by using a comparison: `<`, `<=`, `>`, `>=`, `!=`, or `==`.
### In `mutate()`
So far, you've mostly created logical variables implicitly within `filter()`:
```{r}
flights |>
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
```
But it's useful to know that this is a shortcut: you can explicitly perform these operations inside a `mutate()`:
```{r}
flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
    .keep = "used"
  )
```
So the filter above could also be written as:
```{r}
flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) |>
  filter(daytime & approx_ontime)
```
This is an important technique when you're doing complicated subsetting because it allows you to double-check the intermediate steps.
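As a minimal sketch of that double-checking, you can count the combinations of the intermediate columns before filtering. The tiny `toy` data frame below is made up for illustration (it is not part of nycflights13), so the counts mean nothing beyond showing the pattern:

```{r}
library(dplyr)

# Hypothetical data: four made-up flights, one with a missing departure time
toy <- data.frame(
  dep_time  = c(530, 900, 2130, NA),
  arr_delay = c(5, -30, 10, 12)
)

checked <- toy |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20
  )

# One row per combination of the two logical columns, including NA
checked |> count(daytime, approx_ontime)
```

If a combination shows up that you didn't expect, such as a missing `daytime`, you know which intermediate step to inspect.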
### Floating point comparison
Beware when using `==` with numbers as results might surprise you!
You might think that the following two computations yield 1 and 2:
```{r}
(1 / 49 * 49)
sqrt(2) ^ 2
```
But if you test them for equality, you'll discover that they're not what you expect!
```{r}
(1 / 49 * 49) == 1
(sqrt(2) ^ 2) == 2
```
That's because computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so in most cases, the number you see is actually an approximation.
R usually rounds these numbers to avoid displaying a bunch of usually unimportant digits.
You can use the `digits` argument to `format()` to force R to display more:
```{r}
format(1 / 49 * 49, digits = 20)
format(sqrt(2) ^ 2, digits = 20)
```
Instead of relying on `==`, you can use `dplyr::near()`, which does the comparison with a small amount of tolerance:
```{r}
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```
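To see what a tolerance-based comparison involves, here is a minimal base R sketch. The helper `near_sketch()` is a hypothetical name used only for illustration; its default tolerance (the square root of machine epsilon) matches the default that `near()` documents, but treat the function as a sketch rather than as dplyr's implementation:

```{r}
# Hypothetical helper: two numbers count as "near" if they differ by less than tol
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

sqrt(2) ^ 2 == 2             # exact comparison
near_sketch(sqrt(2) ^ 2, 2)  # comparison with tolerance
near_sketch(1 / 49 * 49, 1)
near_sketch(1, 1.001)        # a real difference is still detected
```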
### `is.na()`
Another common way to create a logical vector is with `is.na()`.
This is particularly important in conjunction with `filter()` because `filter()` only selects rows where the value is `TRUE`; rows where the value is `FALSE` or `NA` are automatically dropped.
```{r}
flights |> filter(is.na(dep_delay) | is.na(arr_delay))
flights |> filter(is.na(dep_delay) != is.na(arr_delay))
```
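The reason `is.na()` is needed is that comparing against `NA` with `==` never gives `TRUE` or `FALSE`. A small base R sketch with a made-up vector `x`:

```{r}
x <- c(10, NA, 30)

x == NA       # every element is NA: "is x equal to an unknown value?" is unknown
is.na(x)      # TRUE only where x is missing
x[!is.na(x)]  # keep only the values that are present
```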
## Boolean algebra
Once you have multiple logical vectors, you can combine them together using Boolean algebra: `&` is "and", `|` is "or", and `!` is "not".
`xor()` provides one final useful operation: exclusive or.
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations and how they work.
```{r bool-ops}
#| echo: false
#| out.width: NULL
#| fig.cap: >
#|   Complete set of Boolean operations. `x` is the left-hand
#|   circle, `y` is the right-hand circle, and the shaded regions show
#|   which parts each operator selects.
#| fig.alt: >
#|   Six Venn diagrams, each explaining a given logical operator. The
#|   circles (sets) in each of the Venn diagrams represent x and y. y &
#|   !x is y but none of x, x & y is the intersection of x and y, x & !y is
#|   x but none of y, x is all of x none of y, xor(x, y) is everything
#|   except the intersection of x and y, y is all of y none of x, and
#|   x | y is everything.
knitr::include_graphics("diagrams/transform-logical.png")
```
As well as `&` and `|`, R also has `&&` and `||`.
Don't use them in dplyr functions!
These are called short-circuiting operators and only ever return a single `TRUE` or `FALSE`.
They're important for programming so you'll learn more about them in Section \@ref(conditional-execution).
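A small base R sketch of the difference, using made-up vectors. The `stop()` calls are only there to show that the right-hand side of `&&` or `||` is never evaluated once the left-hand side settles the answer:

```{r}
x <- c(TRUE, TRUE, FALSE)
y <- c(TRUE, FALSE, FALSE)

# Vectorised: one result per element, which is what dplyr verbs need
x & y
x | y

# Short-circuiting: a single result from single inputs
TRUE && FALSE
FALSE && stop("this is never evaluated")
TRUE || stop("this is never evaluated")
```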
The following code finds all flights that departed in November or December:
```{r, eval = FALSE}
flights |>
  filter(month == 11 | month == 12)
```
Note that the order of operations doesn't work like English.
You can't think "find all flights that departed in November or December" and write `flights |> filter(month == 11 | 12)`.
This code will not error, but it will do something rather confusing.
Because `==` is evaluated before `|`, R first evaluates `month == 11`, which creates a logical vector.
Then it combines that vector with `12` using `|`.
When you use a number with a logical operator, everything apart from 0 is converted to `TRUE`, so this is equivalent to `month == 11 | TRUE`, which is always `TRUE`.
That means `flights |> filter(month == 11 | 12)` returns every flight, not just those in November and December!
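You can check how R reads the expression on a small made-up vector. This sketch relies only on base R's operator precedence, in which `==` is evaluated before `|`:

```{r}
month <- c(1, 6, 11, 12)

month == 11 | 12           # parsed as (month == 11) | 12
(month == 11) | 12         # same result: 12 is treated as TRUE
month == 11 | month == 12  # what was actually intended
```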
### `%in%`
An easy way to avoid this issue is to use `%in%`.
`x %in% y` returns a logical vector the same length as `x` that is `TRUE` whenever a value in `x` is anywhere in `y`.
So we could instead write:
```{r, eval = FALSE}
flights |>
  filter(month %in% c(11, 12))
```
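A quick base R sketch of how `%in%` behaves on a made-up vector, compared with `==`, which recycles its right-hand side instead of testing membership:

```{r}
month <- c(1, 11, 12, 11)

month %in% c(11, 12)  # is each month one of 11 or 12?
month == c(11, 12)    # compares element by element, recycling c(11, 12)
```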
Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
```{r, eval = FALSE}
flights |>
  filter(!(arr_delay > 120 | dep_delay > 120))
flights |>
  filter(arr_delay <= 120 & dep_delay <= 120)
```
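De Morgan's laws can be checked exhaustively, since there are only nine combinations of `TRUE`, `FALSE`, and `NA` for two values. A minimal base R sketch (`vals` and `grid` are illustrative names):

```{r}
vals <- c(TRUE, FALSE, NA)
grid <- expand.grid(x = vals, y = vals)

# identical() treats NA as equal to NA, so missing values are compared too
identical(!(grid$x & grid$y), !grid$x | !grid$y)
identical(!(grid$x | grid$y), !grid$x & !grid$y)
```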
### Missing values {#logical-missing}
The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:
```{r}
NA & c(TRUE, FALSE, NA)
NA | c(TRUE, FALSE, NA)
```
<!-- Draw truth tables? -->
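To understand what's going on, think about `x | TRUE`: regardless of whether `x` is `TRUE` or `FALSE`, the result is still `TRUE`.
That means even if you don't know what `x` is (i.e. it's missing), the result must still be `TRUE`.
Similar reasoning applies to `NA & FALSE`, which must be `FALSE`, whereas `NA | FALSE` and `NA & TRUE` depend on the unknown value and so are `NA`.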
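One way to see all the cases at once is to build truth tables. This sketch uses base R's `outer()` to apply `|` and `&` to every pair of values; the names `vals`, `or_table`, and `and_table` are just for illustration:

```{r}
vals <- c(TRUE, FALSE, NA)
names(vals) <- c("TRUE", "FALSE", "NA")

or_table  <- outer(vals, vals, "|")  # rows are x, columns are y, cells are x | y
and_table <- outer(vals, vals, "&")  # cells are x & y

or_table
and_table
```

In each table, a cell is `NA` only where knowing the missing value would change the result.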
## Summaries
There are four particularly useful summary functions for logical vectors: they all take a vector of logical values and return a single value, making them a good fit for use in `summarise()`.
The first two are `any()` and `all()`: `any()` returns `TRUE` if there's at least one `TRUE`, and `all()` returns `TRUE` if all values are `TRUE`.
Like most summary functions, they can return `NA` when missing values are present (specifically, when the missing values could change the answer), and as usual you can make the missing values go away with `na.rm = TRUE`.
We could use this to see if there were any days where every flight was delayed:
```{r}
not_cancelled <- flights |> filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled |>
  group_by(year, month, day) |>
  filter(all(arr_delay >= 0))
```
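Missing values in `any()` and `all()` follow the same logic as `|` and `&`: the result is `NA` only when the missing value could change the answer. A small base R sketch:

```{r}
any(c(TRUE, NA))    # TRUE: one TRUE is enough, whatever the NA is
any(c(FALSE, NA))   # NA: the answer depends on the missing value
all(c(FALSE, NA))   # FALSE: one FALSE is enough
all(c(TRUE, NA))    # NA

# na.rm = TRUE drops the missing values before summarising
any(c(FALSE, NA), na.rm = TRUE)
all(c(TRUE, NA), na.rm = TRUE)
```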
`sum()` and `mean()` are particularly useful with logical vectors because when you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
That means that `sum(x)` gives the number of `TRUE`s in `x` and `mean(x)` gives the proportion of `TRUE`s.
That lets us find the day with the highest proportion of delayed flights:
```{r}
not_cancelled |>
  group_by(year, month, day) |>
  summarise(prop_delayed = mean(arr_delay > 0)) |>
  arrange(desc(prop_delayed))
```
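The same arithmetic on a small made-up vector, in base R (`delayed` is an illustrative name):

```{r}
delayed <- c(TRUE, FALSE, TRUE, TRUE, NA)

sum(delayed, na.rm = TRUE)   # number of TRUEs among the known values
mean(delayed, na.rm = TRUE)  # proportion of TRUEs among the known values
sum(is.na(delayed))          # is.na() is logical too, so this counts missing values
```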
Or we could ask how many flights left before 5am, which are usually flights that were delayed from the previous day:
```{r}
not_cancelled |>
  group_by(year, month, day) |>
  summarise(n_early = sum(dep_time < 500)) |>
  arrange(desc(n_early))
```
### Exercises
1. For each plane, count the number of flights before the first delay of greater than 1 hour.
2. What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to?
## Transformations
### Cumulative functions
Another useful pair of functions are cumulative any, `cumany()`, and cumulative all, `cumall()`.
`cumany()` will be `TRUE` after it encounters the first `TRUE`, and `cumall()` will be `FALSE` after it encounters its first `FALSE`.
```{r}
cumany(c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE))
cumall(c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
```
These are particularly useful in conjunction with `filter()` because they allow you to select rows:
- Before the first `FALSE` with `cumall(x)`.
- Before the first `TRUE` with `cumall(!x)`.
- From the first `TRUE` onwards with `cumany(x)`.
- From the first `FALSE` onwards with `cumany(!x)`.
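The four patterns can be tried on a small made-up logical vector. This sketch assumes dplyr is installed; `x` and `values` are illustrative names:

```{r}
library(dplyr)

x <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
values <- seq_along(x)

values[cumall(x)]    # positions before the first FALSE
values[cumall(!x)]   # positions before the first TRUE (none here, x starts with TRUE)
values[cumany(x)]    # positions from the first TRUE onwards
values[cumany(!x)]   # positions from the first FALSE onwards
```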
If you imagine some data about a bank balance, then these functions allow you to find the rows from the first overdraft onwards, or the rows before the first overdraft:
```{r}
df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)
# all rows after first overdraft
df |> filter(cumany(balance < 0))
# all rows until first overdraft
df |> filter(cumall(!(balance < 0)))
```
### Conditional outputs
If you want to use one value when a condition is `TRUE` and another value when it's `FALSE`, you can use `if_else()`[^logicals-1].

[^logicals-1]: This is equivalent to the base R function `ifelse()`.
    There are two main advantages of `if_else()` over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable.

```{r}
df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)
df |> mutate(status = if_else(balance < 0, "overdraft", "ok"))
```
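A sketch of how `if_else()` lets you choose what happens to missing values, using its `missing` argument on a made-up vector that contains an `NA`:

```{r}
library(dplyr)

balance <- c(100, -25, NA)

if_else(balance < 0, "overdraft", "ok")                       # NA stays NA
if_else(balance < 0, "overdraft", "ok", missing = "unknown")  # NA gets its own label
```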
If you start to nest multiple `if_else()` calls, I'd suggest switching to `case_when()` instead.
`case_when()` has a special syntax: it takes pairs that look like `condition ~ output`.
`condition` must evaluate to a logical vector; when it's `TRUE`, `output` will be used.
```{r}
df |>
  mutate(
    status = case_when(
      balance == 0 ~ "no money",
      balance < 0  ~ "overdraft",
      balance > 0  ~ "ok"
    )
  )
```
(Note that I usually add spaces to make the outputs line up so it's easier to scan.)
If none of the cases match, the output will be missing:
```{r}
x <- 1:10
case_when(
  x %% 2 == 0 ~ "even",
)
```
You can create a catch-all value by using `TRUE` as the condition:
```{r}
case_when(
  x %% 2 == 0 ~ "even",
  TRUE        ~ "odd"
)
```
If multiple conditions are `TRUE`, the first is used:
```{r}
case_when(
  x < 5 ~ "< 5",
  x < 3 ~ "< 3",
)
```
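Because the first matching condition wins, the order matters: put the most specific condition first. A sketch with the same values of `x`, with the conditions reordered and a catch-all added (`labels` is an illustrative name):

```{r}
library(dplyr)

x <- 1:10
labels <- case_when(
  x < 3 ~ "< 3",
  x < 5 ~ "< 5",
  TRUE  ~ ">= 5"
)
labels
```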