# Logicals and numbers {#logicals}
```{r, results = "asis", echo = FALSE}
status("drafting")
```
## Introduction
In this chapter, you'll learn useful tools for working with logical vectors.
Logical vectors are the simplest type of vector because each element can only be one of three possible values: `TRUE`, `FALSE`, and `NA`.
Despite that simplicity, they're an extremely powerful tool.
### Prerequisites
```{r, message = FALSE}
library(tidyverse)
library(nycflights13)
```
## Comparisons
Sometimes you'll get data that already includes logical vectors, but in most cases you'll create them with a comparison operator: `<`, `<=`, `>`, `>=`, `!=`, or `==`.
### In `mutate()`
So far, you've mostly created these new variables implicitly within `filter()`:
```{r}
flights |>
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
```
But it's useful to know that this is a shortcut: you can explicitly create the underlying logical variables inside a `mutate()`:
```{r}
flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
    .keep = "used"
  )
```
So the filter above could also be written as:
```{r}
flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20
  ) |>
  filter(daytime & approx_ontime)
```
This is an important technique when you're doing complicated subsetting because it allows you to double-check the intermediate steps.
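As a minimal sketch of that double-checking step, the chunk below uses a small made-up tibble (`toy_flights` and its values are hypothetical, not taken from `flights`) and counts how many rows fall into each combination of the intermediate variables before filtering on them.

```{r}
library(dplyr)

# Hypothetical data: column names mirror flights, the values are made up
toy_flights <- tibble(
  dep_time  = c(530, 900, 1500, 2130),
  arr_delay = c(5, -12, 45, 3)
)

checked <- toy_flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20
  )

# Inspect the intermediate logical columns before filtering on them
checked |> count(daytime, approx_ontime)

# Only rows where both intermediate variables are TRUE survive
kept <- checked |> filter(daytime & approx_ontime)
kept
```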
### Floating point comparison
Beware when using `==` with numbers as results might surprise you!
You might think that the following two computations yield 1 and 2:
```{r}
(1 / 49 * 49)
sqrt(2) ^ 2
```
But if you test them for equality, you'll discover that they're not what you expect!
```{r}
(1 / 49 * 49) == 1
(sqrt(2) ^ 2) == 2
```
That's because computers use finite precision arithmetic (they obviously can't store an infinite number of digits!), so in most cases the number you see is actually an approximation.
R usually rounds these numbers to avoid displaying a bunch of usually unimportant digits.
You can use the `digits` argument to `format()` to force R to display more:
```{r}
format(1 / 49 * 49, digits = 20)
format(sqrt(2) ^ 2, digits = 20)
```
Instead of relying on `==`, you can use `dplyr::near()`, which does the comparison with a small amount of tolerance:
```{r}
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```
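To get a feel for how much slack `near()` allows, here is a small sketch on a made-up vector `x`. It relies on the `tol` argument of `near()`, whose default is the square root of the machine epsilon (roughly 1.5e-8); the looser value `1e-5` is chosen purely for illustration.

```{r}
library(dplyr)

# Made-up values: one classic floating point surprise, one exact value,
# and one value that really is slightly different from 0.3
x <- c(0.1 + 0.2, 0.3, 0.3 + 1e-6)

exact   <- x == 0.3                  # strict equality
default <- near(x, 0.3)              # default tolerance
loose   <- near(x, 0.3, tol = 1e-5)  # wider tolerance, for illustration only

exact
default
loose
```

A wider tolerance treats genuinely different numbers as equal, so it is usually safest to stay with the default unless you know the precision of your data.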
### `is.na()`
Another common way to create a logical vector is with `is.na()`.
This is particularly important in conjunction with `filter()` because `filter()` only selects rows where the value is `TRUE`; rows where the value is `FALSE` or `NA` are automatically dropped.
```{r}
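# Flights missing a departure delay, an arrival delay, or both
flights |> filter(is.na(dep_delay) | is.na(arr_delay))
# Flights missing exactly one of the two
flights |> filter(is.na(dep_delay) != is.na(arr_delay))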
```
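The way `filter()` treats a missing condition is easiest to see on a tiny example. This sketch uses a hypothetical tibble `df_na`: a comparison against `NA` yields `NA`, and `filter()` drops that row unless you ask for it explicitly with `is.na()`.

```{r}
library(dplyr)

# Hypothetical data with one missing value
df_na <- tibble(x = c(1, NA, 3))

# The comparison is NA for the missing row, so that row is dropped
kept_default <- df_na |> filter(x > 1)

# Explicitly keep the rows where x is missing
kept_with_na <- df_na |> filter(is.na(x) | x > 1)

kept_default
kept_with_na
```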
## Boolean algebra
Once you have multiple logical vectors, you can combine them together using Boolean algebra: `&` is "and", `|` is "or", and `!` is "not".
`xor()` provides one final useful operation: exclusive or.
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations and how they work.
```{r bool-ops}
#| echo: false
#| out.width: NULL
#| fig.cap: >
#|   Complete set of Boolean operations. `x` is the left-hand
#|   circle, `y` is the right-hand circle, and the shaded regions show
#|   which parts each operator selects.
#| fig.alt: >
#|   Six Venn diagrams, each explaining a given logical operator. The
#|   circles (sets) in each of the Venn diagrams represent x and y. y &
#|   !x is y but none of x, x & y is the intersection of x and y, x & !y is
#|   x but none of y, x is all of x none of y, xor(x, y) is everything
#|   except the intersection of x and y, y is all of y none of x, and
#|   x | y is everything.
knitr::include_graphics("diagrams/transform-logical.png")
```
As well as `&` and `|`, R also has `&&` and `||`.
Don't use them in dplyr functions!
These are called short-circuiting operators and only ever return a single `TRUE` or `FALSE`.
They're important for programming so you'll learn more about them in Section \@ref(conditional-execution).
The following code finds all flights that departed in November or December:
```{r, eval = FALSE}
flights |>
  filter(month == 11 | month == 12)
```
Note that the order of operations doesn't work like English.
You can't think "find all flights that departed in November or December" and write `flights |> filter(month == 11 | 12)`.
This code will not error, but it will do something rather confusing.
Because `==` is evaluated before `|`, R first evaluates `month == 11`, which creates a logical vector.
Then it combines that vector with `12` using `|`; in a logical context any non-zero number is treated as `TRUE`, so this is equivalent to `(month == 11) | TRUE`.
That is always `TRUE`, so `flights |> filter(month == 11 | 12)` returns every row!
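One way to check what R actually does with this expression is to try it on a short vector (the `month` vector here is made up for illustration). R's operator precedence evaluates `==` before `|`, so the first two results below are identical; the third shows the other possible grouping, which you only get by adding parentheses yourself.

```{r}
# Made-up months for illustration
month <- c(1, 11, 12)

# How R parses the expression: `==` is evaluated before `|`
parsed   <- month == 11 | 12
explicit <- (month == 11) | TRUE

# The other grouping, which requires explicit parentheses
other <- month == (11 | 12)

parsed
explicit
other
```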
### `%in%`
An easy way to avoid this issue is to use `%in%`.
`x %in% y` returns a logical vector the same length as `x` that is `TRUE` whenever a value in `x` is anywhere in `y`.
So we could instead write:
```{r, eval = FALSE}
flights |>
  filter(month %in% c(11, 12))
```
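Outside of `filter()`, a quick sketch on a made-up vector shows the shape of the result. It also shows one difference from chaining `==` with `|`: `%in%` returns `FALSE` for a missing value, whereas `==` returns `NA`.

```{r}
# Made-up months, including one missing value
month <- c(1, 5, 11, 12, NA)

with_in <- month %in% c(11, 12)
with_or <- month == 11 | month == 12

with_in  # same length as month, never NA
with_or  # NA where month is missing
```

Inside `filter()` both versions select the same rows here, because rows where the condition is `NA` are dropped anyway.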
Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`.
For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
```{r, eval = FALSE}
flights |>
  filter(!(arr_delay > 120 | dep_delay > 120))
flights |>
  filter(arr_delay <= 120 & dep_delay <= 120)
```
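If you want to convince yourself that De Morgan's law holds, you can compare both sides on a pair of made-up logical vectors. The vectors `x` and `y` below are illustrative only; `x` includes an `NA` to show that the two forms agree for that case too.

```{r}
# Made-up logical vectors, including one missing value in x
x <- c(TRUE, TRUE, FALSE, FALSE, NA)
y <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# Each pair of expressions should give exactly the same vector
not_or_same  <- identical(!(x | y), !x & !y)
not_and_same <- identical(!(x & y), !x | !y)

not_or_same
not_and_same
```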
### Missing values {#logical-missing}
The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:
```{r}
NA & c(TRUE, FALSE, NA)
NA | c(TRUE, FALSE, NA)
```
<!-- Draw truth tables? -->
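To understand what's going on, think about `NA | TRUE`: regardless of whether the missing value is really `TRUE` or `FALSE`, the result is still `TRUE`.
That means even if you don't know what the value is (i.e. it's missing), the result must still be `TRUE`.
Similarly, `NA & FALSE` must be `FALSE` whatever the missing value is.
In the remaining cases the result depends on the unknown value, so it is `NA`.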
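A compact way to see all nine combinations at once is to build the truth tables with base R's `outer()`. This is only a sketch: the named vector `lgl` is a helper introduced here so that the rows and columns of the tables get readable labels.

```{r}
# Helper vector (introduced for this sketch); names become row/column labels
lgl <- c(TRUE, FALSE, NA)
names(lgl) <- c("TRUE", "FALSE", "NA")

# Rows are the left-hand operand, columns the right-hand operand
and_table <- outer(lgl, lgl, "&")
or_table  <- outer(lgl, lgl, "|")

and_table
or_table
```

The only known results in the `NA` row and column are `FALSE` for `&` and `TRUE` for `|`, which matches the reasoning above.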
## Summaries
There are four particularly useful summary functions for logical vectors: they all take a vector of logical values and return a single value, making them a good fit for use in `summarise()`.
The first two are `any()` and `all()`: `any()` will return `TRUE` if there's at least one `TRUE`, and `all()` will return `TRUE` if all values are `TRUE`.
Like other summary functions, they can return `NA` when missing values are present, and as usual you can make the missing values go away with `na.rm = TRUE`.
We could use this to see if there were any days where every flight was delayed:
```{r}
not_cancelled <- flights |> filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled |>
  group_by(year, month, day) |>
  filter(all(arr_delay >= 0))
```
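The interaction between `any()`, `all()`, and missing values can be checked on a few hand-made vectors. In this sketch the result is `NA` only when the missing value could change the answer; the variable names are introduced here for illustration.

```{r}
# The NA could be TRUE, so the answer is unknown
any_unknown <- any(c(FALSE, NA))
# One TRUE is enough, whatever the NA is
any_known <- any(c(TRUE, NA))

# The NA could be FALSE, so the answer is unknown
all_unknown <- all(c(TRUE, NA))
# One FALSE is enough, whatever the NA is
all_known <- all(c(FALSE, NA))

# Dropping missing values first always gives a definite answer
any_dropped <- any(c(FALSE, NA), na.rm = TRUE)
all_dropped <- all(c(TRUE, NA), na.rm = TRUE)

c(any_unknown, any_known, all_unknown, all_known, any_dropped, all_dropped)
```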
`sum()` and `mean()` are particularly useful with logical vectors because when you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
That means that `sum(x)` gives the number of `TRUE`s in `x` and `mean(x)` gives the proportion of `TRUE`s.
That lets us find the day with the highest proportion of delayed flights:
```{r}
not_cancelled |>
  group_by(year, month, day) |>
  summarise(prop_delayed = mean(arr_delay > 0)) |>
  arrange(desc(prop_delayed))
```
Or we could ask how many flights left before 5am, which are usually flights that were delayed from the previous day:
```{r}
not_cancelled |>
  group_by(year, month, day) |>
  summarise(n_early = sum(dep_time < 500)) |>
  arrange(desc(n_early))
```
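To see the coercion behind `sum()` and `mean()` directly, here is a minimal sketch on a hand-made vector of delays (the values in `delay` are made up).

```{r}
# Made-up arrival delays in minutes
delay <- c(-5, 12, 0, 30)

is_delayed <- delay > 0

n_delayed    <- sum(is_delayed)   # number of TRUEs
prop_delayed <- mean(is_delayed)  # proportion of TRUEs

is_delayed
n_delayed
prop_delayed
```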
### Exercises
1. For each plane, count the number of flights before the first delay of greater than 1 hour.
2. What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to?
## Transformations
### Cumulative functions
Another useful pair of functions are cumulative any, `cumany()`, and cumulative all, `cumall()`.
`cumany()` will be `TRUE` from the first `TRUE` it encounters onwards, and `cumall()` will be `FALSE` from the first `FALSE` it encounters onwards.
```{r}
cumany(c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE))
cumall(c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
```
These are particularly useful in conjunction with `filter()` because they allow you to select rows:
- Before the first `FALSE` with `cumall(x)`.
- Before the first `TRUE` with `cumall(!x)`.
- After the first `TRUE` with `cumany(x)`.
- After the first `FALSE` with `cumany(!x)`.
If you imagine some data about a bank balance, then these functions allow you to find all rows from the first overdraft onwards, or all rows up to the first overdraft:
```{r}
df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)
# all rows from the first overdraft onwards
df |> filter(cumany(balance < 0))
# all rows before the first overdraft
df |> filter(cumall(!(balance < 0)))
```
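As with other logical conditions, it can help to look at the cumulative vectors in a `mutate()` before filtering on them. The sketch below rebuilds the same bank balance data; the column names `overdrawn`, `seen_overdraft`, and `never_overdrawn` are introduced here for illustration.

```{r}
library(dplyr)

df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)

flagged <- df |>
  mutate(
    overdrawn = balance < 0,
    # TRUE from the first overdraft onwards
    seen_overdraft = cumany(overdrawn),
    # TRUE only while no overdraft has happened yet
    never_overdrawn = cumall(!overdrawn)
  )
flagged
```

Note that `seen_overdraft` stays `TRUE` even after the balance recovers, which is why filtering on it keeps the last two rows.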
### Conditional outputs
If you want to use one value when a condition is `TRUE` and another value when it's `FALSE`, you can use `if_else()`[^logicals-1].
[^logicals-1]: This is the dplyr counterpart of the base R function `ifelse()`.
    There are two main advantages of `if_else()` over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable.
```{r}
df <- data.frame(
  date = as.Date("2020-01-01") + 0:6,
  balance = c(100, 50, 25, -25, -50, 30, 120)
)
df |> mutate(status = if_else(balance < 0, "overdraft", "ok"))
```
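The handling of missing values mentioned above goes through the `missing` argument of `if_else()`. Here is a short sketch on a made-up vector `bal`; the label `"unknown"` is just an illustrative choice.

```{r}
library(dplyr)

# Made-up balances with one missing value
bal <- c(10, -5, NA)

# By default a missing condition gives a missing result
status_default <- if_else(bal < 0, "overdraft", "ok")

# `missing` supplies the value to use when the condition is NA
status_labelled <- if_else(bal < 0, "overdraft", "ok", missing = "unknown")

status_default
status_labelled
```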
If you start to nest multiple `if_else()` calls, I'd suggest switching to `case_when()` instead.
`case_when()` has a special syntax: it takes pairs that look like `condition ~ output`.
`condition` must evaluate to a logical vector; when it's `TRUE`, `output` will be used.
```{r}
df |>
  mutate(
    status = case_when(
      balance == 0 ~ "no money",
      balance <  0 ~ "overdraft",
      balance >  0 ~ "ok"
    )
  )
```
(Note that I usually add spaces to make the outputs line up so it's easier to scan.)
If none of the cases match, the output will be missing:
```{r}
x <- 1:10
case_when(
  x %% 2 == 0 ~ "even",
)
```
You can create a catch-all value by using `TRUE` as the condition:
```{r}
case_when(
  x %% 2 == 0 ~ "even",
  TRUE ~ "odd"
)
```
If multiple conditions are `TRUE`, the first is used:
```{r}
case_when(
  x < 5 ~ "< 5",
  x < 3 ~ "< 3",
)
```
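Because the first matching condition wins, the order of the conditions matters. As a sketch, putting the more specific condition first lets it take effect; `x` is redefined here so the chunk runs on its own, and `labels` is a name introduced for illustration.

```{r}
library(dplyr)

x <- 1:10

# The narrower condition comes first, so it is no longer shadowed
labels <- case_when(
  x < 3 ~ "< 3",
  x < 5 ~ "< 5"
)
labels
```

Values of 5 and above match neither condition, so they are still missing.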