540 lines
18 KiB
Plaintext
540 lines
18 KiB
Plaintext
# Logical vectors {#logicals}
|
|
|
|
```{r, results = "asis", echo = FALSE}
|
|
status("drafting")
|
|
```
|
|
|
|
## Introduction
|
|
|
|
In this chapter, you'll learn useful tools for working with logical vectors.
|
|
Logical vectors are the simplest type of vector because each element can only be one of three possible values: `TRUE`, `FALSE`, and `NA`.
|
|
You'll find logical vectors directly in data relatively rarely, but despite that they're extremely powerful because you'll frequently create them during data analysis.
|
|
|
|
We'll begin with the most common way of creating logical vectors: numeric comparisons.
|
|
Then we'll talk about using Boolean algebra to combine different logical vectors, and some useful summaries for logical vectors.
|
|
We'll finish off with some other tool for making conditional changes.
|
|
Along the way, you'll also learn a little more about working with missing values, `NA`.
|
|
|
|
### Prerequisites
|
|
|
|
Most of the functions you'll learn about this package are provided by base R; I'll label any new functions that don't come from base R with `dplyr::`.
|
|
You don't need the tidyverse to use base R functions, but we'll still load it so we can use `mutate()`, `filter()`, and friends.
|
|
use plenty of functions .
|
|
We'll also continue to draw inspiration from the nyclights13 dataset.
|
|
|
|
```{r setup, message = FALSE}
|
|
library(tidyverse)
|
|
library(nycflights13)
|
|
```
|
|
|
|
However, as we start to discuss more tools, there won't always be a perfect real example.
|
|
So we'll also start to use more abstract examples where we create some dummy data with `c()`.
|
|
This makes it easiesr to explain the general point at the cost to making it harder to see how it might apply to your data problems.
|
|
Just remember that any manipulate we do to a free-floating vector, you can do to a variable inside data frame with `mutate()` and friends.
|
|
|
|
```{r}
|
|
x <- c(1, 2, 3, 5, 7, 11, 13)
|
|
x * 2
|
|
|
|
# Equivalent to:
|
|
df <- tibble(x)
|
|
df |>
|
|
mutate(y = x * 2)
|
|
```
|
|
|
|
## Comparisons
|
|
|
|
A very common way to create a logical vector is via a numeric comparison with `<`, `<=`, `>`, `>=`, `!=`, and `==`.
|
|
You'll learn other ways to create them in later chapters dealing with strings and dates.
|
|
So far, we've mostly created logical variables implicitly within `filter()` --- they are computed, used, and then throw away.
|
|
For example, the following filter finds all day time departures that leave roughly on time:
|
|
|
|
```{r}
|
|
flights |>
|
|
filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
|
|
```
|
|
|
|
But it's useful to know that this is a shortcut and you can explicitly create the underlying logical variables with `mutate()`:
|
|
|
|
```{r}
|
|
flights |>
|
|
mutate(
|
|
daytime = dep_time > 600 & dep_time < 2000,
|
|
approx_ontime = abs(arr_delay) < 20,
|
|
.keep = "used"
|
|
)
|
|
```
|
|
|
|
This is useful because it allows you to name components, which can made the code easier to read, and it allows you to double-check the intermediate steps.
|
|
This is a particularly useful technique when you're doing more complicated Boolean algebra, as you'll learn about in the next section.
|
|
|
|
So the initial filter could also be written as:
|
|
|
|
```{r, results = FALSE}
|
|
flights |>
|
|
mutate(
|
|
daytime = dep_time > 600 & dep_time < 2000,
|
|
approx_ontime = abs(arr_delay) < 20,
|
|
) |>
|
|
filter(daytime & approx_ontime)
|
|
```
|
|
|
|
### Floating point comparison
|
|
|
|
Beware when using `==` with numbers as results might surprise you!
|
|
It looks like this vector contains the numbers 1 and 2:
|
|
|
|
```{r}
|
|
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
|
|
x
|
|
```
|
|
|
|
But if you test them for equality, you surprisingly get `FALSE`:
|
|
|
|
```{r}
|
|
x == c(1, 2)
|
|
```
|
|
|
|
That's because computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so in most cases, the number number you see is an actually approximation.
|
|
R usually rounds these numbers to avoid displaying a bunch of usually unimportant digits.
|
|
|
|
To see the details you can call `print()` with the the `digits`[^logicals-1] argument.
|
|
R normally calls print automatically for you (i.e. `x` is a shortcut for `print(x)`), but calling it explicitly is useful if you want to provide other arguments:
|
|
|
|
[^logicals-1]: A floating point number can hold roughly 16 decimal digits; the precise number is surprisingly complicated and depends on the number.
|
|
|
|
```{r}
|
|
print(x, digits = 16)
|
|
```
|
|
|
|
Now that you've seen why `==` is failing, what can you do about it?
|
|
One option is to use `round()` to round to any number of digits, or instead of `==`, use `dplyr::near()`, which does the comparison with a small amount of tolerance:
|
|
|
|
```{r}
|
|
near(x, c(1, 2))
|
|
```
|
|
|
|
### Missing values {#na-comparison}
|
|
|
|
Missing values represent the unknown so they missing values are "contagious": almost any operation involving an unknown value will also be unknown:
|
|
|
|
```{r}
|
|
NA > 5
|
|
10 == NA
|
|
```
|
|
|
|
The most confusing result is this one:
|
|
|
|
```{r}
|
|
NA == NA
|
|
```
|
|
|
|
It's easiest to understand why this is true with a bit more context:
|
|
|
|
```{r}
|
|
# Let x be Mary's age. We don't know how old she is.
|
|
x <- NA
|
|
|
|
# Let y be John's age. We don't know how old he is.
|
|
y <- NA
|
|
|
|
# Are John and Mary the same age?
|
|
x == y
|
|
# We don't know!
|
|
```
|
|
|
|
So if you want to find all flights with `dep_time` is missing, the following code won't work because `dep_time == NA` will yield a `NA` for every single row, and `filter()` automatically drops missing values:
|
|
|
|
```{r}
|
|
flights |>
|
|
filter(dep_time == NA)
|
|
```
|
|
|
|
Instead we'll need a new tool: `is.na()`.
|
|
|
|
### `is.na()`
|
|
|
|
There's one other very useful way to create logical vectors: `is.na()`.
|
|
This takes any type of vector and returns `TRUE` is the value is `NA`, and `FALSE` otherwise:
|
|
|
|
```{r}
|
|
is.na(c(TRUE, NA, FALSE))
|
|
is.na(c(1, NA, 3))
|
|
is.na(c("a", NA, "b"))
|
|
```
|
|
|
|
We can use `is.na()` to find all the rows with a missing `dep_time`:
|
|
|
|
```{r}
|
|
flights |>
|
|
filter(is.na(dep_time))
|
|
```
|
|
|
|
It can also be useful in `arrange()`, because by default, `arrange()` puts all the missing values at the end.
|
|
You can override this default by first sorting by `is.na()`:
|
|
|
|
```{r}
|
|
flights |>
|
|
arrange(arr_delay)
|
|
|
|
flights |>
|
|
arrange(desc(is.na(arr_delay)), arr_delay)
|
|
```
|
|
|
|
### Exercises
|
|
|
|
1. How does `dplyr::near()` work? Read the source code to find out.
|
|
2. Use `mutate()`, `is.na()`, and `count()` together to describe how the missing values in `dep_time`, `sched_dep_time` and `dep_delay` are connected.
|
|
|
|
## Boolean algebra
|
|
|
|
Once you have multiple logical vectors, you can combine them together using Boolean algebra.
|
|
In R, `&` is "and", `|` is "or", and `!` is "not", and `xor()` is exclusive or[^logicals-2].
|
|
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations and how they work.
|
|
|
|
[^logicals-2]: That is, `xor(x, y)` is true if x is true, or y is true, but not both.
|
|
This is how we usually use "or" In English.
|
|
Both is not usually an acceptable answer to the question "would you like ice cream or cake?".
|
|
|
|
```{r bool-ops}
|
|
#| echo: false
|
|
#| out.width: NULL
|
|
#| fig.cap: >
|
|
#| Complete set of boolean operations. `x` is the left-hand
|
|
#| circle, `y` is the right-hand circle, and the shaded region show
|
|
#| which parts each operator selects."
|
|
#| fig.alt: >
|
|
#| Six Venn diagrams, each explaining a given logical operator. The
|
|
#| circles (sets) in each of the Venn diagrams represent x and y. 1. y &
|
|
#| !x is y but none of x, x & y is the intersection of x and y, x & !y is
|
|
#| x but none of y, x is all of x none of y, xor(x, y) is everything
|
|
#| except the intersection of x and y, y is all of y none of x, and
|
|
#| x | y is everything.
|
|
knitr::include_graphics("diagrams/transform.png", dpi = 270)
|
|
```
|
|
|
|
As well as `&` and `|`, R also has `&&` and `||`.
|
|
Don't use them in dplyr functions!
|
|
These are called short-circuiting operators and only ever return a single `TRUE` or `FALSE`.
|
|
They're important for programming so you'll learn more about them in Section \@ref(conditional-execution).
|
|
|
|
The following code finds all flights that departed in November or December:
|
|
|
|
```{r, eval = FALSE}
|
|
flights |>
|
|
filter(month == 11 | month == 12)
|
|
```
|
|
|
|
Note that the order of operations doesn't work like English.
|
|
You can't think "find all flights that departed in November or December" and write `flights |> filter(month == 11 | 12)`.
|
|
This code will not error, but it will do something rather confusing.
|
|
First R evaluates `11 | 12` which is equivalent to `TRUE | TRUE`, which returns `TRUE`.
|
|
Then it evaluates `month == TRUE`.
|
|
Since month is numeric, this is equivalent to `month == 1`, so `flights |> filter(month == 11 | 12)` returns all flights in January!
|
|
|
|
### `%in%`
|
|
|
|
An easy way to avoid this issue is to use `%in%`.
|
|
`x %in% y` returns a logical vector the same length as `x` that is `TRUE` whenever a value in `x` is anywhere in `y` .
|
|
|
|
```{r}
|
|
letters[1:10] %in% c("a", "e", "i", "o", "u")
|
|
```
|
|
|
|
So we could instead write:
|
|
|
|
```{r, eval = FALSE}
|
|
flights |>
|
|
filter(month %in% c(11, 12))
|
|
```
|
|
|
|
Note that `%in%` obeys different rules for `NA` to `==`.
|
|
|
|
```{r}
|
|
c(1, 2, NA) == NA
|
|
c(1, 2, NA) %in% NA
|
|
```
|
|
|
|
This can make for a useful shortcut:
|
|
|
|
```{r}
|
|
flights |>
|
|
filter(dep_time %in% c(NA, 0800))
|
|
```
|
|
|
|
### Missing values {#na-boolean}
|
|
|
|
The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:
|
|
|
|
```{r}
|
|
df <- tibble(x = c(TRUE, FALSE, NA))
|
|
|
|
df |>
|
|
mutate(
|
|
and = x & NA,
|
|
or = x | NA
|
|
)
|
|
```
|
|
|
|
To understand what's going on, think about `NA | TRUE`.
|
|
A missing value means that the value could either be `TRUE` or `FALSE`.
|
|
`TRUE | TRUE` and `FALSE | TRUE` are both `TRUE`, so `NA | TRUE` must also be `TRUE`.
|
|
Similar reasoning applies with `NA & FALSE`.
|
|
|
|
### Exercises
|
|
|
|
1. Find all flights where `arr_delay` is missing but `dep_delay` is not. Find all flights where neither `arr_time` nor `sched_arr_time` are missing, but `arr_delay` is.
|
|
2. How many flights have a missing `dep_time`? What other variables are missing in these rows? What might these rows represent?
|
|
3. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
|
|
|
|
## Summaries {#logical-summaries}
|
|
|
|
While, you can summarize logical variables directly with functions that work only with logicals, there are two other important summaries.
|
|
Numeric summaries like `sum()` and `mean()`, and using summaries as inline filters.
|
|
|
|
### Logical summaries
|
|
|
|
There are two important logical summaries: `any()` and `all()`.
|
|
`any(x)` is the equivalent of `|`; it'll return `TRUE` if there are any `TRUE`'s in `x`.
|
|
`all(x)` is equivalent of `&`; it'll return `TRUE` only if all values of `x` are `TRUE`'s.
|
|
Like all summary functions, they'll return `NA` if there are any missing values present, and like usual you can make the missing values go away with `na.rm = TRUE`.
|
|
|
|
For example, we could use `all()` to find out if there were days where every flight was delayed:
|
|
|
|
```{r}
|
|
not_cancelled <- flights |>
|
|
filter(!is.na(dep_delay), !is.na(arr_delay))
|
|
|
|
not_cancelled |>
|
|
group_by(year, month, day) |>
|
|
summarise(
|
|
all_delayed = all(arr_delay >= 0),
|
|
any_delayed = any(arr_delay >= 0),
|
|
.groups = "drop"
|
|
)
|
|
```
|
|
|
|
In most cases, however, `any()` and `all()` are a little too crude, and it would be nice to be able to get a little more detail about how many values are `TRUE` or `FALSE`.
|
|
That leads us to the numeric summaries.
|
|
|
|
### Numeric summaries
|
|
|
|
When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
|
|
This makes `sum()` and `mean()` are particularly useful with logical vectors because `sum(x)` will give the number of `TRUE`s and `mean(x)` gives the proportion of `TRUE`s.
|
|
That lets us see the distribution of delays across the days of the year:
|
|
|
|
```{r}
|
|
not_cancelled |>
|
|
group_by(year, month, day) |>
|
|
summarise(
|
|
prop_delayed = mean(arr_delay > 0),
|
|
.groups = "drop"
|
|
) |>
|
|
ggplot(aes(prop_delayed)) +
|
|
geom_histogram(binwidth = 0.05)
|
|
```
|
|
|
|
Or we could ask how many flights left before 5am, which usually are flights that were delayed from the previous day:
|
|
|
|
```{r}
|
|
not_cancelled |>
|
|
group_by(year, month, day) |>
|
|
summarise(
|
|
n_early = sum(dep_time < 500),
|
|
.groups = "drop"
|
|
) |>
|
|
arrange(desc(n_early))
|
|
```
|
|
|
|
### Logical subsetting
|
|
|
|
There's one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest.
|
|
This makes use of the base `[` (pronounced subset) operator, which you'll learn more about this in Section \@ref(vector-subsetting).
|
|
|
|
Imagine we wanted to look at the average delay just for flights that were actually delayed.
|
|
One way to do so would be to first filter the flights:
|
|
|
|
```{r}
|
|
not_cancelled |>
|
|
filter(arr_delay > 0) |>
|
|
group_by(year, month, day) |>
|
|
summarise(
|
|
ahead = mean(arr_delay),
|
|
n = n(),
|
|
.groups = "drop"
|
|
)
|
|
```
|
|
|
|
This works, but what if we wanted to also compute the average delay for flights that left early?
|
|
We'd need to perform a separate filter step, and then figure out how to combine the two data frames together (which we'll cover in Chapter \@ref(relational-data)).
|
|
Instead you could use `[` to perform an inline filtering: `arr_delay[arr_delay > 0]` will yield only the positive arrival delays.
|
|
|
|
This leads to:
|
|
|
|
```{r}
|
|
not_cancelled |>
|
|
group_by(year, month, day) |>
|
|
summarise(
|
|
ahead = mean(arr_delay[arr_delay > 0]),
|
|
behind = mean(arr_delay[arr_delay < 0]),
|
|
n = n(),
|
|
.groups = "drop"
|
|
)
|
|
```
|
|
|
|
Also note the difference in the group size: in the first chunk `n` gives the number of delayed flights per day; in the second, `n` gives the total number of flights.
|
|
|
|
### Exercises
|
|
|
|
1. What will `sum(is.na(x))` tell you? How about `mean(is.na(x))`?
|
|
2. What does `prod()` return when applied to a logical vector? What logical summary function is it equivalent to? What does `min()` return applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.
|
|
|
|
## Conditional transformations
|
|
|
|
One of the most powerful features of logical vectors are their use for conditional transformations, i.e. returning one value for true values, and a different value for false values.
|
|
We'll see a couple of different ways to do this, and the
|
|
|
|
### `if_else()`
|
|
|
|
If you want to use one value when a condition is true and another value when it's `FALSE`, you can use `if_else()`[^logicals-3].
|
|
|
|
[^logicals-3]: This is equivalent to the base R function `ifelse`.
|
|
There are two main advantages of `if_else()`over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error message if you use the wrong type of variable.
|
|
|
|
```{r}
|
|
df <- tibble(
|
|
date = as.Date("2020-01-01") + 0:6,
|
|
balance = c(100, 50, 25, -25, -50, 30, 120)
|
|
)
|
|
df |>
|
|
mutate(
|
|
status = if_else(balance < 0, "overdraft", "ok")
|
|
)
|
|
```
|
|
|
|
If you need to create more complex conditions, you can string together multiple `if_elses()`s, but this quickly gets hard to read.
|
|
|
|
```{r}
|
|
df |>
|
|
mutate(
|
|
status = if_else(balance == 0, "zero",
|
|
if_else(balance < 0, "overdraft", "ok"))
|
|
)
|
|
```
|
|
|
|
Instead, you can switch to `case_when()` instead.
|
|
|
|
### `case_when()`
|
|
|
|
`case_when()` has a special syntax that unfortunately looks like nothing else you'll use in the tidyverse.
|
|
it takes pairs that look like `condition ~ output`.
|
|
`condition` must make a logical a logical vector; when it's `TRUE`, `output` will be used.
|
|
|
|
```{r}
|
|
flights |>
|
|
mutate(
|
|
status = case_when(
|
|
is.na(arr_delay) ~ "cancelled",
|
|
arr_delay > 60 ~ "very late",
|
|
arr_delay > 15 ~ "late",
|
|
abs(arr_delay) <= 15 ~ "on time",
|
|
arr_delay < -15 ~ "early",
|
|
arr_delay < -30 ~ "very early",
|
|
),
|
|
.keep = "used"
|
|
)
|
|
```
|
|
|
|
(Note that I usually add spaces to make the outputs line up so it's easier to scan)
|
|
|
|
To explain how `case_when()` works, lets pull it out of the mutate and create some simple dummy data.
|
|
|
|
```{r}
|
|
x <- 1:10
|
|
case_when(
|
|
x < 5 ~ "small",
|
|
x >= 5 ~ "big"
|
|
)
|
|
```
|
|
|
|
- If none of the cases match, the output will be missing:
|
|
|
|
```{r}
|
|
case_when(
|
|
x %% 2 == 0 ~ "even",
|
|
)
|
|
```
|
|
|
|
- You can create a catch all value by using `TRUE` as the condition:
|
|
|
|
```{r}
|
|
case_when(
|
|
x %% 2 == 0 ~ "even",
|
|
TRUE ~ "odd"
|
|
)
|
|
```
|
|
|
|
- If multiple conditions are `TRUE`, the first is used:
|
|
|
|
```{r}
|
|
case_when(
|
|
x < 5 ~ "< 5",
|
|
x < 3 ~ "< 3",
|
|
TRUE ~ "big"
|
|
)
|
|
```
|
|
|
|
The simple examples I've shown you here all use just a single variable, but the logical conditions can use any number of variables.
|
|
And you can use variables on the right hand side.
|
|
|
|
## Cumulative tricks
|
|
|
|
Before we move on to the next chapter, I want to show you a grab bag of tricks that make use of cumulative functions (i.e. functions that depending on every previous value of a vector).
|
|
These all feel a bit magical, and I'm torn on whether or not they should be included in this book.
|
|
But in the end, some of them are just so useful I think it's important to mention them --- they're not particularly easy to understand and don't help with that many problems, but when they do, they provide a substantial advantage.
|
|
|
|
<!-- TODO: illustration of accumulating function -->
|
|
|
|
Another useful pair of functions are cumulative any, `dplyr::cumany()`, and cumulative all, `dplyr::cumall()`.
|
|
`cumany()` will be `TRUE` after it encounters the first `TRUE`, and `cumall()` will be `FALSE` after it encounters its first `FALSE`.
|
|
|
|
```{r}
|
|
cumany(c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE))
|
|
cumall(c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))
|
|
```
|
|
|
|
These are particularly useful in conjunction with `filter()` because they allow you to select rows:
|
|
|
|
- Before the first `FALSE` with `cumall(x)`.
|
|
- Before the first `TRUE` with `cumall(!x)`.
|
|
- After the first `TRUE` with `cumany(x)`.
|
|
- After the first `FALSE` with `cumany(!x)`.
|
|
|
|
If you imagine some data about a bank balance, then these functions allow you t
|
|
|
|
```{r}
|
|
df <- tibble(
|
|
date = as.Date("2020-01-01") + 0:6,
|
|
balance = c(100, 50, 25, -25, -50, 30, 120)
|
|
)
|
|
# all rows after first overdraft
|
|
df |> filter(cumany(balance < 0))
|
|
# all rows until first overdraft
|
|
df |> filter(cumall(!(balance < 0)))
|
|
```
|
|
|
|
`cumsum()` as way of defining groups:
|
|
|
|
```{r}
|
|
df |>
|
|
mutate(
|
|
negative = balance < 0,
|
|
flip = negative != lag(negative),
|
|
group = cumsum(coalesce(flip, FALSE))
|
|
)
|
|
```
|
|
|
|
### Exercises
|
|
|
|
1. For each plane, count the number of flights before the first delay of greater than 1 hour.
|