More join polishing

This commit is contained in:
Hadley Wickham 2022-09-07 12:30:35 -05:00
parent e6939c52d5
commit 50e8e3965b
14 changed files with 168 additions and 149 deletions

BIN
diagrams/join/closest.png Normal file

317
joins.qmd

@ -390,12 +390,10 @@ flights2 |>
## How do joins work?
Now that you've used a few joins it's time to learn more about how they work, focusing especially on how each row in `x` matches with each row in `y`.
We'll start with a visual representation of the two simple tibbles defined below.
Figure @fig-join-setup.
The coloured column represents the keys of the two data frames, here literally called `key`.
The grey column represents the "value" column that is carried along for the ride.
Now that you've used a few joins it's time to learn more about how they work, focusing especially on how each row in `x` matches with rows in `y`.
We'll begin by using @fig-join-setup to introduce a visual representation of the two simple tibbles defined below.
The column with colored cells represents the keys of the two data frames, here literally called `key`.
The grey columns represent the "value" columns that are carried along for the ride.
In these examples we'll use a single key variable, but the idea generalizes to multiple keys and multiple values.
```{r}
@ -422,47 +420,45 @@ y <- tribble(
#| fig-alt: >
#| x and y are two data frames with 2 columns and 3 rows each. The first
#| column in each is the key and the second is the value. The contents of
#| these data frames are given in the subsequent code chunk.
#| these data frames are given in the previous code chunk.
knitr::include_graphics("diagrams/join/setup.png", dpi = 270)
```
A join is a way of connecting each row in `x` to zero, one, or more rows in `y`.
@fig-join-setup2 shows each potential match as an intersection of a pair of lines.
If you look closely, you'll notice that we've switched the order of the key and value columns in `x`.
This is to emphasize that joins match based on the key; the other columns are just carried along for the ride.
@fig-join-setup2 shows all potential matches between `x` and `y` as an intersection of a pair of lines.
For this example, the rows in the output will be primarily determined by `x`, so the `x` table is horizontal and will line up with the output.
```{r}
#| label: fig-join-setup2
#| echo: false
#| out-width: ~
#| fig-cap: >
#| To prepare to show how joins work we create a grid showing every
#| possible match between the two tibbles.
#| To understand how joins work, it's useful to think of every possible
#| match. Here we show that by drawing a grid of connecting lines.
#| fig-alt: >
#| x and y data frames placed next to each other, with the key variable
#| moved up front in y so that the key variable in x and key variable
#| in y appear next to each other.
#| x and y are placed at right-angles, with horizontal lines extending
#| from x and vertical lines extending from y. There are 3 rows in x and
#| 3 rows in y leading to 9 intersections that represent nine potential
#| matches.
knitr::include_graphics("diagrams/join/setup2.png", dpi = 270)
```
In an actual join, matches will be indicated with dots, as in @fig-join-inner.
The number of dots = the number of matches = the number of rows in the output, a new data frame that contains the key, the x values, and the y values.
The join shown here is a so-called **inner join**, where the output contains only the rows that appear in both `x` and `y`.
The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values.
The join shown here is a so-called **inner join**, where rows match if the keys are equal, so that the output contains only the rows with keys that appear in both `x` and `y`.
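As a minimal sketch of the inner join just described, assume `x` and `y` hold the keys used in the diagrams (1, 2, 3 in `x` and 1, 2, 4 in `y`); the value labels here are illustrative stand-ins rather than the chapter's exact data.

```r
library(dplyr)

# Assumed stand-ins for the two tibbles shown in the diagrams
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Keys 1 and 2 appear in both tables, so each produces one match (one dot)
# and therefore one row in the output
inner <- x |> inner_join(y, join_by(key))
inner
```

Because this is an equi-join, the output carries a single `key` column alongside `val_x` and `val_y`.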
```{r}
#| label: fig-join-inner
#| echo: false
#| out-width: ~
#| fig-cap: >
#| A join showing which rows in the x table match rows in the y table.
#| An inner join matches rows in `x` to rows in `y` that have the
#| same value of `key`. Each match becomes a row in the output.
#| fig-alt: >
#| Keys 1 and 2 in x and y data frames are matched and indicated with lines
#| joining these rows with dot in the middle. Hence, there are two dots in
#| this diagram. The resulting joined data frame has two rows and 3 columns:
#| key, val_x, and val_y. Values in the key column are 1 and 2, the matched
#| values.
#| Keys 1 and 2 appear in both x and y, so their values are equal and
#| we get a match, indicated by a dot. Each dot corresponds to a row
#| in the output, so the resulting joined data frame has two rows.
knitr::include_graphics("diagrams/join/inner.png", dpi = 270)
```
@ -470,65 +466,64 @@ knitr::include_graphics("diagrams/join/inner.png", dpi = 270)
An **outer join** keeps observations that appear in at least one of the data frames.
These joins work by adding an additional "virtual" observation to each data frame.
This observation has a key that matches if no other key matches, and values filled with `NA`.
There are three types of outer joins:
- A **left join** keeps all observations in `x`, @fig-join-left.
Every row of `x` is preserved in the output because it can fall back to matching a row of `NA`s in `y`.
```{r}
#| label: fig-join-left
#| echo: false
#| out-width: ~
#| fig-cap: >
#| A visual representation of the left join. Every row of `x` is
#| preserved in the output because it can fallback to matching a
#| row of `NA`s in `y`.
#| A visual representation of the left join where every row in `x` appears
#| in the output.
#| fig-alt: >
#| Left join: keys 1 and 2 from x are matched to those in y, key 3 is
#| also carried along to the joined result since it's on the left data
#| frame, but key 4 from y is not carried along since it's on the right
#| but not on the left. The result has 3 rows: keys 1, 2, and 3,
#| all values from val_x, and the corresponding values from val_y for
#| keys 1 and 2 with an NA for key 3, val_y.
#| Compared to the inner join, the `y` table gets a new virtual row
#| that will match any row in `x` that doesn't otherwise have a match.
#| This means that the output now has three rows. For key = 3, which
#| matches this virtual row, the value of val_y is NA.
knitr::include_graphics("diagrams/join/left.png", dpi = 270)
```
- A **right join** keeps all observations in `y`, @fig-join-right.
Every row of `y` is preserved in the output because it can fall back to matching a row of `NA`s in `x`.
Note the output will consist of all `x` rows that match a row in `y`, then all the rows of `y` that didn't match in `x`.
```{r}
#| label: fig-join-right
#| echo: false
#| out-width: ~
#| fig-cap: >
#| A visual representation of the right join. Every row of `y` is
#| preserved in the output because it can fallback to matching a
#| row of `NA`s in `x`.
#| A visual representation of the right join where every row of `y`
#| appears in the output.
#| fig-alt: >
#| Keys 1 and 2 from x are matched to those in y, key 4 is
#| also carried along to the joined result since it's on the right data frame,
#| but key 3 from x is not carried along since it's on the left but not on the
#| right. The result is a data frame with 3 rows: keys 1, 2, and 4, all values
#| from val_y, and the corresponding values from val_x for keys 1 and 2 with
#| an NA for key 4, val_x.
#| also carried along to the joined result since it's on the right data
#| frame, but key 3 from x is not carried along since it's on the left
#| but not on the right. The result is a data frame with 3 rows: keys
#| 1, 2, and 4, all values from val_y, and the corresponding values
#| from val_x for keys 1 and 2 with an NA for key 4, val_x.
knitr::include_graphics("diagrams/join/right.png", dpi = 270)
```
- A **full join** keeps all observations in `x` and `y`, @fig-join-full.
Every row of `x` and `y` is included in the output because both `x` and `y` have a fallback row of `NA`s.
Note the output will consist of all `x` rows followed by the remaining `y` rows.
```{r}
#| label: fig-join-full
#| echo: false
#| out-width: ~
#| fig-cap: >
#| A visual representation of the full join. Every row of `x` and `y`
#| is included in the output because both `x` and `y` have a fallback
#| row of `NA`s.
#| A visual representation of the full join where every row in `x`
#| and `y` appears in the output.
#| fig-alt: >
#| The result has 4 rows: keys 1, 2, 3, and 4 with all values
#| from val_x and val_y, however key 2, val_y and key 4, val_x are NAs since
#| those keys aren't present in their respective data frames.
#| from val_x and val_y, however key 3, val_y and key 4, val_x are NAs
#| since those keys don't have a match in the other data frame.
knitr::include_graphics("diagrams/join/full.png", dpi = 270)
```
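The three outer joins can be compared side by side with a small sketch. The tibbles below are assumed stand-ins with the same keys as the diagrams (1, 2, 3 in `x` and 1, 2, 4 in `y`).

```r
library(dplyr)

# Assumed stand-ins for the tibbles in the diagrams
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Left join: every row of x is kept; key 3 gets NA for val_y
left <- x |> left_join(y, join_by(key))

# Right join: every row of y is kept; key 4 gets NA for val_x
right <- x |> right_join(y, join_by(key))

# Full join: all x rows, followed by the y rows that had no match
full <- x |> full_join(y, join_by(key))
```

Comparing `left$key`, `right$key`, and `full$key` shows which keys each join preserves.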
@ -544,28 +539,30 @@ This, however, is not a great representation because while it might jog your mem
#| Venn diagrams showing the difference between inner, left, right, and
#| full joins.
#| fig-alt: >
#| Venn diagrams for inner, full, left, and right joins. Each join represented
#| with two intersecting circles representing data frames x and y, with x on
#| the right and y on the left. Shading indicates the result of the join.
#| Inner join: Only intersection is shaded. Full join: Everything is shaded.
#| Left join: Only x is shaded, but not the area in y that doesn't intersect
#| with x. Right join: Only y is shaded, but not the area in x that doesn't
#| intersect with y.
#| Venn diagrams for inner, full, left, and right joins. Each join
#| represented with two intersecting circles representing data frames x
#| and y, with x on the right and y on the left. Shading indicates the
#| result of the join.
#|
#| Inner join: Only intersection is shaded.
#| Full join: Everything is shaded.
#| Left join: All of x is shaded.
#| Right join: All of y is shaded.
knitr::include_graphics("diagrams/join/venn.png", dpi = 270)
```
### Row matches
### Row matching
While the most visible impact of a join is on the columns, joins also affect the rows.
To understand what's going let's first narrow our focus to the `inner_join()` and think about each row in `x`.
What happens to each row of `x` depends on how many rows it matches in `y`:
So far we've explored what happens if there's either zero or one matches.
What happens if there's more than one match?
To understand what's going on, let's first narrow our focus to the `inner_join()` and then consider the three possible options for each row in `x`:
- If it doesn't match anything, it's dropped.
- If it matches 1 row, it's kept as is.
- If it matches more than 1 row, it's duplicated once for each match.
@fig-join-match-types illustrates these three possibilities.
These three options are illustrated in @fig-join-match-types.
```{r}
#| label: fig-join-match-types
@ -578,19 +575,21 @@ What happens to each row of `x` depends on how many rows it matches in `y`:
#| `x` and three rows in the output, there isn't a direct
#| correspondence between the rows.
#| fig-alt: >
#| TBA
#| A join diagram where x has key values 1, 2, and 3, and y has
#| key values 1, 2, 2. The output has three rows because key 1 matches
#| one row, key 2 matches two rows, and key 3 matches zero rows.
knitr::include_graphics("diagrams/join/match-types.png", dpi = 270)
```
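The three match types in the figure can be reproduced with a short sketch. The `df2` values are an assumption chosen to mirror the key pattern described in the alt text (1, 2, 2), and `multiple = "all"` is set only to make the row expansion explicit.

```r
library(dplyr)

df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
# Assumed y table: key 2 appears twice, key 3 is absent
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# key 1 matches one row (kept), key 2 matches two rows (duplicated),
# key 3 matches nothing (dropped)
out <- df1 |> inner_join(df2, join_by(key), multiple = "all")
out
```

The output has three rows, the same number as `df1`, yet the rows do not correspond one-to-one with the rows of `df1`.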
In principle, this means that there are no guarantees about the number of rows in the output of an `inner_join()`:
- There might be the same number of rows if every row in `x` matches one row in `y`.
- There might be fewer rows if some rows in `x` don't match any rows in `y`.
- There might be more rows if some rows in `x` match multiple rows in `y`.
- There might be fewer rows if some rows in `x` match no rows in `y`.
- There might be the same number of rows if every row in `x` matches one row in `y`.
- There might be the same number of rows if the number of multiple matches precisely balances out with the rows that don't match.
This is pretty dangerous so by default dplyr will warn whenever there are multiple matches:
Row expansion is a fundamental property of joins, but it feels dangerous to us so dplyr will warn whenever there are multiple matches:
```{r}
df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
@ -602,12 +601,12 @@ df1 |>
This is another reason we recommend the `left_join()` --- every row in `x` is guaranteed to match a "virtual" row in `y` so it'll never drop rows, and you'll always get a warning when it duplicates rows.
You can gain further more control with two arguments:
You can gain further control over row matching with two arguments:
- `unmatched` controls what happens if a row in `x` doesn't match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
- `multiple` controls what happens if a row in `x` matches more than one row in `y`. For equi-joins, it defaults to `"warn"` which emits a warning message if there are any multiple matches.
- `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
- `multiple` controls what happens when a row in `x` matches more than one row in `y`. For equi-joins, it defaults to `"warn"` which emits a warning message if there are any multiple matches.
There are two common cases in which you might want to customize: enforcing a one-to-one mapping or allowing multiple joins.
There are two common cases in which you might want to override the default: enforcing a one-to-one mapping or allowing multiple matches.
### One-to-one mapping
@ -619,13 +618,13 @@ df1 |>
inner_join(df2, join_by(key), unmatched = "error", multiple = "error")
```
(`unmatched = "error"` is not useful with `left_join()` because as described above, a `left_join()` always matches a virtual row in `y` filled with missing values).
Note that `unmatched = "error"` is not useful with `left_join()` because, as described above, every row in `x` has a fallback match to a virtual row in `y` filled with missing values.
### Allow multiple rows
Sometimes it's useful to expand the number of rows in the output.
This often comes about by flipping the direction of the question you're asking.
For example, as we've seen above, it's natural to ask for additional information about the plane that flew each flight:
Sometimes it's useful to deliberately expand the number of rows in the output.
A natural way that this comes about is when you flip the direction of the question you're asking.
For example, as we've seen above, it's natural to supplement the `flights` data with information about the plane that flew each flight:
```{r}
#| results: false
@ -633,7 +632,7 @@ flights2 |>
left_join(planes, by = "tailnum")
```
But it's also reasonable to ask what flights did each plane perform?
But it's also reasonable to ask which flights each plane flew:
```{r}
plane_flights <- planes |>
@ -651,9 +650,10 @@ plane_flights
### Filtering joins {#sec-non-equi-joins}
The number of matches is closely related to what the filtering joins too.
The semi-join keeps rows in `x` that have one or more matches in `y`, as in @fig-join-semi. The anti-join keeps rows in `x` that don't have a match in `y`, as in @fig-join-anti.
In both cases, nly the existence of a match is important; it doesn't matter which observation is matched.
The number of matches is also closely related to the filtering joins.
The semi-join keeps rows in `x` that have one or more matches in `y`, as in @fig-join-semi.
The anti-join keeps rows in `x` that don't have a match in `y`, as in @fig-join-anti.
In both cases, only the existence of a match is important; it doesn't matter which observation is matched.
This means that filtering joins never duplicate rows like mutating joins do.
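A minimal sketch of both filtering joins, using assumed toy tibbles in which key 2 appears twice in `y`:

```r
library(dplyr)

# Assumed toy data: key 2 has two matches in y, key 3 has none
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# semi_join keeps rows of x with at least one match; key 2 appears once,
# even though it matches two rows in y
kept <- x |> semi_join(y, join_by(key))

# anti_join keeps rows of x with no match at all
dropped <- x |> anti_join(y, join_by(key))
```

Neither result gains columns from `y`; only the rows of `x` are filtered.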
```{r}
@ -670,7 +670,7 @@ This means that filtering joins never duplicate rows like mutating joins do.
#| two results in a data frame with two rows and two columns (key and val_x),
#| with keys 1 and 2 (the only keys that match between the two data frames).
knitr::include_graphics("diagrams/join/semi.png")
knitr::include_graphics("diagrams/join/semi.png", dpi = 270)
```
```{r}
@ -692,10 +692,11 @@ knitr::include_graphics("diagrams/join/anti.png", dpi = 270)
## Non-equi joins
So far we've discussed **equi-joins**, joins where the keys in x must equal the keys in y for rows to match.
This allows us to make an important simplification in both the diagrams and the return values of the join frames: we only ever include the join key from one table.
We can request that dplyr keep both keys with `keep = TRUE`.
This is shown in the code below and in @fig-inner-both.
So far you've only seen **equi-joins**, joins where two rows match if the keys in `x` equal the keys in `y`.
Now we're going to relax that restriction and discuss other ways of determining if a pair of rows match.
But before you learn about non-equi joins we need to revisit a simplification we made above: because the `x` and `y` keys are equal, we only need to show one in the output.
We can request that dplyr keep both keys with `keep = TRUE`, leading to the code below and the re-drawn `inner_join()` in @fig-inner-both.
```{r}
x |> left_join(y, by = "key", keep = TRUE)
@ -704,9 +705,12 @@ x |> left_join(y, by = "key", keep = TRUE)
```{r}
#| label: fig-inner-both
#| fig-cap: >
#| Inner join showing keys from both `x` and `y`. This is not the
#| default because for equi-joins, the keys are the same so showing
#| both doesn't add anything.
#| An inner join showing both `x` and `y` keys in the output.
#| fig-alt: >
#| A join diagram showing an inner join between x and y. The result
#| now includes four columns: key.x, val_x, key.y, and val_y. The
#| values of key.x and key.y are identical, which is why we usually
#| omit one.
#| echo: false
#| out-width: ~
@ -714,39 +718,36 @@ knitr::include_graphics("diagrams/join/inner-both.png", dpi = 270)
```
This distinction between the keys becomes much more important as we move away from equi-joins because the key values are much more likely to be different.
Because of this, dplyr defaults to showing both keys.
For example, instead of requiring that the `x` and `y` keys be equal, we could request that key from `x` be less than the key from `y`, as in the code below and @fig-join-gte.
```{r}
x |> inner_join(y, join_by(key >= key))
```
For example, instead of matching when `x$key` and `y$key` are equal, we could match whenever `x$key` is greater than or equal to `y$key`, leading to @fig-join-gte.
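A sketch of that greater-than-or-equal join, assuming dplyr 1.1.0 or later and tibbles with the same keys as the diagrams (the value labels are illustrative):

```r
library(dplyr)

# Assumed stand-ins for the tibbles in the diagrams
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Match whenever the x key is greater than or equal to the y key.
# Both keys are kept, with .x and .y suffixes, because they can differ.
gte <- x |> inner_join(y, join_by(key >= key))
gte
```

The first row of `x` matches one row of `y`, while the second and third rows each match two, giving five rows in total.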
```{r}
#| label: fig-join-gte
#| echo: false
#| fig-cap: >
#| A non-equijoin where the `x` key must be less than the `y` key.
#| A non-equi join where the `x` key must be greater than or equal to
#| the `y` key. Many rows generate multiple matches.
#| fig-alt: >
#| A join diagram illustrating join_by(key >= key). The first row
#| of x matches one row of y and the second and third rows each match
#| two rows. This means the output has five rows containing each of the
#| following (key.x, key.y) pairs: (1, 1), (2, 1), (2, 2), (3, 1),
#| (3, 2).
knitr::include_graphics("diagrams/join/gte.png", dpi = 270)
```
Non-equi join isn't a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps a bit by identifying three useful types of non-equi join
Non-equi-join isn't particularly useful as a term because it only tells you what the join is not, not what it is. dplyr helps a bit by identifying four particularly useful types of non-equi-join:
- **Cross-joins** have no join keys
- **Cross-joins** match every pair of rows.
- **Inequality-joins** use `<`, `<=`, `>`, `>=` instead of `==`.
- **Rolling joins** use `following(x, y)` and `preceding(x, y).`
- **Overlap joins** use `between(x$val, y$lower, y$upper)`, `within(x$lower, x$upper, y$lower, y$upper)` and `overlaps(x$lower, x$upper, y$lower, y$upper).`
- **Rolling joins** are similar to inequality joins but only find the closest match.
- **Overlap joins** are a special type of inequality join designed to work with ranges.
Each of these is described in more detail in the following sections.
### Cross-joins
```{r}
df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
df |> left_join(df, join_by())
```
This is sometimes called a **self-join** because we're joining a table to itself.
A cross-join matches everything, as in @fig-join-cross, generating the Cartesian product of rows.
This means the output will have `nrow(x) * nrow(y)` rows.
```{r}
#| label: fig-join-cross
@ -754,21 +755,23 @@ This is sometimes called a **self-join** because we're joining a table to itself
#| out-width: ~
#| fig-cap: >
#| A cross join matches each row in `x` with every row in `y`.
#| fig-alt: >
#| A join diagram showing a dot for every combination of x and y.
knitr::include_graphics("diagrams/join/cross.png", dpi = 270)
```
Cross-joins are useful when you want to generate permutations.
For example, the code below generates every possible pair of names.
Joining a table to itself like this is sometimes called a **self-join**.
```{r}
df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
df |> left_join(df, join_by())
```
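In dplyr 1.1.0 and later, the same Cartesian product can also be requested explicitly with `cross_join()`. The sketch below assumes that version is installed.

```r
library(dplyr)

df <- tibble(name = c("John", "Simon", "Tracy", "Max"))

# Every row of df paired with every row of df: 4 * 4 = 16 rows.
# The duplicated column name gets .x and .y suffixes.
pairs <- cross_join(df, df)
pairs
```

The row count equals `nrow(df) * nrow(df)`, which is the defining property of a cross join.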
### Inequality joins
Inequality joins are extremely general, so general that it's hard to find specific meaning use cases.
One small useful technique is to generate all pairs:
```{r}
df <- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))
df |> left_join(df, join_by(id < id))
```
Here we perform a self-join (i.e we join a table to itself), then use the inequality join to ensure that we one of the two possible pairs (e.g. just (a, b) not also (b, a)) and don't match the same row.
Inequality joins use `<`, `<=`, `>=`, or `>` to restrict the set of possible matches, as in @fig-join-gte and @fig-cross-lt.
```{r}
#| label: fig-cross-lt
@ -780,28 +783,34 @@ Here we perform a self-join (i.e we join a table to itself), then use the inequa
knitr::include_graphics("diagrams/join/cross-lt.png", dpi = 270)
```
### Rolling joins
Rolling joins are a special type of inequality join where instead of getting *every* row that satisfies the inequality, you get just the closest row.
They're particularly useful when you have two tables of dates that don't perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.
@fig-join-following.
Inequality joins are extremely general, so general that it's hard to come up with meaningful specific use cases.
One small useful technique is to filter the cross-join so that instead of generating all permutations, we generate all combinations.
```{r}
#| label: fig-join-following
df <- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))
df |> left_join(df, join_by(id < id))
```
### Rolling joins
Rolling joins are a special type of inequality join where instead of getting *every* row that satisfies the inequality, you get just the closest row, as in @fig-join-closest. You can turn any inequality join into a rolling join by adding `closest()`.
For example `join_by(closest(x <= y))` finds the smallest `y` that's greater than or equal to x, and `join_by(closest(x > y))` finds the biggest `y` that's less than x.
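A small numeric sketch of `closest()`, assuming dplyr 1.1.0 or later. The tables and the column names `obs` and `cutoff` are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Hypothetical data: observations and candidate cutoffs
a <- tibble(obs = c(1, 5, 10))
b <- tibble(cutoff = c(2, 6, 7))

# For each obs, keep only the smallest cutoff that is >= obs.
# obs = 10 has no cutoff at or above it, so the inner join drops it.
rolled <- a |> inner_join(b, join_by(closest(obs <= cutoff)))
rolled
```

Without `closest()`, `obs = 1` would match all three cutoffs and `obs = 5` would match two; the rolling join keeps just the nearest one for each.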
```{r}
#| label: fig-join-closest
#| echo: false
#| out-width: ~
#| fig-cap: >
#| A following join is similar to a greater-than-or-equal inequality join
#| but only matches the first value.
knitr::include_graphics("diagrams/join/following.png", dpi = 270)
knitr::include_graphics("diagrams/join/closest.png", dpi = 270)
```
You can turn any inequality join into a rolling join by adding `closest()`.
For example `join_by(closest(x <= y))` finds the smallest `y` that's greater than or equal to x, and `join_by(closest(x > y))` finds the biggest `y` that's less than x.
Rolling joins are particularly useful when you have two tables of dates that don't perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.
For example, imagine that you're in charge of office birthdays.
Your company is rather stingy so instead of having individual parties, you only have a party once each quarter.
Parties are always on a Monday, and you skip the first week of January since a lot of people are on holiday and the first Monday of Q3 is July 4, so that has to be pushed back a week.
Your company is rather cheap so instead of having individual parties, you only have a party once each quarter.
Parties are always on a Monday. You skip the first week of January since a lot of people are on holiday, and the first Monday of Q3 2022 is July 4, so that party has to be pushed back a week.
That leads to the following party days:
```{r}
@ -811,10 +820,9 @@ parties <- tibble(
)
```
Then we have a table of employees along with their birthdays:
Now imagine that we have a table of employee birthdays:
```{r}
set.seed(1014)
employees <- tibble(
name = wakefield::name(100),
birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
@ -822,10 +830,7 @@ employees <- tibble(
employees
```
To find out which party each employee will use to celebrate their birthday, we can use a rolling join.
We have to frame the
We want to find the first party that's before their birthday so we can use following rolling join:
For each employee we want to find the first party date that comes after (or on) their birthday:
```{r}
#| eval: false
@ -841,8 +846,15 @@ employees |>
### Overlap joins
There's one problem with the strategy uses for assigning birthday parties above: there's no party preceding the birthdays Jan 1-9.
So maybe we'd be better off being explicit about the date ranges that each party spans, and make a special case for those early bithdays:
Overlap joins provide three helpers that use inequality joins to make it easier to work with intervals:
- `between(x, y_lower, y_upper)` is short for `x >= y_lower, x <= y_upper`.
- `within(x_lower, x_upper, y_lower, y_upper)` is short for `x_lower >= y_lower, x_upper <= y_upper`.
- `overlaps(x_lower, x_upper, y_lower, y_upper)` is short for `x_lower <= y_upper, x_upper >= y_lower`.
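To make the shorthand concrete, here is a sketch comparing `between()` with the pair of inequalities it stands for. It assumes dplyr 1.1.0 or later; the tables `obs` and `ranges` and their columns are hypothetical.

```r
library(dplyr)

# Hypothetical data: values to classify and the intervals to test against
obs <- tibble(id = 1:4, val = c(1, 5, 12, 20))
ranges <- tibble(
  lower = c(0, 10),
  upper = c(5, 15),
  label = c("low", "mid")
)

# Using the helper: val must fall inside [lower, upper], bounds included
with_helper <- obs |>
  inner_join(ranges, join_by(between(val, lower, upper)))

# The same condition spelled out as two inequalities
spelled_out <- obs |>
  inner_join(ranges, join_by(val >= lower, val <= upper))
```

Both joins keep ids 1 to 3 and drop id 4, whose value of 20 falls in no interval.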
Let's continue the birthday example to see how you might use them.
There's one problem with the strategy used above: there's no party preceding the birthdays Jan 1-9.
So it might be better to be explicit about the date ranges that each party spans, and make a special case for those early birthdays:
```{r}
parties <- tibble(
@ -854,6 +866,27 @@ parties <- tibble(
parties
```
I'm hopelessly bad at data entry so I also want to check that my party periods don't overlap.
I can perform a self-join and check to see if any start-end interval overlaps with any other:
```{r}
parties |>
inner_join(parties, join_by(overlaps(start, end, start, end), q < q)) |>
select(start.x, end.x, start.y, end.y)
```
Let's fix that problem and continue:
```{r}
parties <- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = lubridate::ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
)
```
Now we can match each employee to their party.
This is a good place to use `unmatched = "error"` because I want to find out if any employees didn't get assigned a party.
```{r}
@ -861,24 +894,6 @@ employees |>
inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
```
We could also flip the question around and ask which employees will celebrate in each party.
This requires explicitly specifying which table each variable comes from since otherwise `between()` assumes that the first argument comes from `x` and the second and third come from `y`.
```{r}
parties |>
inner_join(employees, join_by(between(y$birthday, x$start, x$end)))
```
Finally, I'm hopelessly bad at data entry so I also want to check that my party periods don't overlap.
I can perform an self-join and use an `overlaps()` join:
```{r}
parties |>
inner_join(parties, join_by(overlaps(start, end, start, end), q < q))
```
In other situations you might instead use `within()` which for each row in `x` finds all rows in `y` where the x internal is within the y interval.
### Exercises
1. Can you explain what's happening with the keys in this equi-join?
@ -889,3 +904,7 @@ In other situations you might instead use `within()` which for each row in `x` f
x |> full_join(y, by = "key", keep = TRUE)
```
2. When checking whether any party period overlapped with another party period, I used `q < q` in the `join_by()`.
Why?
What happens if you remove this inequality?