From 1eed88433c537515adca360a0e37ed3b38124ffb Mon Sep 17 00:00:00 2001
From: Hadley Wickham
Date: Wed, 1 Mar 2023 14:01:24 -0600
Subject: [PATCH] Updates for new relationship argument (#1331)

---
 DESCRIPTION |  1 +
 joins.qmd   | 85 +++++++++++------------------------------------------
 2 files changed, 18 insertions(+), 68 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index 8d0350b..f77d7a4 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -36,5 +36,6 @@ Suggests:
     jpeg,
     knitr,
     sessioninfo
+Remotes: tidyverse/dplyr
 Encoding: UTF-8
 License: CC NC ND 3.0
diff --git a/joins.qmd b/joins.qmd
index 3559634..db6cccc 100644
--- a/joins.qmd
+++ b/joins.qmd
@@ -412,8 +412,7 @@ flights2 |>
 ## How do joins work?
 
 Now that you've used joins a few times it's time to learn more about how they work, focusing on how each row in `x` matches rows in `y`.
-We'll begin by using @fig-join-setup to introduce a visual representation of the two simple tibbles defined below.
-In these examples we'll use a single key called `key` and a single value column (`val_x` and `val_y`), but the ideas all generalize to multiple keys and multiple values.
+We'll begin by introducing a visual representation of joins, using the simple tibbles defined below and shown in @fig-join-setup. In these examples we'll use a single key called `key` and a single value column (`val_x` and `val_y`), but the ideas all generalize to multiple keys and multiple values.
 
 ```{r}
 x <- tribble(
@@ -446,7 +445,8 @@ y <- tribble(
 knitr::include_graphics("diagrams/join/setup.png", dpi = 270)
 ```
 
-@fig-join-setup2 shows all potential matches between `x` and `y` as the intersection between lines drawn from each row of `x` and each row of `y`.
+@fig-join-setup2 introduces the foundation for our visual representation.
+It shows all potential matches between `x` and `y` as the intersection between lines drawn from each row of `x` and each row of `y`.
 The rows and columns in the output are primarily determined by `x`, so the `x` table is horizontal and lines up with the output.
 
 ```{r}
@@ -465,8 +465,9 @@ The rows and columns in the output are primarily determined by `x`, so the `x` t
 knitr::include_graphics("diagrams/join/setup2.png", dpi = 270)
 ```
 
-In an actual join, matches will be indicated with dots, as in @fig-join-inner.
-The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values.
+To describe a specific type of join, we indicate matches with dots.
+The matches determine the rows in the output, a new data frame that contains the key, the x values, and the y values.
+For example, @fig-join-inner shows an inner join, where rows are retained if and only if the keys are equal.
 
 ```{r}
 #| label: fig-join-inner
@@ -484,7 +485,7 @@ The number of dots equals the number of matches, which in turn equals the number
 knitr::include_graphics("diagrams/join/inner.png", dpi = 270)
 ```
 
-An **outer join** keeps observations that appear in at least one of the data frames.
+We can apply the same principles to explain the **outer joins**, which keep observations that appear in at least one of the data frames.
 These joins work by adding an additional "virtual" observation to each data frame.
 This observation has a key that matches if no other key matches, and values filled with `NA`.
 There are three types of outer joins:
@@ -606,78 +607,26 @@ There are three possible outcomes for a row in `x`:
 - If it matches 1 row in `y`, it's preserved.
 - If it matches more than 1 row in `y`, it's duplicated once for each match.
 
-In principle, this means that there's no guaranteed correspondence between the rows in the output and the rows in the `x`:
-
-- There might be fewer rows if some rows in `x` don't match any rows in `y`.
-- There might be more rows if some rows in `x` match multiple rows in `y`.
-- There might be the same number of rows if every row in `x` matches one row in `y`.
-- There might be the same number of rows if some rows don't match any rows, and exactly the same number of rows match two rows in `y`!!
-
-Row expansion is a fundamental property of joins, but it's dangerous because it might happen without you realizing it.
-To avoid this problem, dplyr will warn whenever there are multiple matches:
+In principle, this means that there's no guaranteed correspondence between the rows in the output and the rows in `x`, but in practice, this rarely causes problems.
+There is, however, one particularly dangerous case which can cause a combinatorial explosion of rows.
+Imagine joining the following two tables:
 
 ```{r}
-df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
+df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
 df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))
+```
 
+While the first row in `df1` only matches one row in `df2`, the second and third rows both match two rows.
+This is sometimes called a `many-to-many` join, and will cause dplyr to emit a warning:
+
+```{r}
 df1 |>
   inner_join(df2, join_by(key))
 ```
 
-You can gain further control over row matching with two arguments:
+If you are doing this deliberately, you can set `relationship = "many-to-many"`, as the warning suggests.
 
-- `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
-- `multiple` controls what happens when a row in `x` matches more than one row in `y`. For equi-joins, it defaults to `"warn"` which emits a warning message if any rows have multiple matches.
 
-There are two common cases in which you might want to override these defaults: enforcing a one-to-one mapping or deliberately allowing the rows to increase.
-
-### One-to-one mapping
-
-Both `unmatched` and `multiple` can take value `"error"` which means that the join will fail unless each row in `x` matches exactly one row in `y`:
-
-```{r}
-#| error: true
-df1 <- tibble(x = 1)
-df2 <- tibble(x = c(1, 1))
-df3 <- tibble(x = 3)
-
-df1 |>
-  inner_join(df2, join_by(x), unmatched = "error", multiple = "error")
-df1 |>
-  inner_join(df3, join_by(x), unmatched = "error", multiple = "error")
-```
-
-Note that `unmatched = "error"` is not useful with `left_join()` because, as described above, every row in `x` has a fallback match to a virtual row in `y`.
-
-### Allow multiple rows
-
-Sometimes it's useful to deliberately expand the number of rows in the output.
-This can come about naturally if you "flip" the direction of the question you're asking.
-For example, as we've seen above, it's natural to supplement the `flights` data with information about the plane that flew each flight:
-
-```{r}
-#| results: false
-flights2 |>
-  left_join(planes, by = "tailnum")
-```
-
-But it's also reasonable to ask what flights did each plane fly:
-
-```{r}
-plane_flights <- planes |>
-  select(tailnum, type, engines, seats) |>
-  left_join(flights2, by = "tailnum")
-```
-
-Since this duplicates rows in `x` (the planes), we need to explicitly say that we're ok with the multiple matches by setting `multiple = "all"`:
-
-```{r}
-plane_flights <- planes |>
-  select(tailnum, type, engines, seats) |>
-  left_join(flights2, by = "tailnum", multiple = "all")
-
-plane_flights
-```
 
 ### Filtering joins {#sec-non-equi-joins}