Edits to joins chapter (#1086)

* Add missing word

* Delete a word

* Add missing word

* Don't say "value of a primary key"; use more parallel language

* Typo

* How about "Now"?

* Comma, wording, grammar

* Plural

* 'Special' used in same sense, unquoted, in previous exercise

* Add word, remove 's'

* Add words

* Subject-verb

* Don't use 'key' in a non-join-y way

* Copy edits to match details

* Wording

* Add words
Jennifer (Jenny) Bryan 2022-09-16 05:39:03 -07:00 committed by GitHub
parent 4ac50eb359
commit 0c9acc7074
1 changed files with 33 additions and 33 deletions

@ -17,9 +17,9 @@ This chapter will introduce you to two important types of joins:
- Filtering joins, filter observations from one data frame based on whether or not they match an observation in another.
We'll begin by discussing keys, the variables used to connect a pair of data frames in a join.
You'll then see how to use joins to a variety of challenges from the nycflights13 dataset.
You'll then see how to use joins to tackle a variety of challenges from the nycflights13 dataset.
Next we'll discuss how joins work, focusing on their action on the rows.
We'll finish up by with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.
We'll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.
If you're familiar with SQL, you should find the ideas in this chapter familiar, as their realization in dplyr is very similar.
@ -46,7 +46,7 @@ You'll also learn how to check that your keys are valid, and what to do if your
Every join involves a pair of keys: a primary key and a foreign key.
A **primary key** is a variable (or group of variables) that uniquely identifies an observation.
A **foreign key** is the value of a primary key in another table so can be used to lookup the corresponding observation.
A **foreign key** is the corresponding variable (or groups of variables) in another table.
Let's make those terms concrete by looking at four of the data frames in nycflights13:
- `airlines` lets you look up the full carrier name from its abbreviated code.
@ -57,7 +57,7 @@ Let's make those terms concrete by looking at four of the data frames in nycfigh
```
- `airports` gives information about each airport.
Its primary key is the three `faa` airport code.
Its primary key is the three letter `faa` airport code.
```{r}
airports
@ -80,7 +80,7 @@ Let's make those terms concrete by looking at four of the data frames in nycfigh
These datasets are all connected via the `flights` data frame because the `tailnum`, `carrier`, `origin`, `dest`, and `time_hour` variables are all foreign keys:
- `flights$tailnum` connects to primary key `planes$tailnum`.
- `flights$carrier` connects to primary key `airlines$carrer`.
- `flights$carrier` connects to primary key `airlines$carrier`.
- `flights$origin` connects to primary key `airports$faa`.
- `flights$dest` connects to primary key `airports$faa` .
- `flights$origin`-`flights$time_hour` connects to primary key `weather$origin`-`weather$time_hour`.
@ -115,7 +115,7 @@ knitr::include_graphics("diagrams/relational.png", dpi = 270)
### Checking primary keys
That that we've identified the primary keys in each table, it's good practice to verify that they do indeed uniquely identify each observation.
Now that we've identified the primary keys in each table, it's good practice to verify that they do indeed uniquely identify each observation.
One way to do that is to `count()` the primary keys and look for entries where `n` is greater than one.
This reveals that `planes` and `weather` both look good:
@ -144,7 +144,7 @@ weather |>
So far we haven't talked about the primary key for `flights`.
It's not super important here, because there are no data frames that use it as a foreign key, but it's still useful to consider because it's easier to work with observations if you have some way to describe them to others.
After a little thinking and experimentation we discovered that there are three variables that together uniquely identifies each flight:
After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:
```{r}
flights |>
@ -180,13 +180,13 @@ Surrogate keys can be particular useful when communicating to other humans: it's
1. We forgot to draw the relationship between `weather` and `airports` in @fig-flights-relationships.
What is the relationship and how should it appear in the diagram?
2. `weather` only contains information for the three origin airport in NYC.
2. `weather` only contains information for the three origin airports in NYC.
If it contained weather records for all airports in the USA, what additional connection would it make to `flights`?
3. The `year`, `month`, `day`, `hour`, and `origin` variables almost form a compound key for `weather`, but there's one hour that has duplicate observations.
Can you figure out what's special about that hour?
4. We know that some days of the year are "special" and fewer people than usual fly on them.
4. We know that some days of the year are special and fewer people than usual fly on them.
How might you represent that data as a data frame?
What would be the primary key?
How would it connect to the existing data frames?
@ -199,10 +199,10 @@ Surrogate keys can be particular useful when communicating to other humans: it's
Now that you understand how data frames are connected via keys, we can start using joins to better understand the `flights` dataset.
dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `full_join()`, `semi_join()`, and `anti_join()`.
They all the same interface: they take a pair of data frames `x` and `y` and return a data frame.
They all have the same interface: they take a pair of data frames `x` and `y` and return a data frame.
The order of the rows and columns in the output is primarily determined by `x`.
In this section, you'll learn how to use one mutating joins, `left_join()`, and two filtering joins, `semi_join()` and `anti_join()`.
In this section, you'll learn how to use one mutating join, `left_join()`, and two filtering joins, `semi_join()` and `anti_join()`.
In the next section, you'll learn exactly how these functions work, and about the remaining `inner_join()`, `right_join()` and `full_join()`.
### Mutating joins
@ -267,7 +267,7 @@ flights2 |>
left_join(planes)
```
We get a lot of missing matches our join is trying to use both `tailnum` and `year`.
We get a lot of missing matches because our join is trying to use both `tailnum` and `year`.
Both `flights` and `planes` have a `year` column but they mean different things: `flights$year` is the year the flight occurred and `planes$year` is the year the plane was built.
We only want to join on `tailnum` so we need to provide an explicit specification with `join_by()`:
@ -295,14 +295,14 @@ In older code you might see a different way of specifying the join keys, using a
- `by = "x"` corresponds to `join_by(x)`.
- `by = c("a" = "x")` corresponds to `join_by(a == x)`.
Now that it exists, we prefer `join_by()` since provides a more flexible specification that supports more types of join, as you'll learn in @sec-non-equi-joins.
Now that it exists, we prefer `join_by()` since it provides a more flexible specification that supports more types of join, as you'll learn in @sec-non-equi-joins.
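
As a rough illustration of that correspondence, the sketch below joins two small made-up tibbles both ways and compares the results. The names `df1`, `df2`, and their columns are hypothetical, and the code assumes dplyr 1.1.0 or later, where `join_by()` is available.

```r
library(dplyr)

# Two toy tables whose key columns have different names (a vs. x).
df1 <- tibble(a = c(1, 2, 3), v1 = c("p", "q", "r"))
df2 <- tibble(x = c(1, 2, 4), v2 = c("s", "t", "u"))

# Older style: a named character vector maps the key in df1 to the key in df2.
old_style <- left_join(df1, df2, by = c("a" = "x"))

# Newer style: the same key pairing written as an expression.
new_style <- left_join(df1, df2, join_by(a == x))

# Both specifications should describe the same join.
all.equal(old_style, new_style)
```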
### Filtering joins
As you might guess the primary action of a **filtering join** is to filter the rows.
There are two types: semi-joins and anti-joins.
**Semi-joins** keep all rows in `x` that have a match in `y`.
For example, we could use to filter the `airports` dataset to show just the origin airports:
For example, we could use a semi-join to filter the `airports` dataset to show just the origin airports:
```{r}
airports |>
@ -423,8 +423,8 @@ y <- tribble(
#| out-width: ~
#| fig-cap: >
#| Graphical representation of two simple tables. The coloured `key`
#| columns map background colour to key value. The grey columns represents
#| the "value" columns that is carried along for the ride.
#| columns map background colour to key value. The grey columns represent
#| the "value" columns that are carried along for the ride.
#| fig-alt: >
#| x and y are two data frames with 2 columns and 3 rows each. The first
#| column in each is the key and the second is the value. The contents of
@ -518,7 +518,7 @@ There are three types of outer joins:
```
- A **full join** keeps all observations that appear in `x` or `y`, @fig-join-full.
Every row of `x` and `y` `is` included in the output because both `x` and `y` have a fall back row of `NA`s.
Every row of `x` and `y` is included in the output because both `x` and `y` have a fall back row of `NA`s.
Note the output will consist of all `x` rows followed by the remaining `y` rows.
```{r}
@ -571,7 +571,7 @@ To understand what's going let's first narrow our focus to the `inner_join()` an
#| echo: false
#| out-width: ~
#| fig-cap: >
#| The three key ways a row in `x` can match. `x1` matches
#| The three ways a row in `x` can match. `x1` matches
#| one row in `y`, `x2` matches two rows in `y`, `x3` matches
#| zero rows in y. Note that while there are three rows in
#| `x` and three rows in the output, there isn't a direct
@ -584,20 +584,20 @@ To understand what's going let's first narrow our focus to the `inner_join()` an
knitr::include_graphics("diagrams/join/match-types.png", dpi = 270)
```
There are three possible outcomes for a row:
There are three possible outcomes for a row in `x`:
- If it doesn't match anything, it's dropped.
- If it matches 1 row, it's kept as is.
- If it matches more than 1 row, it's duplicated once for each match.
- If it matches 1 row in `y`, it's kept as is.
- If it matches more than 1 row in `y`, it's duplicated once for each match.
In principle, this means that there are no guarantees about the number of rows in the output of an `inner_join()`:
In principle, this means that there are no guarantees about the number of rows in the output of an `inner_join()`, compared to the number of rows in `x`.
- There might be fewer rows if some rows in `x` don't match any rows in `y`.
- There might be more rows if some rows in `x` match multiple rows in `y`.
- There might be the same number of rows if every row in `x` matches one row in `y`.
- There might be the same number of rows if some rows don't match any rows, and exactly the same number of rows match two rows in `y`!!
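
The situation in the figure can be reproduced with two small made-up tibbles, which may help make the row counting concrete. This is only a sketch: `x`, `y`, and their values are illustrative, and depending on your dplyr version the join may also emit a warning about multiple matches.

```r
library(dplyr)

# x1 matches one row in y, x2 matches two rows, x3 matches none.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

out <- inner_join(x, y, join_by(key))

# Same number of rows in and out, but not the same rows:
# x2 appears twice and x3 has been dropped.
nrow(x)
nrow(out)
out$val_x
```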
Row expansion is a fundamental property of joins, but it's dangerous because it might by hidden.
Row expansion is a fundamental property of joins, but it's dangerous because it might happen without you realizing it.
To avoid this problem, dplyr will warn whenever there are multiple matches:
```{r}
@ -612,7 +612,7 @@ This is another reason we recommend `left_join()` --- if it runs without warning
You can gain further control over row matching with two arguments:
- `unmatched` controls what happens when in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
- `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
- `multiple` controls what happens when a row in `x` matches more than one row in `y`. For equi-joins, it defaults to `"warn"` which emits a warning message if any rows have multiple matches.
There are two common cases in which you might want to override these defaults: enforcing a one-to-one mapping or deliberately allowing the rows to increase.
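
A minimal sketch of how `unmatched` changes the behavior of an `inner_join()`, using made-up tibbles (`x` and `y` are hypothetical, and the `unmatched` argument is assumed to be available, as in dplyr 1.1.0 and later):

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2), val_y = c("y1", "y2"))

# Default (unmatched = "drop"): the row with key 3 silently disappears.
dropped <- inner_join(x, y, join_by(key))
nrow(dropped)

# unmatched = "error": the same join now fails instead of dropping rows.
# tryCatch() turns the error into a message so the script keeps running.
result <- tryCatch(
  inner_join(x, y, join_by(key), unmatched = "error"),
  error = function(e) "join failed: unmatched rows"
)
result
```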
@ -638,7 +638,7 @@ Note that `unmatched = "error"` is not useful with `left_join()` because, as des
### Allow multiple rows
Sometimes it's useful to deliberately expand the number of rows in the output.
This can come about naturally if "flip" the direction of the question you're asking.
This can come about naturally if you "flip" the direction of the question you're asking.
For example, as we've seen above, it's natural to supplement the `flights` data with information about the plane that flew each flight:
```{r}
@ -655,7 +655,7 @@ plane_flights <- planes |>
left_join(flights2, by = "tailnum")
```
Since this duplicate rows in `x` (the planes), we need to explicitly say we're ok with the multiple matches by setting `multiple = "all"`:
Since this duplicates rows in `x` (the planes), we need to explicitly say we're ok with the multiple matches by setting `multiple = "all"`:
```{r}
plane_flights <- planes |>
@ -670,7 +670,7 @@ plane_flights
The number of matches also determines the behavior of the filtering joins.
The semi-join keeps rows in `x` that have one or more matches in `y`, as in @fig-join-semi.
The anti-join keeps rows in `x` that don't have a match in `y`, as in @fig-join-anti.
In both cases, only the existence of a match is important; it doesn't matter how many times its match.
In both cases, only the existence of a match is important; it doesn't matter how many times it matches.
This means that filtering joins never duplicate rows like mutating joins do.
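
To see this with data, here is a small sketch with made-up tibbles in which one key in `x` has two matches in `y`. The tables are illustrative only and assume a dplyr version that provides `join_by()`.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# key 2 matches twice in y, but semi_join() still returns that x row once.
kept <- semi_join(x, y, join_by(key))

# key 3 has no match in y, so it is the only row anti_join() returns.
not_matched <- anti_join(x, y, join_by(key))

# Every row of x lands in exactly one of the two results.
nrow(kept) + nrow(not_matched) == nrow(x)
```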
```{r}
@ -709,7 +709,7 @@ knitr::include_graphics("diagrams/join/anti.png", dpi = 270)
## Non-equi joins
So far you've only seen **equi-joins**, joins where the two rows match if the `x` keys equal the `y` keys.
So far you've only seen **equi-joins**, joins where the two rows match if the `x` keys are exactly equal to the `y` keys.
Now we're going to relax that restriction and discuss other ways of determining if a pair of rows match.
But before we can do that, we need to revisit a simplification we made above.
@ -736,7 +736,7 @@ knitr::include_graphics("diagrams/join/inner-both.png", dpi = 270)
```
When we move away from equi-joins we'll always show the keys, because the key values will often be different.
For example, instead matching when the `x$key` and `y$key` are equal, we could match whenever the `x$key` is greater than or equal the `y$key`, leading to @fig-join-gte.
For example, instead of matching only when the `x$key` and `y$key` are equal, we could match whenever the `x$key` is greater than or equal to the `y$key`, leading to @fig-join-gte.
dplyr's join functions understand this distinction so will always show both keys when you perform a non-equi-join.
```{r}
@ -882,7 +882,7 @@ parties
```
Hadley is hopelessly bad at data entry so he also wanted to check that the party periods don't overlap.
You can perform an self-join and check to see if any start-end interval overlaps with any other:
You can perform a self-join and check to see if any start-end interval overlaps with any other:
```{r}
parties |>
@ -911,7 +911,7 @@ employees |>
### Exercises
1. Can you explain what's happening the keys in this equi-join?
1. Can you explain what's happening with the keys in this equi-join?
Why are they different?
```{r}
@ -927,11 +927,11 @@ employees |>
## Summary
In this chapter, you've learned how to use mutating and filtering joins to combine data from a pair of data frames.
Along the way you learned how to identify keys, and the between primary and foreign keys.
Along the way you learned how to identify keys, and the difference between primary and foreign keys.
You also understand how joins work and how to figure out how many rows the output will have.
Finally, you've gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.
This chapter concludes the "Transform" part of the book where the focus was on the tools you could use with individual columns and tibbles.
You learned about dplyr and base functions for working with logical vectors, numbers, and complete tables, stringr functions for working with strings, lubridate functions for working with date-times, and forcats functions for working with factors.
In the next part of the book, you'll learn more getting various types of data into R in a tidy form.
In the next part of the book, you'll learn more about getting various types of data into R in a tidy form.