Light updates to joins chapter
parent
0705aceba7
commit
ca38492660
|
@ -1,4 +1,4 @@
|
||||||
# Two-table verbs {#sec-relational-data}
|
# Joins {#sec-relational-data}
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| results: "asis"
|
#| results: "asis"
|
||||||
|
@ -14,31 +14,24 @@ Waiting on <https://github.com/tidyverse/dplyr/pull/5910>
|
||||||
<!-- TODO: redraw all diagrams to match O'Reilly style -->
|
<!-- TODO: redraw all diagrams to match O'Reilly style -->
|
||||||
|
|
||||||
It's rare that a data analysis involves only a single data frame.
|
It's rare that a data analysis involves only a single data frame.
|
||||||
Typically you have many data frames, and you must combine them to answer the questions that you're interested in.
|
Typically you have many data frames, and you must **join** them together to answer the questions that you're interested in.
|
||||||
|
|
||||||
All the verbs in this chapter use a pair of data frames.
|
All the verbs in this chapter use a pair of data frames.
|
||||||
Fortunately this is enough, since you can combine three data frames by combining two pairs.
|
Fortunately this is enough, since you can combine three data frames by combining two pairs.
|
||||||
Sometimes both elements of a pair will be the same data frame.
|
Sometimes both elements of a pair will be the same data frame.
|
||||||
This is needed if, for example, you have a data frame of people, and each person has a reference to their parents.
|
This is needed if, for example, you have a data frame of people, and each person has a reference to their parents.
|
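As a rough sketch of the two points above (chaining pairwise joins, and joining a data frame to itself), the following uses small made-up tibbles. The data frames `a`, `b`, `c`, and `people`, and all of their columns, are hypothetical and not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data: three data frames sharing an `id` key
a <- tibble::tibble(id = c(1, 2), x = c("a1", "a2"))
b <- tibble::tibble(id = c(1, 2), y = c("b1", "b2"))
c <- tibble::tibble(id = c(1, 2), z = c("c1", "c2"))

# Combining three data frames is just two pairwise joins in a row
abc <- a %>%
  left_join(b, by = "id") %>%
  left_join(c, by = "id")

# Hypothetical people data: each person refers to a parent by id
people <- tibble::tribble(
  ~id, ~name, ~parent_id,
  1, "Ana", NA,
  2, "Ben", 1,
  3, "Cy", 1
)

# Self-join: both elements of the pair come from the same data frame.
# Renaming the columns first lets us match `parent_id` against `id`.
parents <- people %>% select(parent_id = id, parent_name = name)
with_parent <- left_join(people, parents, by = "parent_id")
```

Here `with_parent` has one row per person, with `parent_name` missing for anyone whose parent is not recorded.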
||||||
|
|
||||||
There are three families of verbs designed to work with pairs of data frames:
|
There are two important types of joins.
|
||||||
|
**Mutating joins** add new variables to one data frame from matching observations in another.
|
||||||
|
**Filtering joins** filter observations from one data frame based on whether or not they match an observation in another.
|
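To make the distinction between the two types concrete, here is a minimal sketch on made-up data. The tibbles `flights_small` and `airlines_small` are hypothetical stand-ins, not the real nycflights13 tables; `left_join()`, `semi_join()`, and `anti_join()` are the dplyr functions.

```r
library(dplyr)

# Hypothetical toy data, loosely modelled on flights and airlines
flights_small <- tibble::tribble(
  ~flight, ~carrier,
  1, "AA",
  2, "UA",
  3, "ZZ"
)
airlines_small <- tibble::tribble(
  ~carrier, ~name,
  "AA", "American Airlines",
  "UA", "United Air Lines"
)

# Mutating join: keeps every row of x and adds the `name` column from y.
# The unmatched carrier "ZZ" gets a missing value for `name`.
mutated <- left_join(flights_small, airlines_small, by = "carrier")

# Filtering joins: keep or drop rows of x, but never add columns.
matched <- semi_join(flights_small, airlines_small, by = "carrier")
unmatched <- anti_join(flights_small, airlines_small, by = "carrier")
```

The mutating join changes the columns, while the filtering joins only change which rows of `flights_small` survive.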
||||||
|
|
||||||
- **Mutating joins**, which adds new variables to one data frame from matching observations in another.
|
If you're familiar with SQL, you should find these ideas familiar, as their instantiation in dplyr is very similar.
|
||||||
|
We'll point out any important differences as we go.
|
||||||
- **Filtering joins**, which filters observations from one data frame based on whether or not they match an observation in another.
|
Don't worry if you're not familiar with SQL; we'll come back to it in @sec-import-databases.
|
||||||
|
|
||||||
- **Set operations**, which treat observations as if they were set elements.
|
|
||||||
|
|
||||||
The most common place to find relational data is in a *relational* database management system (or RDBMS), a term that encompasses almost all modern databases.
|
|
||||||
If you've used a database before, you've almost certainly used SQL.
|
|
||||||
If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different.
|
|
||||||
One other major terminology difference between databases and R is that what we generally refer to as data frames in R while the same concept is referred to as "table" in databases.
|
|
||||||
Hence you'll see references to one-table and two-table verbs in dplyr documentation.
|
|
||||||
Generally, dplyr is a little easier to use than SQL because dplyr is specialized to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren't commonly needed for data analysis.
|
|
||||||
If you're not familiar with databases or SQL, you'll learn more about them in [Chapter -@sec-import-databases].
|
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
We will explore relational data from `nycflights13` using the two-table verbs from dplyr.
|
We will explore relational data from nycflights13 using the join functions from dplyr.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| label: setup
|
#| label: setup
|
||||||
|
@ -50,7 +43,7 @@ library(nycflights13)
|
||||||
|
|
||||||
## nycflights13 {#sec-nycflights13-relational}
|
## nycflights13 {#sec-nycflights13-relational}
|
||||||
|
|
||||||
nycflights13 contains five tibbles : `airlines`, `airports`, `weather` and `planes` which are all related to the `flights` data frame that you used in [Chapter -@sec-data-transform] on data transformation:
|
nycflights13 contains five tibbles: `airlines`, `airports`, `weather`, and `planes`, which are all related to the fifth, the `flights` data frame that you used in @sec-data-transform on data transformation:
|
||||||
|
|
||||||
- `airlines` lets you look up the full carrier name from its abbreviated code:
|
- `airlines` lets you look up the full carrier name from its abbreviated code:
|
||||||
|
|
||||||
|
@ -253,7 +246,7 @@ We'll then use that to explain the four mutating join functions: the inner join,
|
||||||
When working with real data, keys don't always uniquely identify observations, so next we'll talk about what happens when there isn't a unique match.
|
When working with real data, keys don't always uniquely identify observations, so next we'll talk about what happens when there isn't a unique match.
|
||||||
Finally, you'll learn how to tell dplyr which variables are the keys for a given join.
|
Finally, you'll learn how to tell dplyr which variables are the keys for a given join.
|
||||||
|
|
||||||
### Understanding joins
|
## Join types
|
||||||
|
|
||||||
To help you learn how joins work, I'm going to use a visual representation:
|
To help you learn how joins work, I'm going to use a visual representation:
|
||||||
|
|
||||||
|
@ -727,42 +720,3 @@ Your own data is unlikely to be so nice, so there are a few things that you shou
|
||||||
|
|
||||||
Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly.
|
Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly.
|
||||||
If you have an inner join with duplicate keys in both data frames, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!
|
If you have an inner join with duplicate keys in both data frames, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!
|
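A small sketch of how this unlucky case can arise, and of two checks that would catch it. The tibbles `x` and `y` and their columns are hypothetical. Recent versions of dplyr may also warn about the many-to-many match here, which is itself a useful signal.

```r
library(dplyr)

# Hypothetical data: key 1 is duplicated in both x and y,
# and keys 2 and 3 in x have no match in y
x <- tibble::tibble(key = c(1, 1, 2, 3), val_x = c("a", "b", "c", "d"))
y <- tibble::tibble(key = c(1, 1), val_y = c("p", "q"))

# Two rows are dropped (keys 2 and 3) and two are gained (key 1 matches twice),
# so the row count is unchanged even though the join did not go smoothly
joined <- inner_join(x, y, by = "key")
nrow(x) == nrow(joined)

# Check 1: look for duplicated keys in each input
dup_x <- x %>% count(key) %>% filter(n > 1)
dup_y <- y %>% count(key) %>% filter(n > 1)

# Check 2: look for rows of x that have no match in y
dropped <- anti_join(x, y, by = "key")
```

Inspecting `dup_x`, `dup_y`, and `dropped` reveals both problems, which the row count alone hides.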
||||||
|
|
||||||
## Set operations {#sec-set-operations}
|
|
||||||
|
|
||||||
The final type of two-table verb are the set operations.
|
|
||||||
Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces.
|
|
||||||
All these operations work with a complete row, comparing the values of every variable.
|
|
||||||
These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:
|
|
||||||
|
|
||||||
- `intersect(x, y)`: return only observations in both `x` and `y`.
|
|
||||||
- `union(x, y)`: return unique observations in `x` and `y`.
|
|
||||||
- `setdiff(x, y)`: return observations in `x`, but not in `y`.
|
|
||||||
|
|
||||||
Given this simple data:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
df1 <- tribble(
|
|
||||||
~x, ~y,
|
|
||||||
1, 1,
|
|
||||||
2, 1
|
|
||||||
)
|
|
||||||
df2 <- tribble(
|
|
||||||
~x, ~y,
|
|
||||||
1, 1,
|
|
||||||
1, 2
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
The four possibilities are:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
intersect(df1, df2)
|
|
||||||
|
|
||||||
# Note that we get 3 rows, not 4
|
|
||||||
union(df1, df2)
|
|
||||||
|
|
||||||
setdiff(df1, df2)
|
|
||||||
|
|
||||||
setdiff(df2, df1)
|
|
||||||
```
|
|
||||||
|
|