Relations are always defined between a pair of data frames.
All other relations are built up from this simple idea: the relations of three or more data frames are always a property of the relations between each pair.
Sometimes both elements of a pair can be the same data frame!
This is needed if, for example, you have a data frame of people, and each person has a reference to their parents.
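As a minimal sketch of such a self-relation, the chunk below joins a data frame to itself. The `people` tibble, its column names, and its values are hypothetical and exist only for illustration.

```{r}
library(dplyr)

# Hypothetical data: each person refers to their mother by id.
people <- tibble(
  id        = c(1, 2, 3, 4),
  name      = c("Ana", "Ben", "Cal", "Dee"),
  mother_id = c(NA, 1, 1, 2)
)

# Join people to itself: mother_id in the left copy is matched
# against id in the right copy, which contributes only the name.
people_with_mothers <- people %>%
  left_join(
    people %>% select(id, mother_name = name),
    by = c("mother_id" = "id")
  )

people_with_mothers
```

Both sides of the join are the same data frame; only the columns used as keys differ.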
The most common place to find relational data is in a *relational* database management system (or RDBMS), a term that encompasses almost all modern databases.
If you've used a database before, you've almost certainly used SQL.
If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different.
One other major terminology difference between databases and R is that what we generally refer to as a data frame in R is referred to as a "table" in databases.
Hence you'll see references to one-table and two-table verbs in dplyr documentation.
Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren't commonly needed for data analysis.
nycflights13 contains five tibbles: `airlines`, `airports`, `weather`, and `planes`, which are all related to the fifth, the `flights` data frame that you used in Chapter \@ref(data-transform) on data transformation:
```{r, echo = FALSE, fig.alt = "Diagram of the relationships between airports, planes, flights, weather, and airlines datasets from the nycflights13 package. The faa variable in the airports data frame is connected to the origin and dest variables in the flights data frame. The tailnum variable in the planes data frame is connected to the tailnum variable in flights. The year, month, day, hour, and origin variables in the weather data frame are connected to the variables with the same name in the flights data frame. And finally the carrier variable in the airlines data frame is connected to the carrier variable in the flights data frame. There are no direct connections between airports, planes, airlines, and weather data frames."}
When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight.
For example, each flight has one plane, but each plane has many flights.
In other data, you'll occasionally see a 1-to-1 relationship.
You can think of this as a special case of 1-to-many.
You can model many-to-many relations with a many-to-1 relation plus a 1-to-many relation.
For example, in this data there's a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.
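To make this concrete, here is a small sketch in which a flights-like table sits between airlines and airports. `flights_small` and its values are hypothetical stand-ins, not the real `flights` data.

```{r}
library(dplyr)

# Hypothetical miniature of the flights table: one row per flight.
flights_small <- tribble(
  ~carrier, ~dest,
  "AA",     "ORD",
  "AA",     "LAX",
  "UA",     "ORD",
  "UA",     "LAX",
  "UA",     "ORD"
)

# Each distinct (carrier, dest) pair is one airline-airport link.
carrier_dest <- flights_small %>% distinct(carrier, dest)

# Each airline flies to many airports ...
carrier_dest %>% count(carrier, name = "n_airports")

# ... and each airport hosts many airlines.
carrier_dest %>% count(dest, name = "n_airlines")
```

The many-to-many relation is never stored directly: it falls out of the 1-to-many relation from airlines to flights combined with the many-to-1 relation from flights to airports.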
```{r, echo = FALSE, out.width = NULL, fig.alt = "x and y are two data frames with 2 columns and 3 rows each. The first column in each is the key and the second is the value. The contents of these data frames are given in the subsequent code chunk."}
```{r, echo = FALSE, out.width = NULL, fig.alt = "x and y data frames placed next to each other, with the key variable moved up front in y so that the key variable in x and key variable in y appear next to each other."}
(If you look closely, you might notice that we've switched the order of the key and value columns in `x`. This is to emphasise that joins match based on the key; the value is just carried along for the ride.)
```{r join-inner, echo = FALSE, out.width = NULL, fig.alt = "Keys 1 and 2 in x and y data frames are matched and indicated with lines joining these rows with a dot in the middle. Hence, there are two dots in this diagram. The resulting joined data frame has two rows and 3 columns: key, val_x, and val_y. Values in the key column are 1 and 2 (the matched values)."}
(To be precise, this is an inner **equijoin** because the keys are matched using the equality operator. Since most joins are equijoins we usually drop that specification.)
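A minimal sketch of an inner join on two toy data frames follows. The keys match those described for `x` and `y` in the diagrams (1, 2, 3 and 1, 2, 4); the value labels are assumptions made for illustration.

```{r}
library(dplyr)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# Only keys present in both x and y survive: keys 1 and 2.
inner <- x %>% inner_join(y, by = "key")
inner
```

Keys 3 and 4 each appear in only one of the inputs, so neither shows up in the result.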
```{r, echo = FALSE, out.width = NULL, fig.alt = "Three diagrams for left, right, and full joins. In each diagram data frame x is on the left and y is on the right. The result of the join is always a data frame with three columns (key, val_x, and val_y). Left join: keys 1 and 2 from x are matched to those in y, key 3 is also carried along to the joined result since it's on the left data frame, but key 4 from y is not carried along since it's on the right but not on the left. The result is a data frame with 3 rows: keys 1, 2, and 3, all values from val_x, and the corresponding values from val_y for keys 1 and 2 with an NA for key 3, val_y. Right join: keys 1 and 2 from x are matched to those in y, key 4 is also carried along to the joined result since it's on the right data frame, but key 3 from x is not carried along since it's on the left but not on the right. The result is a data frame with 3 rows: keys 1, 2, and 4, all values from val_y, and the corresponding values from val_x for keys 1 and 2 with an NA for key 4, val_x. Full join: The resulting data frame has 4 rows: keys 1, 2, 3, and 4 with all values from val_x and val_y, however key 3, val_y and key 4, val_x are NAs since those keys aren't present in their respective data frames."}
The most commonly used join is the left join: you use this whenever you look up additional data from another data frame, because it preserves the original observations even when there isn't a match.
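The three outer joins can be compared side by side on toy data. This is a sketch under the assumption that `x` has keys 1, 2, 3 and `y` has keys 1, 2, 4; the value labels are made up.

```{r}
library(dplyr)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# Keeps every row of x; val_y is NA for key 3.
left <- x %>% left_join(y, by = "key")

# Keeps every row of y; val_x is NA for key 4.
right <- x %>% right_join(y, by = "key")

# Keeps every row of both; NAs fill whichever side is missing.
full <- x %>% full_join(y, by = "key")

left
right
full
```

In each case the unmatched rows are kept and padded with `NA`, which is what distinguishes the outer joins from the inner join.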
```{r, echo = FALSE, out.width = NULL, fig.alt = "Venn diagrams for inner, full, left, and right joins. Each join represented with two intersecting circles representing data frames x and y, with x on the left and y on the right. Shading indicates the result of the join. Inner join: Only intersection is shaded. Full join: Everything is shaded. Left join: Only x is shaded, but not the area in y that doesn't intersect with x. Right join: Only y is shaded, but not the area in x that doesn't intersect with y."}
It might jog your memory about which join preserves the observations in which data frame, but it suffers from a major limitation: a Venn diagram can't show what happens when keys don't uniquely identify an observation.
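Code shows what the Venn diagram cannot. In the sketch below, `x` has duplicate keys (1, 2, 2, 1) while the keys in `y` are unique; the value labels are assumptions for illustration.

```{r}
library(dplyr)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)

# Every row of x is kept, in order; the matching row of y is
# repeated for each duplicate key in x.
dup_left <- x %>% left_join(y, by = "key")
dup_left
```

The result has as many rows as `x`, with each value of `val_y` appearing twice.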
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram describing a left join where one of the data frames (x) has duplicate keys. Data frame x is on the left, has 4 rows and 2 columns (key, val_x), and has the keys 1, 2, 2, and 1. Data frame y is on the right, has 2 rows and 2 columns (key, val_y), and has the keys 1 and 2. Left joining these two data frames yields a data frame with 4 rows (keys 1, 2, 2, and 1) and 3 columns (val_x, key, val_y). All values from x$val_x are carried along, values in y for key 1 and 2 are duplicated."}
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram describing a left join where both data frames (x and y) have duplicate keys. Data frame x is on the left, has 4 rows and 2 columns (key, val_x), and has the keys 1, 2, 2, and 3. Data frame y is on the right, has 4 rows and 2 columns (key, val_y), and has the keys 1, 2, 2, and 3 as well. Left joining these two data frames yields a data frame with 6 rows (keys 1, 2, 2, 2, 2, and 3) and 3 columns (key, val_x, val_y). All values from both datasets are included."}
Note that the `year` variables (which appear in both input data frames, but are not constrained to be equal) are disambiguated in the output with a suffix.
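The suffixing is easy to see on a small example. The two tibbles below are hypothetical miniatures with a shared `tailnum` key and a `year` column that means something different in each.

```{r}
library(dplyr)

# Hypothetical data: year of the flight vs. year a plane was built.
flights_small <- tribble(
  ~year, ~tailnum,
   2013, "N101",
   2013, "N102"
)
planes_small <- tribble(
  ~tailnum, ~year,
  "N101",    1999,
  "N102",    2004
)

# Joining on tailnum only: the two year columns become year.x and year.y.
joined <- flights_small %>% left_join(planes_small, by = "tailnum")
names(joined)

# The suffix argument lets you choose more descriptive names.
renamed <- flights_small %>%
  left_join(planes_small, by = "tailnum", suffix = c("_flight", "_plane"))
names(renamed)
```

Only non-key columns that appear in both inputs receive a suffix; `tailnum` is left alone because it is the join key.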
For example, if we want to draw a map, we need to combine the flights data with the airports data, which contains the location (`lat` and `lon`) of each airport.
Each flight has an origin and a destination airport (`origin` and `dest`), so we need to specify which one we want to join to:
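One way to write this, sketched with the nycflights13 data, is to pass a named vector to `by`. The selection of columns is only there to keep the output narrow.

```{r}
library(dplyr)
library(nycflights13)

locations <- airports %>% select(faa, lat, lon)

# Match the destination airport: dest in flights against faa in airports.
flights_dest <- flights %>%
  select(year:day, origin, dest) %>%
  left_join(locations, by = c("dest" = "faa"))

# Match the origin airport instead.
flights_origin <- flights %>%
  select(year:day, origin, dest) %>%
  left_join(locations, by = c("origin" = "faa"))
```

The name on the left of `=` refers to a column in the first data frame and the name on the right to a column in the second; the output keeps the name from the first.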
The advantage of the specific dplyr verbs is that they more clearly convey the intent of your code: the difference between the joins is really important but concealed in the arguments of `merge()`.
dplyr's joins are also considerably faster than `merge()` and don't mess with the order of the rows.
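The sketch below puts a left join next to its `merge()` counterpart on toy data. The tibbles are made up, and `x` is deliberately stored out of key order to make the row-order difference visible.

```{r}
library(dplyr)

x <- tribble(
  ~key, ~val_x,
     3, "x3",
     1, "x1",
     2, "x2"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# dplyr: the verb names the join, and rows stay in the order of x.
left_dplyr <- left_join(x, y, by = "key")

# base R: the same join type is selected by the all.x argument,
# and by default merge() sorts the result on the key.
left_base <- merge(x, y, by = "key", all.x = TRUE)

left_dplyr
left_base
```

The other joins follow the same pattern: `merge(x, y)` corresponds to an inner join, `all.y = TRUE` to a right join, and `all = TRUE` to a full join.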
Joining on differently named variables in the two data frames, e.g. `inner_join(x, y, by = c("a" = "b"))`, uses a slightly different syntax in SQL: `SELECT * FROM x INNER JOIN y ON x.a = y.b`.
As this syntax suggests, SQL supports a wider range of join types than dplyr because you can connect the data frames using constraints other than equality (sometimes called non-equijoins).
Instead you can use a semi-join, which connects the two data frames like a mutating join, but instead of adding new columns, only keeps the rows in `x` that have a match in `y`:
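A minimal sketch on toy data, assuming `x` has keys 1, 2, 3 and `y` has keys 1, 2, 4 (the value labels are made up):

```{r}
library(dplyr)

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# Rows of x whose key also appears in y; no columns from y are added.
semi <- x %>% semi_join(y, by = "key")
semi
```

The result has the same columns as `x` and only the rows with keys 1 and 2.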
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram of a semi join. Data frame x is on the left and has two columns (key and val_x) with keys 1, 2, and 3. Data frame y is on the right and also has two columns (key and val_y) with keys 1, 2, and 4. Semi joining these two results in a data frame with two rows and two columns (key and val_x), with keys 1 and 2 (the only keys that match between the two data frames)."}
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram of a semi join with data frames with duplicated keys. Data frame x is on the left and has two columns (key and val_x) with keys 1, 2, 2, and 3. Data frame y is on the right and also has two columns (key and val_y) with keys 1, 2, 2, and 3 as well. Semi joining these two results in a data frame with four rows and two columns (key and val_x), with keys 1, 2, 2, and 3 (the matching keys, each appearing as many times as they do in x)."}
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram of an anti join. Data frame x is on the left and has two columns (key and val_x) with keys 1, 2, and 3. Data frame y is on the right and also has two columns (key and val_y) with keys 1, 2, and 4. Anti joining these two results in a data frame with one row and two columns (key and val_x), with key 3 only (the only key in x that is not in y)."}
Anti-joins are useful for diagnosing join mismatches.
For example, when connecting `flights` and `planes`, you might be interested to know that there are many `flights` that don't have a match in `planes`:
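One way to look at those unmatched flights, sketched with the nycflights13 data, is to anti-join on `tailnum` and count how often each unmatched tail number occurs:

```{r}
library(dplyr)
library(nycflights13)

# Flights whose tailnum has no matching row in planes,
# summarised by tail number, most frequent first.
unmatched <- flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)

unmatched
```

Sorting by count puts the tail numbers responsible for the most unmatched flights at the top, which is usually where to start investigating.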
You should usually do this based on your understanding of the data, not empirically by looking for a combination of variables that gives a unique identifier.
If you just look for variables without thinking about what they mean, you might get (un)lucky and find a combination that's unique in your current data, but the relationship might not be true in general.
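Once you have a candidate key that makes sense for the data, a quick count can confirm that it really is unique. The sketch below uses a hypothetical `planes_small` tibble; the column names and values are assumptions for illustration.

```{r}
library(dplyr)

# Hypothetical data: tailnum is meant to identify each plane.
planes_small <- tribble(
  ~tailnum, ~seats,
  "N101",      150,
  "N102",      180,
  "N103",      150
)

# A key is unique if no value occurs more than once.
tailnum_dups <- planes_small %>%
  count(tailnum) %>%
  filter(n > 1)
nrow(tailnum_dups) == 0

# seats fails the same check, so it cannot serve as a key.
seats_dups <- planes_small %>%
  count(seats) %>%
  filter(n > 1)
seats_dups
```

The check only tells you about the rows you have: it can rule a candidate out, but it cannot prove that the key will stay unique as new data arrives.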
If you do have missing keys, you'll need to be thoughtful about your use of inner vs. outer joins, carefully considering whether or not you want to drop rows that don't have a match.
If you have an inner join with duplicate keys in both data frames, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!
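The sketch below constructs such an unlucky case on made-up data, and shows that an anti-join reveals the dropped rows even though the row counts agree. Recent versions of dplyr may warn about the many-to-many match; the warning is suppressed here only to keep the output short.

```{r}
library(dplyr)

# Hypothetical data with duplicate keys on both sides.
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     1, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     1, "y2",
     5, "y3"
)

joined <- suppressWarnings(inner_join(x, y, by = "key"))

# Key 1 produces 2 x 2 = 4 rows, while keys 2 and 3 are dropped,
# so the result happens to have exactly as many rows as x.
nrow(x)
nrow(joined)

# The anti-join shows the rows of x that were silently lost.
dropped <- x %>% anti_join(y, by = "key")
dropped
```

Comparing row counts before and after the join is therefore not enough on its own; pairing it with an anti-join gives a more reliable check.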