Joins proofing

Hadley Wickham 2022-10-12 10:36:02 -05:00
parent 5485a91b49
commit 3e167168e7
1 changed file with 6 additions and 9 deletions

@@ -13,24 +13,22 @@ It's rare that a data analysis involves only a single data frame.
Typically you have many data frames, and you must **join** them together to answer the questions that you're interested in.
This chapter will introduce you to two important types of joins:
-- Mutating joins, add new variables to one data frame from matching observations in another.
-- Filtering joins, filter observations from one data frame based on whether or not they match an observation in another.
+- Mutating joins, which add new variables to one data frame from matching observations in another.
+- Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.
We'll begin by discussing keys, the variables used to connect a pair of data frames in a join.
-You'll then see how to use joins to tackle a variety of challenges from the nycflights13 dataset.
+We cement the theory with an examination of the keys in the nycflights13 datasets, then use that knowledge to start joining data frames together.
Next we'll discuss how joins work, focusing on their action on the rows.
We'll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.
If you're familiar with SQL, you should find the ideas in this chapter familiar, as their realization in dplyr is very similar.
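The distinction between the two join types described above can be sketched with a pair of toy data frames. This is a minimal illustration, not part of the commit: the data frames `flights_toy` and `airlines_toy` and their values are hypothetical, and only `left_join()`, `semi_join()`, and `anti_join()` from dplyr are assumed.

```r
library(dplyr)

# Hypothetical toy data, not taken from nycflights13.
flights_toy <- tibble(
  flight  = c(1, 2, 3, 4),
  carrier = c("AA", "UA", "AA", "ZZ")
)
airlines_toy <- tibble(
  carrier = c("AA", "UA", "DL"),
  name    = c("American", "United", "Delta")
)

# Mutating join: keeps every row of flights_toy and adds `name`
# wherever the key `carrier` has a match (NA otherwise).
with_names <- flights_toy |>
  left_join(airlines_toy, by = "carrier")

# Filtering joins: keep or drop rows of flights_toy based on whether
# the key matches; no columns from airlines_toy are added.
matched   <- flights_toy |> semi_join(airlines_toy, by = "carrier")
unmatched <- flights_toy |> anti_join(airlines_toy, by = "carrier")
```

Here `with_names` gains a column, while `matched` and `unmatched` have the same columns as `flights_toy` and differ only in which rows they retain.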
### Prerequisites
::: callout-important
This chapter relies on features only found in dplyr 1.1.0, which is still in development.
-If you want to live life on the edge you can get the dev version with `devtools::install_github("tidyverse/dplyr")`.
+If you want to live life on the edge, you can get the dev version with `devtools::install_github("tidyverse/dplyr")`.
:::
-We'll explore the five related datasets from nycflights13 using the join functions from dplyr.
+In this chapter, we'll explore the five related datasets from nycflights13 using the join functions from dplyr.
```{r}
#| label: setup
@@ -42,8 +40,7 @@ library(nycflights13)
## Keys
-To understand joins, you need to first understand how two tables might be connected.
-The connection between a pair of tables is defined by a pair of keys, which each consist of one or more variables.
+To understand joins, you need to first understand how two tables can be connected through a pair of keys, with one key in each table.
In this section, you'll learn about the two types of key and their realization in the datasets of the nycflights13 package.
You'll also learn how to check that your keys are valid, and what to do if your table lacks a key.