Use dev dbplyr

Hadley Wickham 2022-08-04 08:00:38 -05:00
parent 226e0061ad
commit c83d21200d
2 changed files with 56 additions and 55 deletions


@ -48,6 +48,7 @@ Suggests:
tidymodels,
xml2
Remotes:
tidyverse/dbplyr,
tidyverse/stringr,
tidyverse/tidyr,
jennybc/repurrrsive


@ -13,13 +13,13 @@ A huge amount of data lives in databases, so it's essential that you know how to
Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you'll have to communicate with another human.
You want to be able to reach into the database directly to get the data you need, when you need it.
In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL[^import-databases-1] query.
In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL[^databases-1] query.
**SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for all data scientists to learn.
That said, we're not going to start with SQL, but instead we'll teach you dbplyr, which can translate your dplyr code to SQL.
We'll use that as a way to teach you some of the most important features of SQL.
You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.
[^import-databases-1]: SQL is either pronounced "s"-"q"-"l" or "sequel".
[^databases-1]: SQL is either pronounced "s"-"q"-"l" or "sequel".
### Prerequisites
@ -73,10 +73,10 @@ This uses the ODBC protocol supported by many DBMS.
odbc requires a little more setup because you'll also need to install an ODBC driver and tell the odbc package where to find it.
Concretely, you create a database connection using `DBI::dbConnect()`.
The first argument selects the DBMS[^import-databases-2], then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
The first argument selects the DBMS[^databases-2], then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
The following code shows a couple of typical examples:
[^import-databases-2]: Typically, this is the only function you'll use from the client package, so we recommend using `::` to pull out that one function, rather than loading the complete package with `library()`.
[^databases-2]: Typically, this is the only function you'll use from the client package, so we recommend using `::` to pull out that one function, rather than loading the complete package with `library()`.
```{r}
#| eval: false
@ -133,16 +133,16 @@ dbWriteTable(con, "diamonds", ggplot2::diamonds)
If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
## Database basics
## DBI basics
Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.
### What's there?
The most important database objects for data scientists are tables.
DBI provides two useful functions to either list all the tables in the database[^import-databases-3] or to check if a specific table already exists:
DBI provides two useful functions to either list all the tables in the database[^databases-3] or to check if a specific table already exists:
[^import-databases-3]: At least, all the tables that you have permission to see.
[^databases-3]: At least, all the tables that you have permission to see.
```{r}
dbListTables(con)
@ -279,14 +279,14 @@ Common statements include `CREATE` for defining new tables, `INSERT` for adding
We will focus on `SELECT` statements, also called **queries**, because they are almost exclusively what you'll use as a data scientist.
A query is made up of **clauses**.
There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-4] and `FROM`[^import-databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
. This is what dplyr generates for an adulterated table
There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^databases-4] and `FROM`[^databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
. This is what dplyr generates for an unadulterated table
:
[^import-databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
[^databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
To avoid this confusion, we'll generally use query instead of `SELECT` statement.
[^import-databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
[^databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
```{r}
@ -334,14 +334,16 @@ The `SELECT` clause is the workhorse of queries and performs the same job as `se
```{r}
planes |>
select(tailnum, type, manufacturer, model) |>
select(tailnum, type, manufacturer, model, year) |>
show_query()
planes |>
select(tailnum, type, manufacturer, model, year) |>
rename(year_built = year) |>
show_query()
planes |>
select(tailnum, type, manufacturer, model, year) |>
relocate(manufacturer, model, .before = type) |>
show_query()
```
@ -350,42 +352,48 @@ This example also shows you how SQL does renaming.
In SQL terminology renaming is called **aliasing** and is done with `AS`.
Note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
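To make the direction of aliasing concrete, here is a minimal sketch that renders a rename without any database. It assumes dbplyr's `lazy_frame()` as a stand-in for the real `planes` table; the name `planes_sim` and its toy values are hypothetical, and the exact quoting in the printed SQL depends on the backend and the dbplyr version.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for the planes table; lazy_frame() needs no database.
planes_sim <- lazy_frame(tailnum = "N10156", year = 2004L)

# In R the new name is on the left: year_built = year.
alias_sql <- planes_sim |>
  rename(year_built = year) |>
  sql_render()

# In the generated SQL the old name comes first, then AS, then the new name.
print(alias_sql)
```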
::: callout-note
In the examples above note that `"year"` and `"type"` are wrapped in double quotes.
That's because these are **reserved words** in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.
When working with other databases you're likely to see every variable name quoted, because only a handful of client packages, like duckdb, know what all the reserved words are, so the rest quote everything to be safe.
``` sql
SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"
```
Some other database systems use backticks instead of quotes:
``` sql
SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`
```
:::
The translations for `mutate()` are similarly straightforward: each variable becomes a new expression in `SELECT`:
```{r}
diamonds_db |>
flights |>
mutate(
price_per_carat = price / carat
speed = distance / (air_time / 60)
) |>
show_query()
```
We'll come back to the translation of individual components (like `/`) in @sec-sql-expressions.
::: callout-note
When working with other databases you're likely to see variable names wrapped in some sort of quote character, like this:
### FROM
``` sql
SELECT "year", "month", "day", "dep_time", "dep_delay"
FROM "flights"
```
Or like this:
``` sql
SELECT `year`, `month`, `day`, `dep_time`, `dep_delay`
FROM `flights`
```
Quoting is only required for **reserved words** like `SELECT` or `FROM` to avoid confusion between column/tables names and SQL operators.
But only a handful of client packages, like duckdb, know what all the reserved words are, so most packages quote everything just to be safe.
:::
The `FROM` clause defines the data source.
It's going to be rather uninteresting for a little while, because we're just using single tables.
You'll see more complex examples once we hit the join functions.
### GROUP BY
`group_by()` is translated to the `GROUP BY`[^import-databases-6] clause and `summarise()` is translated to the `SELECT` clause:
`group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarise()` is translated to the `SELECT` clause:
[^import-databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
[^databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
```{r}
diamonds_db |>
@ -430,7 +438,7 @@ flights |>
SQL uses `NULL` instead of `NA`.
`NULL`s behave similarly to `NA`s.
The main difference is that while they're "infectious" in comparisons and arithmetic, they are silently dropped when summarizing.
dbplyr will remind you about this behaviour the first time you hit it:
dbplyr will remind you about this behavior the first time you hit it:
```{r}
flights |>
@ -438,7 +446,7 @@ flights |>
summarise(delay = mean(arr_delay))
```
If you want to learn more about how NULLs work, I recomend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand,
If you want to learn more about how NULLs work, I recommend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
In general, you can work with `NULL`s using the functions you'd use for `NA`s in R:
@ -455,6 +463,17 @@ In this case, you could drop the parentheses and use a special operator that's e
WHERE "dep_delay" IS NOT NULL
```
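As a rough illustration of how an `NA` helper maps onto SQL's `NULL` operators, the sketch below translates `!is.na()` on a simulated table. It assumes dbplyr's `lazy_frame()` in place of the real `flights` table; `flights_sim` and its values are hypothetical, and the parenthesization of the rendered SQL may differ between dbplyr versions.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for flights, with one missing departure delay.
flights_sim <- lazy_frame(dep_delay = c(5, NA))

# is.na() has no direct SQL function; dbplyr translates it to a NULL test.
null_sql <- flights_sim |>
  filter(!is.na(dep_delay)) |>
  sql_render()

print(null_sql)
```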
Note that if you `filter()` a variable that you created using a summarize, dbplyr will generate a `HAVING` clause, rather than a `WHERE` clause.
This is one of the idiosyncrasies of SQL: `WHERE` is evaluated before `SELECT`, so SQL needs another clause that's evaluated afterwards.
```{r}
diamonds_db |>
group_by(cut) |>
summarise(n = n()) |>
filter(n > 100) |>
show_query()
```
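The same pipeline can be tried without a database connection. This is a sketch under stated assumptions: `diamonds_sim` is a hypothetical two-row stand-in built with dbplyr's `lazy_frame()`, and whether the filter is rendered as `HAVING` or as a `WHERE` around a subquery depends on the installed dbplyr version (the development version this commit switches to is the one expected to emit `HAVING`).

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for the diamonds table; values are illustrative only.
diamonds_sim <- lazy_frame(cut = c("Ideal", "Good"), price = c(326, 327))

# Filtering on a summary has to happen after grouping and aggregation.
having_sql <- diamonds_sim |>
  group_by(cut) |>
  summarise(n = n()) |>
  filter(n > 100) |>
  sql_render()

print(having_sql)
```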
### ORDER BY
Ordering rows involves a straightforward translation from `arrange()` to the `ORDER BY` clause:
@ -501,33 +520,14 @@ As dbplyr improves over time, these cases will get rarer but will probably never
### Joins
If you're familiar with dplyr's joins, SQL joins are very similar.
Unfortunately, dbplyr's current translations are rather verbose[^import-databases-7].
Here's a simple example:
[^import-databases-7]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃
```{r}
flights |>
left_join(planes |> rename(year_built = year), by = "tailnum") |>
show_query()
```
If you were writing this by hand, you'd probably write this as:
``` sql
SELECT
flights.*,
year as year_built,
"type",
manufacturer,
model,
engines,
seats,
speed
FROM flights
LEFT JOIN planes ON (flights.tailnum = planes.tailnum)
```
The main thing to notice here is the syntax: SQL joins use sub-clauses of the `FROM` clause to bring in additional tables, using `ON` to define how the tables are related.
dplyr's names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for `inner_join()`, `right_join()`, and `full_join()`:
@ -641,7 +641,7 @@ Here's a couple of simple examples:
```{r}
flights |>
mutate_query(
description = if_else(arr_deay > 0, "delayed", "on-time")
description = if_else(arr_delay > 0, "delayed", "on-time")
)
flights |>
mutate_query(