Use dev dbplyr
parent
226e0061ad
commit
c83d21200d
|
@ -48,6 +48,7 @@ Suggests:
|
||||||
tidymodels,
|
tidymodels,
|
||||||
xml2
|
xml2
|
||||||
Remotes:
|
Remotes:
|
||||||
|
tidyverse/dbplyr,
|
||||||
tidyverse/stringr,
|
tidyverse/stringr,
|
||||||
tidyverse/tidyr,
|
tidyverse/tidyr,
|
||||||
jennybc/repurrrsive
|
jennybc/repurrrsive
|
||||||
|
|
110
databases.qmd
110
databases.qmd
|
@ -13,13 +13,13 @@ A huge amount of data lives in databases, so it's essential that you know how to
|
||||||
Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you'll have to communicate with another human.
|
Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you'll have to communicate with another human.
|
||||||
You want to be able to reach into the database directly to get the data you need, when you need it.
|
You want to be able to reach into the database directly to get the data you need, when you need it.
|
||||||
|
|
||||||
In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL[^import-databases-1] query.
|
In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL[^databases-1] query.
|
||||||
**SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for all data scientists to learn.
|
**SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for all data scientists to learn.
|
||||||
That said, we're not going to start with SQL, but instead we'll teach you dbplyr, which can translate your dplyr code to SQL.
|
That said, we're not going to start with SQL, but instead we'll teach you dbplyr, which can translate your dplyr code to SQL.
|
||||||
We'll use that as a way to teach you some of the most important features of SQL.
|
We'll use that as a way to teach you some of the most important features of SQL.
|
||||||
You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.
|
You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.
|
||||||
|
|
||||||
[^import-databases-1]: SQL is either pronounced "s"-"q"-"l" or "sequel".
|
[^databases-1]: SQL is either pronounced "s"-"q"-"l" or "sequel".
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
|
@ -73,10 +73,10 @@ This uses the ODBC protocol supported by many DBMS.
|
||||||
odbc requires a little more setup because you'll also need to install an ODBC driver and tell the odbc package where to find it.
|
odbc requires a little more setup because you'll also need to install an ODBC driver and tell the odbc package where to find it.
|
||||||
|
|
||||||
Concretely, you create a database connection using `DBI::dbConnect()`.
|
Concretely, you create a database connection using `DBI::dbConnect()`.
|
||||||
The first argument selects the DBMS[^import-databases-2], then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
|
The first argument selects the DBMS[^databases-2], then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
|
||||||
The following code shows a couple of typical examples:
|
The following code shows a couple of typical examples:
|
||||||
|
|
||||||
[^import-databases-2]: Typically, this is the only function you'll use from the client package, so we recommend using `::` to pull out that one function, rather than loading the complete package with `library()`.
|
[^databases-2]: Typically, this is the only function you'll use from the client package, so we recommend using `::` to pull out that one function, rather than loading the complete package with `library()`.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| eval: false
|
#| eval: false
|
||||||
|
@ -133,16 +133,16 @@ dbWriteTable(con, "diamonds", ggplot2::diamonds)
|
||||||
If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
|
If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
|
||||||
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
|
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
|
||||||
|
|
||||||
## Database basics
|
## DBI basics
|
||||||
|
|
||||||
Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.
|
Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.
|
||||||
|
|
||||||
### What's there?
|
### What's there?
|
||||||
|
|
||||||
The most important database objects for data scientists are tables.
|
The most important database objects for data scientists are tables.
|
||||||
DBI provides two useful functions to either list all the tables in the database[^import-databases-3] or to check if a specific table already exists:
|
DBI provides two useful functions to either list all the tables in the database[^databases-3] or to check if a specific table already exists:
|
||||||
|
|
||||||
[^import-databases-3]: At least, all the tables that you have permission to see.
|
[^databases-3]: At least, all the tables that you have permission to see.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
dbListTables(con)
|
dbListTables(con)
|
||||||
|
@ -279,14 +279,14 @@ Common statements include `CREATE` for defining new tables, `INSERT` for adding
|
||||||
We will focus on `SELECT` statements, also called **queries**, because they are almost exclusively what you'll use as a data scientist.
|
We will focus on `SELECT` statements, also called **queries**, because they are almost exclusively what you'll use as a data scientist.
|
||||||
|
|
||||||
A query is made up of **clauses**.
|
A query is made up of **clauses**.
|
||||||
There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-4] and `FROM`[^import-databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
|
There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^databases-4] and `FROM`[^databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
|
||||||
. This is what dplyr generates for an adulterated table
|
. This is what dplyr generates for an unadulterated table
|
||||||
:
|
:
|
||||||
|
|
||||||
[^import-databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
|
[^databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
|
||||||
To avoid this confusion, we'll generally use query instead of `SELECT` statement.
|
To avoid this confusion, we'll generally use query instead of `SELECT` statement.
|
||||||
|
|
||||||
[^import-databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
|
[^databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
|
||||||
But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
|
But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -334,14 +334,16 @@ The `SELECT` clause is the workhorse of queries and performs the same job as `se
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
planes |>
|
planes |>
|
||||||
select(tailnum, type, manufacturer, model) |>
|
select(tailnum, type, manufacturer, model, year) |>
|
||||||
show_query()
|
show_query()
|
||||||
|
|
||||||
planes |>
|
planes |>
|
||||||
|
select(tailnum, type, manufacturer, model, year) |>
|
||||||
rename(year_built = year) |>
|
rename(year_built = year) |>
|
||||||
show_query()
|
show_query()
|
||||||
|
|
||||||
planes |>
|
planes |>
|
||||||
|
select(tailnum, type, manufacturer, model, year) |>
|
||||||
relocate(manufacturer, model, .before = type) |>
|
relocate(manufacturer, model, .before = type) |>
|
||||||
show_query()
|
show_query()
|
||||||
```
|
```
|
||||||
|
@ -350,42 +352,48 @@ This example also shows you how SQL does renaming.
|
||||||
In SQL terminology renaming is called **aliasing** and is done with `AS`.
|
In SQL terminology renaming is called **aliasing** and is done with `AS`.
|
||||||
Note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
|
Note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
|
||||||
|
|
||||||
|
::: callout-note
|
||||||
|
In the examples above note that `"year"` and `"type"` are wrapped in double quotes.
|
||||||
|
That's because these are **reserved words** in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.
|
||||||
|
|
||||||
|
When working with other databases you're likely to see every variable name quoted, because only a handful of client packages, like duckdb, know what all the reserved words are; the rest quote everything to be safe.
|
||||||
|
|
||||||
|
``` sql
|
||||||
|
SELECT "tailnum", "type", "manufacturer", "model", "year"
|
||||||
|
FROM "planes"
|
||||||
|
```
|
||||||
|
|
||||||
|
Some other database systems use backticks instead of quotes:
|
||||||
|
|
||||||
|
``` sql
|
||||||
|
SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
|
||||||
|
FROM `planes`
|
||||||
|
```
|
||||||
|
:::
|
||||||
|
|
||||||
The translations for `mutate()` are similarly straightforward: each variable becomes a new expression in `SELECT`:
|
The translations for `mutate()` are similarly straightforward: each variable becomes a new expression in `SELECT`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
diamonds_db |>
|
flights |>
|
||||||
mutate(
|
mutate(
|
||||||
price_per_carat = price / carat
|
speed = distance / (air_time / 60)
|
||||||
) |>
|
) |>
|
||||||
show_query()
|
show_query()
|
||||||
```
|
```
|
||||||
|
|
||||||
We'll come back to the translation of individual components (like `/`) in @sec-sql-expressions.
|
We'll come back to the translation of individual components (like `/`) in @sec-sql-expressions.
|
||||||
|
|
||||||
::: callout-note
|
### FROM
|
||||||
When working with other databases you're likely to see variable names wrapped in some sort of quote character, like this:
|
|
||||||
|
|
||||||
``` sql
|
The `FROM` clause defines the data source.
|
||||||
SELECT "year", "month", "day", "dep_time", "dep_delay"
|
It's going to be rather uninteresting for a little while, because we're just using single tables.
|
||||||
FROM "flights"
|
You'll see more complex examples once we hit the join functions.
|
||||||
```
|
|
||||||
|
|
||||||
Or like this:
|
|
||||||
|
|
||||||
``` sql
|
|
||||||
SELECT `year`, `month`, `day`, `dep_time`, `dep_delay`
|
|
||||||
FROM `flights`
|
|
||||||
```
|
|
||||||
|
|
||||||
Quoting is only required for **reserved words** like `SELECT` or `FROM` to avoid confusion between column/tables names and SQL operators.
|
|
||||||
But only a handful of client packages, like duckdb, know what all the reserved words are, so most packages quote everything just to be safe.
|
|
||||||
:::
|
|
||||||
|
|
||||||
### GROUP BY
|
### GROUP BY
|
||||||
|
|
||||||
`group_by()` is translated to the `GROUP BY`[^import-databases-6] clause and `summarise()` is translated to the `SELECT` clause:
|
`group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarise()` is translated to the `SELECT` clause:
|
||||||
|
|
||||||
[^import-databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
|
[^databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
diamonds_db |>
|
diamonds_db |>
|
||||||
|
@ -430,7 +438,7 @@ flights |>
|
||||||
SQL uses `NULL` instead of `NA`.
|
SQL uses `NULL` instead of `NA`.
|
||||||
`NULL`s behave similarly to `NA`s.
|
`NULL`s behave similarly to `NA`s.
|
||||||
The main difference is that while they're "infectious" in comparisons and arithmetic, they are silently dropped when summarizing.
|
The main difference is that while they're "infectious" in comparisons and arithmetic, they are silently dropped when summarizing.
|
||||||
dbplyr will remind you about this behaviour the first time you hit it:
|
dbplyr will remind you about this behavior the first time you hit it:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
flights |>
|
flights |>
|
||||||
|
@ -438,7 +446,7 @@ flights |>
|
||||||
summarise(delay = mean(arr_delay))
|
summarise(delay = mean(arr_delay))
|
||||||
```
|
```
|
||||||
|
|
||||||
If you want to learn more about how NULLs work, I recomend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand,
|
If you want to learn more about how NULLs work, I recommend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
|
||||||
|
|
||||||
In general, you can work with `NULL`s using the functions you'd use for `NA`s in R:
|
In general, you can work with `NULL`s using the functions you'd use for `NA`s in R:
|
||||||
|
|
||||||
|
@ -455,6 +463,17 @@ In this case, you could drop the parentheses and use a special operator that's e
|
||||||
WHERE "dep_delay" IS NOT NULL
|
WHERE "dep_delay" IS NOT NULL
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Note that if you `filter()` a variable that you created using a summarize, dbplyr will generate a `HAVING` clause, rather than a `WHERE` clause.
|
||||||
|
This is one of the idiosyncrasies of SQL: `WHERE` is evaluated before `SELECT`, so SQL needs another clause that's evaluated afterwards.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
diamonds_db |>
|
||||||
|
group_by(cut) |>
|
||||||
|
summarise(n = n()) |>
|
||||||
|
filter(n > 100) |>
|
||||||
|
show_query()
|
||||||
|
```
|
||||||
|
|
||||||
### ORDER BY
|
### ORDER BY
|
||||||
|
|
||||||
Ordering rows involves a straightforward translation from `arrange()` to the `ORDER BY` clause:
|
Ordering rows involves a straightforward translation from `arrange()` to the `ORDER BY` clause:
|
||||||
|
@ -501,33 +520,14 @@ As dbplyr improves over time, these cases will get rarer but will probably never
|
||||||
### Joins
|
### Joins
|
||||||
|
|
||||||
If you're familiar with dplyr's joins, SQL joins are very similar.
|
If you're familiar with dplyr's joins, SQL joins are very similar.
|
||||||
Unfortunately, dbplyr's current translations are rather verbose[^import-databases-7].
|
|
||||||
Here's a simple example:
|
Here's a simple example:
|
||||||
|
|
||||||
[^import-databases-7]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
flights |>
|
flights |>
|
||||||
left_join(planes |> rename(year_built = year), by = "tailnum") |>
|
left_join(planes |> rename(year_built = year), by = "tailnum") |>
|
||||||
show_query()
|
show_query()
|
||||||
```
|
```
|
||||||
|
|
||||||
If you were writing this by hand, you'd probably write this as:
|
|
||||||
|
|
||||||
``` sql
|
|
||||||
SELECT
|
|
||||||
flights.*,
|
|
||||||
year as year_built,
|
|
||||||
"type",
|
|
||||||
manufacturer,
|
|
||||||
model,
|
|
||||||
engines,
|
|
||||||
seats,
|
|
||||||
speed
|
|
||||||
FROM flights
|
|
||||||
LEFT JOIN planes ON (flights.tailnum = planes.tailnum)
|
|
||||||
```
|
|
||||||
|
|
||||||
The main thing to notice here is the syntax: SQL joins use sub-clauses of the `FROM` clause to bring in additional tables, using `ON` to define how the tables are related.
|
The main thing to notice here is the syntax: SQL joins use sub-clauses of the `FROM` clause to bring in additional tables, using `ON` to define how the tables are related.
|
||||||
|
|
||||||
dplyr's names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for `inner_join()`, `right_join()`, and `full_join()`:
|
dplyr's names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for `inner_join()`, `right_join()`, and `full_join()`:
|
||||||
|
@ -641,7 +641,7 @@ Here's a couple of simple examples:
|
||||||
```{r}
|
```{r}
|
||||||
flights |>
|
flights |>
|
||||||
mutate_query(
|
mutate_query(
|
||||||
description = if_else(arr_deay > 0, "delayed", "on-time")
|
description = if_else(arr_delay > 0, "delayed", "on-time")
|
||||||
)
|
)
|
||||||
flights |>
|
flights |>
|
||||||
mutate_query(
|
mutate_query(
|
||||||
|
|
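A minimal sketch of the `WHERE` versus `HAVING` translation touched by this commit, assuming a dbplyr version recent enough to emit `HAVING` (such as the dev version this commit switches to). It is not part of databases.qmd: `diamonds_sim` is a hypothetical stand-in for `diamonds_db`, built with `lazy_frame()` so that no database connection is needed, and the thresholds are arbitrary.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr)

# Hypothetical stand-in for diamonds_db: a simulated remote table.
# lazy_frame() only records the operations; nothing is executed.
diamonds_sim <- lazy_frame(cut = "Ideal", price = 326)

query <- diamonds_sim |>
  filter(price > 1000) |>  # filter on a raw column, before aggregating
  group_by(cut) |>
  summarise(n = n()) |>
  filter(n > 100)          # filter on a column created by summarise()

# Render the query to a SQL string instead of running it.
sql <- as.character(sql_render(query))
cat(sql, "\n")
```

The first `filter()` acts on rows before grouping, so it should appear as a `WHERE` clause, while the second acts on the summarised column and should appear as `HAVING`. Older dbplyr releases may instead wrap the second filter in a subquery, in which case no `HAVING` appears.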