# Databases {#sec-import-databases}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```

## Introduction
A huge amount of data lives in databases, and it's essential that as a data scientist you know how to access it.
Sometimes it's possible to get someone to download a snapshot into a .csv for you, but this is generally undesirable because the iteration speed is very slow.
You want to be able to reach into the database directly to get the data you need, when you need it.

In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and how to retrieve data by executing an SQL query.
**SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for you to learn as a data scientist.
However, we're not going to start with SQL; instead we'll teach you dbplyr, which can convert your dplyr code to the equivalent SQL.
We'll use that as a way to teach you some of the most important features of SQL.
You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.

The main focus of this chapter is working with data that already exists, data that someone else has collected in a database for you, as this represents the most common case.
But as we go along, we will point out a few tips and tricks for getting your own data into a database.

### Prerequisites
In this chapter, we'll add DBI and dbplyr into the mix.
DBI provides a low-level interface for connecting to databases and executing SQL.
dbplyr is a high-level interface that works with dplyr verbs to automatically generate SQL, which it then executes using DBI.

```{r}
#| label: setup
#| message: false

library(DBI)
library(dbplyr)
library(tidyverse)
```

## Database basics
At the simplest level, you can think about a database as a collection of data frames, called **tables** in database terminology.
Like a data frame, a database table is a collection of named columns, where every value in the column is the same type.
There are three high-level differences between data frames and database tables:

- Database tables are stored on disk and can be arbitrarily large.
  Data frames are stored in memory, and hence can't be bigger than your memory.

- Database tables usually have indexes.
  Much like the index of a book, a database index makes it possible to find rows of interest without having to read every row.
  Data frames and tibbles don't have indexes, but data.tables do, which is one of the reasons that they're so fast.

- Most classical databases are optimized for rapidly collecting data, not analyzing existing data.
  These databases are called **row-oriented** because the data is stored row-by-row, rather than column-by-column like R.
  More recently, there's been much development of **column-oriented** databases that make analyzing existing data much faster.

Databases are run by database management systems (**DBMS**s for short), which are typically run on a powerful central server.
Popular open source DBMSs of this nature are MariaDB, PostgreSQL, and MySQL, and commercial equivalents include SQL Server and Oracle.
Today, many DBMSs run in the cloud, like Snowflake, Amazon's RedShift, and Google's BigQuery.

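To give the index idea above a little more substance, an index is itself created with a short SQL statement; this is a sketch with a hypothetical table and column (exact syntax varies slightly across DBMSs):

``` sql
-- Create an index so lookups by customer_id no longer scan every row
-- (hypothetical table and column)
CREATE INDEX orders_customer_id_idx ON orders (customer_id);
```
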
## Connecting to a database

To connect to a database from R, you'll use a pair of packages:

- You'll always use DBI (**d**ata**b**ase **i**nterface), which provides a set of generic functions that connect to the database, upload data, run queries, and so on.

- You'll also use a client package specific to the DBMS you're connecting to.
  This package translates the generic commands into the specifics needed for a given DBMS.
  For example, if you're connecting to Postgres you'll use the RPostgres package.
  If you're connecting to MariaDB or MySQL, you'll use the RMariaDB package.

If you can't find a specific package for your DBMS, you can usually use the generic odbc package instead.
This uses the widespread ODBC standard.
odbc requires a little more setup because you'll also need to install and configure an ODBC driver.

Concretely, you create a database connection using `DBI::dbConnect()`.
The first argument specifies the DBMS and the second and subsequent arguments describe where the database lives and any credentials that you'll need to access it.
The following code shows a few typical examples:

```{r}
#| eval: false

con <- DBI::dbConnect(
  RMariaDB::MariaDB(), 
  username = "foo"
)
con <- DBI::dbConnect(
  RPostgres::Postgres(), 
  hostname = "databases.mycompany.com", 
  port = 1234
)
```

There's a lot of variation from DBMS to DBMS, so unfortunately we can't cover all the details here.
To connect to the database you care about, you'll need to do a little research.
Typically you can ask the other data scientists on your team or talk to your DBA (**d**ata**b**ase **a**dministrator).
The initial setup will often take a little fiddling (and maybe some googling) to get right, but you'll generally only need to do it once.

When you're done with the connection, it's good practice to close it with `dbDisconnect(con)`.
This frees up resources on the database server for use by other people.

### In this book
Setting up a DBMS would be a pain for this book, so we'll instead use a self-contained DBMS that lives entirely in an R package: duckdb.
Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you'll connect to the database.
This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.

Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R.
That's great for learning because it guarantees that you'll start from a clean slate every time you restart R:

```{r}
con <- DBI::dbConnect(duckdb::duckdb())
```

If you want to use duckdb for a real data analysis project[^import-databases-1], you'll also need to supply the `dbdir` argument to tell duckdb where to store the database files.
Assuming you're using a project (Chapter -@sec-workflow-scripts-projects), it's reasonable to store it in the `duckdb` directory of the current project:

[^import-databases-1]: Which we highly recommend: it's a great database for data science.

```{r}
#| eval: false

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
```

duckdb is a high-performance database that's designed very much with the needs of the data scientist in mind, and the developers very much understand R and the types of real problems that R users face.
As you'll see in this chapter, it's really easy to get started with, but it can also handle very large datasets.

### Load some data {#sec-load-data}
Since this is a temporary database, we need to start by adding some data.
Here we'll use the `mpg` and `diamonds` datasets from ggplot2:

```{r}
dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)
```

If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.

## DBI basics

Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.

### What's there?

The most important database objects for data scientists are tables.
DBI provides two useful functions that either list all the tables in the database[^import-databases-2] or check if a specific table already exists:

[^import-databases-2]: At least, all the tables that you have permission to see.

```{r}
dbListTables(con)
dbExistsTable(con, "foo")
```
### Extract some data
Once you've determined a table exists, you can retrieve it with `dbReadTable()`:

```{r}
con |> 
  dbReadTable("diamonds") |> 
  as_tibble()
```

`dbReadTable()` returns a `data.frame`, so I use `as_tibble()` to convert it into a tibble so that it prints nicely.

```{=html}
<!--
Notice something important with the diamonds dataset: the `cut`, `color`, and `clarity` columns were originally ordered factors, but now they're regular factors.
This particular case isn't very important since ordered factors are barely different to regular factors, but it's good to know that the way the database represents data can be slightly different to the way R represents data.
In this case, we're actually quite lucky because most databases don't support factors at all and would've converted the column to a string.
Again, not that important, because most of the time you'll be working with data that lives in a database, but good to be aware of if you're storing your own data into a database.
Generally you can expect numbers, strings, dates, and date-times to convert just fine, but other types may not.
-->
```

In real life, it's rare that you'll use `dbReadTable()` because the whole reason you're using a database is that there's too much data to fit in a data frame, and you want to use the database to bring back only a subset of the rows and columns.

### Run a query {#sec-dbGetQuery}

The way you'll usually retrieve data is with `dbGetQuery()`.
It takes a database connection and some SQL code and returns a data frame:

```{r}
con |> 
  dbGetQuery("
    SELECT carat, cut, clarity, color, price
    FROM diamonds
    WHERE price > 15000
  ") |> 
  as_tibble()
```

Don't worry if you've never seen SQL code before, as you'll learn more about it shortly.
But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and the rows where `price` is greater than 15,000.

You'll need to be a little careful with `dbGetQuery()` since it can potentially return more data than you have memory.
We won't discuss it further here, but if you're dealing with very large datasets, it's possible to deal with a "page" of data at a time by using `dbSendQuery()` to get a "result set" which you can page through by calling `dbFetch()` until `dbHasCompleted()` returns `TRUE`.
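
Another simple way to guard against pulling back too much data is to limit the number of rows in the query itself; most DBMSs, including duckdb, support a `LIMIT` clause:

``` sql
-- Peek at just the first 10 matching rows instead of the full result
SELECT carat, cut, clarity, color, price
FROM diamonds
WHERE price > 15000
LIMIT 10
```
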

### Other functions

There are lots of other functions in DBI that you might find useful if you're managing your own data (like `dbWriteTable()`, which we used in @sec-load-data), but we're going to skip past them in the interest of staying focused on working with data that already lives in a database.

## dbplyr basics

Now that you've learned the low-level basics of connecting to a database and running a query, we're going to switch it up a bit and learn a bit about dbplyr.
dbplyr is a dplyr **backend**, which means that you write the dplyr code that you're already familiar with, and dbplyr translates it to run in a different way, in this case as SQL.

To use dbplyr, you must first use `tbl()` to create an object that represents a database table[^import-databases-3]:

[^import-databases-3]: If you want to mix SQL and dbplyr, you can also create a tbl from a SQL query with `tbl(con, sql("SELECT * FROM foo"))`.

```{r}
diamonds_db <- tbl(con, "diamonds")
diamonds_db
```
This object is **lazy**: when you use dplyr verbs on it, dplyr doesn't do any work; it just records the sequence of operations that you want to perform and only performs them when needed.
For example, take the following pipeline:

```{r}
big_diamonds_db <- diamonds_db |> 
  filter(price > 15000) |> 
  select(carat:clarity, price)

big_diamonds_db
```

You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically won't tell you the number of rows.
This is because finding the total number of rows usually requires executing the complete query, something we're trying to avoid.

You can see the SQL generated by a dbplyr query by calling `show_query()`:

```{r}
big_diamonds_db |>
  show_query()
```

To get all the data back into R, you call `collect()`.
Behind the scenes, this generates the SQL, calls `dbGetQuery()` to get the data, then turns the result into a tibble:

```{r}
big_diamonds <- big_diamonds_db |> 
  collect()
big_diamonds
```

## SQL

The rest of the chapter will teach you a little about SQL through the lens of dbplyr.
It's a rather non-traditional introduction to SQL, but I hope it will get you quickly up to speed with the basics.
It will hopefully help you understand the parallels between SQL and dplyr, but it's not going to give you much practice writing SQL.
For that, I'd recommend [*SQL for Data Scientists*](https://sqlfordatascientists.com) by Renée M. P. Teate.
It's an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organizations.

Luckily, if you understand dplyr you're in a great place to quickly pick up SQL, because so many of the concepts are the same.
We'll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 dataset: the `flights` and `planes` tibbles.
These are easy to get into our learning database because dbplyr has a function designed for this exact scenario:

```{r}
dbplyr::copy_nycflights13(con)
flights <- tbl(con, "flights")
planes <- tbl(con, "planes")

options(dplyr.strict_sql = TRUE)
```

### SQL basics

Instead of functions, as in R, SQL has **statements**.
Common statements include `CREATE` for defining new tables, `INSERT` for adding data, and `SELECT` for retrieving data.
We're going to focus on `SELECT` statements, which are commonly called **queries**, because they are almost exclusively what you'll use as a data scientist.
Your job is usually to analyse existing data, and in most cases you won't even have permission to modify the data.
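
To give you a flavor of each, here's roughly what those three statement types look like; this is a sketch with a hypothetical table, and you're unlikely to need the first two as a data scientist:

``` sql
-- CREATE defines a new table (hypothetical example)
CREATE TABLE tailwinds (tailnum VARCHAR, speed INTEGER);

-- INSERT adds rows to it
INSERT INTO tailwinds VALUES ('N108UW', 9);

-- SELECT retrieves data; this is the statement you'll use constantly
SELECT tailnum, speed FROM tailwinds;
```
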

A query is made up of **clauses**.
Every query must have two clauses: `SELECT`[^import-databases-4] and `FROM`[^import-databases-5].
The simplest query is `SELECT * FROM tablename`, which selects all columns from the specified table.
This is what dplyr generates for an unadulterated table:

[^import-databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
    To avoid this confusion, we'll generally say query instead of `SELECT` statement.

[^import-databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
    But if you want to work with data (as you always do!) you'll also need a `FROM` clause.

```{r}
flights |> show_query()
planes |> show_query()
```

There are three other important clauses: `WHERE`, `ORDER BY`, and `GROUP BY`.
`WHERE` and `ORDER BY` control which rows are included and how they are ordered:

```{r}
flights |> 
  filter(dest == "IAH") |> 
  arrange(dep_delay) |>
  show_query()
```

`GROUP BY` works just like `dplyr::group_by()`, causing aggregation to happen by group:

```{r}
flights |> 
  group_by(dest) |> 
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |> 
  show_query()
```

There are two important differences between dplyr verbs and SELECT clauses:

- In SQL, case doesn't matter: you can write `select`, `SELECT`, or even `SeLeCt`. In this book we'll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.
- In SQL, order matters: you must always write the clauses in the order `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`. Confusingly, this order doesn't match how the clauses are actually evaluated, which is first `FROM`, then `WHERE`, `GROUP BY`, `SELECT`, and finally `ORDER BY`.
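
For example, a query using all five clauses must be written in that order, even though it's evaluated in a different one; a sketch:

``` sql
-- Written order:    SELECT, FROM, WHERE, GROUP BY, ORDER BY
-- Evaluation order: FROM, WHERE, GROUP BY, SELECT, ORDER BY
SELECT dest, avg(dep_delay) AS dep_delay
FROM flights
WHERE dep_delay IS NOT NULL
GROUP BY dest
ORDER BY dep_delay DESC
```
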

The following sections explore each clause in more detail.

::: callout-note
While SQL is a standard, it is an extremely complex standard and no database follows it exactly.
This means that while the main components that we'll focus on in this book are very similar between DBMSs, there are many minor variations.
Fortunately, dbplyr knows about this problem and generates different translations for different databases.
It's not perfect, but it's continually improving, and if you hit a problem you can file an issue [on GitHub](https://github.com/tidyverse/dbplyr/issues/) to help us improve it.
:::

### SELECT

The `SELECT` clause is the workhorse of queries, and is equivalent to `select()`, `mutate()`, `rename()`, `relocate()`, and, as you'll learn in the next section, `summarize()`.

`select()`, `rename()`, and `relocate()` have very direct translations to `SELECT`, as they affect where a column appears (if at all) along with its name:

```{r}
flights |> 
  select(year:day, starts_with("dep")) |> 
  show_query()
flights |> 
  rename(tail_num = tailnum) |> 
  show_query()
flights |> 
  relocate(hour:minute, .after = day) |> 
  show_query()
```

This example also shows you how SQL does renaming.
In SQL terminology, renaming is called **aliasing** and is done with `AS`.
Note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
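
In other words, the hand-written equivalent of `rename(tail_num = tailnum)` would look something like this sketch (showing just the renamed column):

``` sql
-- AS aliases a column: old name on the left, new name on the right
SELECT tailnum AS tail_num
FROM flights
```
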

The translations for `mutate()` are similarly straightforward:

```{r}
diamonds_db |> 
  mutate(price_per_carat = price / carat) |> 
  show_query()
```

We'll come back to the translation of individual components (like `/`) in @sec-sql-expressions.

::: callout-note
When working with other databases, you're likely to see variable names wrapped in some sort of quote, like this:

``` sql
SELECT "year", "month", "day", "dep_time", "dep_delay"
FROM "flights"
```

Or maybe:

``` sql
SELECT `year`, `month`, `day`, `dep_time`, `dep_delay`
FROM `flights`
```

Technically, you only need to quote **reserved words** like `SELECT` or `FROM`, to avoid confusion between column/table names and SQL operators.
But only a handful of client packages, like duckdb, know what all the reserved words are, so most packages quote everything just to be safe.
:::

### GROUP BY

`group_by()` is translated to the `GROUP BY`[^import-databases-6] clause and `summarise()` is translated to the `SELECT` clause:

[^import-databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.

```{r}
diamonds_db |> 
  group_by(cut) |> 
  summarise(
    n = n(),
    avg_price = mean(price, na.rm = TRUE)
  ) |> 
  show_query()
```
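
If you were writing this by hand, dropping dbplyr's defensive quoting, you might write something like:

``` sql
SELECT cut, count(*) AS n, avg(price) AS avg_price
FROM diamonds
GROUP BY cut
```
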

We'll come back to what's happening with the translation of `n()` and `mean()` in @sec-sql-expressions.

### WHERE

`filter()` is translated to the `WHERE` clause:

```{r}
flights |> 
  filter(dest == "IAH" | dest == "HOU") |> 
  show_query()
flights |> 
  filter(arr_delay > 0 & arr_delay < 20) |> 
  show_query()
```

There are a few important details to note here:

- `|` becomes `OR` and `&` becomes `AND`.
- SQL uses `=` for comparison, not `==`. SQL doesn't have assignment, so there's no potential for confusion there.
- SQL uses only `''` for strings, not `""`. In SQL, `""` is used to quote variable names, like R's ``` `` ```.

Another useful SQL operator is `IN`, which is very close to R's `%in%`:

```{r}
flights |> 
  filter(dest %in% c("IAH", "HOU")) |> 
  show_query()
```
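
Putting those details together, a hand-written version of that filter might look like this (the identifier quoting is optional here, since `dest` isn't a reserved word):

``` sql
SELECT *
FROM flights
WHERE dest IN ('IAH', 'HOU')
```
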

SQL uses `NULL` instead of `NA`.
`NULL`s behave similarly to `NA`s.
The main difference is that while they're "infectious" in comparisons and arithmetic, they are silently dropped when summarizing.
dbplyr will remind you about this behaviour the first time you hit it:

```{r}
flights |> 
  group_by(dest) |> 
  summarise(delay = mean(arr_delay))
```
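
You can see the same NULL-dropping behaviour directly in SQL; for example, in duckdb (where a `VALUES` list can serve as an inline table), `avg()` skips the `NULL` input rather than returning `NULL`:

``` sql
-- avg() ignores the NULL, so this averages just 1 and 2, giving 1.5
SELECT avg(x) AS mean_x
FROM (VALUES (1), (2), (NULL)) t(x)
```
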

If you want to learn more about how `NULL`s work, I recommend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.

Otherwise, you can work with `NULL`s using the functions you'd use for `NA`s in R:

```{r}
flights |> 
  filter(!is.na(dep_delay)) |> 
  show_query()
```

This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn't as simple as you might write by hand.
In this case, you could drop the parentheses and use a special operator that's easier to read:

``` sql
WHERE "dep_delay" IS NOT NULL
```

### ORDER BY

Ordering rows involves a straightforward translation from `arrange()` to the `ORDER BY` clause:

```{r}
flights |> 
  arrange(year, month, day, desc(dep_delay)) |> 
  show_query()
```

Notice how `desc()` is translated to `DESC`: this is another dplyr function whose name was directly inspired by SQL.

### Subqueries
Sometimes it's not possible to translate a dplyr pipeline into a single `SELECT` statement and you need to use a subquery.
A **subquery** is just a query used as a data source in the `FROM` clause, instead of the usual table.
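
In its simplest form, a subquery just wraps one query inside another's `FROM`; a sketch:

``` sql
-- The inner query acts as the data source for the outer one
SELECT dest, avg_delay
FROM (
  SELECT dest, avg(dep_delay) AS avg_delay
  FROM flights
  GROUP BY dest
) per_dest
WHERE avg_delay > 10
```
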

dplyr typically uses subqueries to work around limitations of SQL.
For example, expressions in the `SELECT` clause can't refer to columns that were just created.
That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes `year1`, and then the second (outer) query can compute `year2`:

```{r}
flights |> 
  mutate(
    year1 = year + 1,
    year2 = year1 + 1
  ) |> 
  show_query()
```

You'll also see this if you attempt to `filter()` a variable that you just created.
Remember, even though `WHERE` is written after `SELECT`, it's evaluated before it, so we need a subquery in this similarly simple case:

```{r}
flights |> 
  mutate(year1 = year + 1) |> 
  filter(year1 == 2014) |> 
  show_query()
```

Sometimes dbplyr will create a subquery where it's not needed, because it doesn't yet know how to optimize that translation.
As dbplyr improves over time, these cases will get rarer, but they will probably never go away.

### Joins

If you're familiar with dplyr's joins, SQL joins are very similar.
Unfortunately, dbplyr's current translations are rather verbose[^import-databases-7].
Here's a simple example:

[^import-databases-7]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃

```{r}
flights |> 
  left_join(planes, by = "tailnum") |> 
  show_query()
```

If you were writing this by hand, you'd probably write it as:

``` sql
SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
LEFT JOIN planes ON (flights.tailnum = planes.tailnum)
```

The main thing to notice here is the syntax: SQL joins use sub-clauses of the `FROM` clause to bring in additional tables, using `ON` to define how the tables are related.
dplyr's names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for `inner_join()`, `right_join()`, and `full_join()`:

``` sql
SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
INNER JOIN planes ON (flights.tailnum = planes.tailnum)

SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
RIGHT JOIN planes ON (flights.tailnum = planes.tailnum)

SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
FULL JOIN planes ON (flights.tailnum = planes.tailnum)
```

When you're working with data from a database, you're likely to need many more joins than when working with data from other sources.
That's because database tables are often stored in a highly normalized form, where each "fact" is stored in a single place.
Typically, this involves a complex network of tables connected by primary and foreign keys.
If you hit this scenario, the [dm package](https://cynkra.github.io/dm/), by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, can be a life saver.
It can automatically determine the connections between tables in a database using the constraints that DBAs often supply, automatically visualize them so you can see what's going on, and automatically generate the joins you need to connect one table to another.
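
To make the key terminology concrete, here's a sketch of what those constraints look like in SQL (hypothetical tables; a primary key uniquely identifies each row, and a foreign key points at a primary key in another table):

``` sql
CREATE TABLE airlines (
  carrier VARCHAR PRIMARY KEY,
  name    VARCHAR
);

CREATE TABLE flights2 (
  flight  INTEGER,
  -- foreign key: each flight's carrier must exist in airlines
  carrier VARCHAR REFERENCES airlines (carrier)
);
```
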

### Other verbs

dbplyr also translates other verbs like `distinct()`, `slice_*()`, and `intersect()`, and a growing selection of tidyr functions like `pivot_longer()` and `pivot_wider()`.
The easiest way to see the full set of what's currently available is to visit the dbplyr website: <https://dbplyr.tidyverse.org/reference/>.

## Function translations {#sec-sql-expressions}

So far we've focused on the big picture of how dplyr verbs are translated in to `SELECT` clauses.
2022-06-01 20:53:33 +08:00
Now we're going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use `mean(x)` in a `summarize()`?
You can generally trust dbplyr's translations, but again it's a good way to learn a bit more about SQL.
2022-05-11 06:26:04 +08:00
2022-06-01 20:53:33 +08:00
To help you see what's going on, I'm going to make a couple of little helper functions that run a `summarise()` or `mutate()` and show the generated SQL.
That'll make it a little easier to explore a few variations and see how summaries and transformations can differ.
2022-05-24 23:12:41 +08:00
2022-05-26 23:55:12 +08:00
```{r}
summarize_query <- function(df, ...) {
  df |> 
    summarise(...) |> 
    show_query()
}
mutate_query <- function(df, ...) {
  df |> 
    mutate(..., .keep = "none") |> 
    show_query()
}
```

Let's dive in with some summaries!
Some summary functions have a relatively simple translation, like `mean()`, which becomes `avg()`.
Other summary functions, like `median()`, have a much longer translation.

```{r}
flights |> 
  group_by(year, month, day) |> 
  summarize_query(
    mean = mean(arr_delay, na.rm = TRUE),
    median = median(arr_delay, na.rm = TRUE)
  )
```

The syntax looks a little more complicated when you use summary functions inside a `mutate()`, because we now need a **window function**.
You can turn an ordinary aggregation function into a window function by adding `OVER` after it:

```{r}
flights |> 
  group_by(year, month, day) |> 
  mutate_query(
    mean = mean(arr_delay, na.rm = TRUE),
  )
```

You can see here that the grouping moves from the `GROUP BY` clause to the `PARTITION BY` argument to `OVER`.

Window functions encompass all functions that look forward or backwards, like `lead()` and `lag()`:

```{r}
flights |> 
  group_by(dest) |> 
  arrange(time_hour) |> 
  mutate_query(
    lead = lead(arr_delay),
    lag = lag(arr_delay)
  )
```

Here it's important to `arrange()` the data, because SQL tables have no intrinsic order.
In fact, if you don't use `arrange()` you might get the rows back in a different order every time!
Notice that for window functions, the ordering information is used in two places.
That's because the `ORDER BY` clause of the main query isn't automatically inherited by `OVER`.
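
Stripped of dbplyr's quoting, the window-function pattern looks like this sketch, with the ordering appearing both inside `OVER` and in the query's own `ORDER BY`:

``` sql
SELECT
  lag(arr_delay) OVER (PARTITION BY dest ORDER BY time_hour) AS lag
FROM flights
ORDER BY time_hour
```
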

Moving back to regular transformations, another really important SQL function is `CASE WHEN`. It's used for `if_else()`, and it also inspired dplyr's `case_when()` function.
Here are a couple of simple examples:

```{r}
flights |> 
  mutate_query(
    description = if_else(arr_delay > 0, "delayed", "on-time")
  )
flights |> 
  mutate_query(
    description = 
      case_when(
        arr_delay < -5 ~ "early", 
        arr_delay < 5 ~ "on-time",
        arr_delay >= 5 ~ "late"
      )
  )
```

`CASE WHEN` is also used for some other functions that don't have a direct translation from R to SQL.
A good example of this is `cut()`:

```{r}
flights |> 
  mutate_query(
    description = cut(
      arr_delay, 
      breaks = c(-Inf, -5, 5, Inf), 
      labels = c("early", "on-time", "late")
    )
  )
```

dbplyr also translates common string and date-time manipulation functions, which you can learn about in `vignette("translation-function", package = "dbplyr")`.
dbplyr's translations are certainly not perfect, and there are many R functions that aren't translated yet, but dbplyr does a surprisingly good job of covering the functions that you'll use most of the time.