Polishing first half

This commit is contained in:
Hadley Wickham 2022-05-24 10:12:41 -05:00
parent 262f4ba02f
commit a3651082c5
1 changed files with 129 additions and 110 deletions


## Introduction
A huge amount of data lives in databases, and it's essential that as a data scientist you know how to access it.
Sometimes it's possible to get someone to download a snapshot into a .csv for you, but this is generally not desirable as the iteration speed is very slow.
You want to be able to reach into the database directly to get the data you need, when you need it.
That said, it's still a good idea to make friends with your local database administrator (DBA for short), because as your queries get more complicated they will be able to help you optimize them, either by adding new indices to the database or by helping you polish your SQL code.
In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and how to retrieve data by executing an SQL query.
**SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for you to learn as a data scientist.
However, we're not going to start with SQL, but instead we'll teach you dbplyr, which can convert your dplyr code to the equivalent SQL.
We'll use that as a way to teach you some of the most important features of SQL.
You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.
The main focus of this chapter is working with data that already exists, data that someone else has collected in a database for you, as this represents the most common case.
But as we go along, we will point out a few tips and tricks for getting your own data into a database.
### Prerequisites
In this chapter, we'll add DBI and dbplyr into the mix.
DBI provides a low-level interface for connecting to databases and executing SQL.
dbplyr is a high-level interface that works with dplyr verbs to automatically generate SQL, which it then executes using DBI.
```{r}
#| label: setup
#| message: false
library(DBI)
library(dbplyr)
library(tidyverse)
```
## Database basics
At the simplest level, you can think about a database as a collection of data frames, called **tables** in database terminology.
Like a data frame, a database table is a collection of named columns, where every value in a column is of the same type.
There are three high level differences between data frames and database tables:
- Database tables are stored on disk and can be arbitrarily large.
Data frames are stored in memory, and hence can't be bigger than your memory.
- Database tables usually have indexes.
Much like an index of a book, a database index makes it possible to find rows of interest without having to read every row.
Data frames and tibbles don't have indexes, but data.tables do, which is one of the reasons that they're so fast.
- Most classical databases are optimized for rapidly collecting data, not analyzing existing data.
These databases are called **row-oriented** because the data is stored row-by-row, rather than column-by-column like R.
More recently, there's been much development of **column-oriented** databases that make analyzing the existing data much faster.
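To make the second difference concrete, a database index is created with a SQL statement along these lines (a sketch; the table and column names are hypothetical):

``` sql
-- Speed up queries that filter or join on order_date
CREATE INDEX idx_sales_date ON sales (order_date);
```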
Databases are run by database management systems (**DBMS** for short), which are typically run on a powerful central server.
Popular open source DBMS's of this nature are MariaDB, PostgreSQL, and MySQL, and commercial equivalents include SQL Server and Oracle.
Today, many DBMS's run in the cloud, like Snowflake, Amazon's RedShift, and Google's BigQuery.
## Connecting to a database
When you work with a "real" database, i.e. a database that's run by your organisation, it'll typically run on a powerful central server.
To connect to the database from R, you'll use a pair of packages:
- You'll always use DBI (**d**ata**b**ase **i**nterface), which provides a set of generic functions that connect to the database, upload data, run queries, and so on.
- You'll also use a package specific to the DBMS you're connecting to.
This package translates the generic commands into the specifics needed for a given DBMS.
For example, if you're connecting to Postgres you'll use the RPostgres package.
If you're connecting to MariaDB or MySQL, you'll use the RMariaDB package.
If you can't find a specific package for your DBMS, you can usually use the generic odbc package instead.
This uses the widespread ODBC standard.
odbc requires a little more setup because you'll also need to install and configure an ODBC driver.
Concretely, you create a database connection using `DBI::dbConnect()`.
The first argument specifies the DBMS and the second and subsequent arguments describe where the database lives and any credentials that you'll need to access it.
The following code shows a few typical examples:
```{r}
#| eval: false
con <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  username = "foo"
)
con <- DBI::dbConnect(
  RPostgres::Postgres(),
  hostname = "databases.mycompany.com",
  port = 1234
)
```
There's a lot of variation from DBMS to DBMS, so unfortunately we can't cover all the details here.
To connect to the database you care about, you'll need to do a little research.
Typically you can ask the other data scientists in your team or talk to your DBA (**d**ata**b**ase **a**dministrator).
The initial setup will often take a little fiddling (and maybe some googling) to get right, but you'll generally only need to do it once.
When you're done with the connection it's good practice to close it with `dbDisconnect(con)`.
This frees up resources on the database server for use by other people.
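Putting those pieces together, a typical session looks something like this sketch (the connection details are placeholders):

```{r}
#| eval: false
con <- DBI::dbConnect(RPostgres::Postgres(), hostname = "databases.mycompany.com")
# ... read and analyze data ...
DBI::dbDisconnect(con)
```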
### In this book
Setting up a DBMS would be a pain for this book, so we'll instead use a self-contained DBMS that lives entirely in an R package: duckdb.
Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you'll connect to the database.
This makes it great for teaching because you can easily run this code yourself, and easily take what you learn and apply it elsewhere.
Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R.
That's great for learning because it guarantees that you'll start from a clean slate every time you restart R:
```{r}
con <- DBI::dbConnect(duckdb::duckdb())
```
If you want to use duckdb for a real data analysis project[^import-databases-1], you'll also need to supply the `dbdir` argument to tell duckdb where to store the database files.
Assuming you're using a project (Chapter -@sec-workflow-scripts-projects), it's reasonable to store it in the `duckdb` directory of the current project:
[^import-databases-1]: Which we highly recommend: it's a great database for data science.
```{r}
#| eval: false
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
```
duckdb is a high-performance database that's designed very much with the needs of the data scientist in mind, and the developers understand R and the types of real problems that R users face.
As you'll see in this chapter, it's really easy to get started with but it can also handle very large datasets.
### Load some data {#sec-load-data}
Since this is a temporary database, we need to start by adding some data.
This is something that you won't usually need to do; in most cases you're connecting to a database specifically because it has the data you need.
Here we'll use the `mpg` and `diamonds` datasets from ggplot2, and all data in the nycflights13 package.
```{r}
dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)
```
Loading all of the data from the nycflights13 package is easy because dbplyr has a helper designed specifically for this case:
```{r}
dbplyr::copy_nycflights13(con)
```
If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
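For example, loading a csv might look like this (a sketch; the file name is hypothetical):

```{r}
#| eval: false
# Read flights.csv straight into a new table, without going via R
duckdb::duckdb_read_csv(con, "flights_raw", "flights.csv")
```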
## DBI basics
Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.
### What's there?
The most important database objects for data scientists are tables.
DBI provides two useful functions to either list all the tables in the database[^import-databases-2] or to check if a specific table already exists:
[^import-databases-2]: At least, all the tables that you have permission to see.
```{r}
dbListTables(con)
dbExistsTable(con, "foo")
```
### Extract some data
Once you've determined a table exists, you can retrieve it with `dbReadTable()`:
```{r}
con |>
  dbReadTable("diamonds") |>
  as_tibble()
```
`dbReadTable()` returns a `data.frame` so I use `as_tibble()` to convert it into a tibble so that it prints nicely.
```{=html}
<!--
Notice something important with the diamonds dataset: the `cut`, `color`, and `clarity` columns were originally ordered factors, but now they're regular factors.
This particular case isn't very important since ordered factors are barely different from regular factors, but it's good to know that the way that the database represents data can be slightly different to the way R represents data.
In this case, we're actually quite lucky because most databases don't support factors at all and would've converted the column to a string.
Again, not that important, because most of the time you'll be working with data that lives in a database, but good to be aware of if you're storing your own data into a database.
Generally you can expect numbers, strings, dates, and date-times to convert just fine, but other types may not.
-->
```
But in real life, it's rare that you'll use `dbReadTable()` because the whole reason you're using a database is that there's too much data to fit in a data frame, and you want to make use of the database to bring back only a small snippet.
Instead, you'll want to write a SQL query.
### Run a query {#sec-dbGetQuery}
The way you'll usually retrieve data is with `dbGetQuery()`.
It takes a database connection and some SQL code and returns a data frame:
```{r}
con |>
  dbGetQuery("
    SELECT carat, cut, clarity, color, price
    FROM diamonds
    WHERE price > 15000
  ") |>
  as_tibble()
```
Don't worry if you've never seen SQL code before as you'll learn more about it shortly.
But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and the rows where `price` is greater than 15,000.
You'll need to be a little careful with `dbGetQuery()` since it can potentially return more data than you have memory.
We won't discuss it further here, but if you're dealing with very large datasets it's possible to deal with a "page" of data at a time by using `dbSendQuery()` to get a "result set" which you can page through by calling `dbFetch()` until `dbHasCompleted()` returns `TRUE`.
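The paging pattern looks something like this sketch (the chunk size of 1000 is arbitrary):

```{r}
#| eval: false
res <- dbSendQuery(con, "SELECT * FROM diamonds")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 1000)  # retrieve the next page of rows
  # ... process chunk ...
}
dbClearResult(res)  # release the result set when finished
```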
There are lots of other functions in DBI that you might find useful if you're managing your own data (like `dbWriteTable()` which we used in @sec-load-data), but we're going to skip past them in the interests of staying focused on working with data that already lives in a database.
## dbplyr basics
Now that you've learned the low-level basics for connecting to a database and running a query, we're going to switch it up a bit and learn a bit about dbplyr.
dbplyr is a dplyr **backend**, which means that you write the dplyr code that you're already familiar with and dbplyr translates it to run in a different way, in this case to SQL.
To use dbplyr, you start by creating a `tbl()`: this creates something that looks like a tibble, but is really a reference to a table in a database[^import-databases-3]:
[^import-databases-3]: If you want to mix SQL and dbplyr, you can also create a tbl from a SQL query with `tbl(con, SQL("SELECT * FROM foo")).`
```{r}
diamonds_db <- tbl(con, "diamonds")
diamonds_db
```
You can tell it's a database query because it prints the database name at the top, and typically it won't be able to tell you the total number of rows.
This is because finding the total number of rows is often an expensive computation for a database.
We can create the SQL from @sec-dbGetQuery with the following dplyr code:
```{r}
big_diamonds_db <- diamonds_db |>
  filter(price > 15000) |>
  select(carat, cut, color, clarity, price)
big_diamonds_db
```
`big_diamonds_db` captures the transformations we want to perform on the data but doesn't actually perform them.
Instead, it translates your dplyr code into SQL, which you can see with `show_query()`:
```{r}
big_diamonds_db |>
  show_query()
```
To get the data back into R, we call `collect()`.
Behind the scenes, this generates the SQL, calls `dbGetQuery()`, and turns the result back into a tibble:
```{r}
big_diamonds <- big_diamonds_db |>
  collect()
big_diamonds
```
## SQL
This SQL is a little different to what you might write by hand: dbplyr quotes every variable name and may include parentheses when they're not absolutely needed.
If you were to write this by hand, you'd probably do:
``` sql
SELECT carat, cut, color, clarity, price
FROM diamonds
WHERE price > 15000
```
### SQL basics
The basic unit of composition in SQL is not a function, but a **statement**.
In fact, as a data scientist in most cases you won't even be able to run these statements.
This ensures that there's no way for you to accidentally mess things up.
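For example, statements that modify the database look like these sketches (using a hypothetical `prices` table):

``` sql
CREATE TABLE prices (carat REAL, price INTEGER);
INSERT INTO prices VALUES (0.23, 326);
```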
A `SELECT` statement is often called a query, and a query is made up of clauses.
Every query must have two clauses: `SELECT` and `FROM`[^import-databases-4].
The simplest query is something like `SELECT * FROM tablename`, which will select all columns from `tablename`.
[^import-databases-4]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1 + 1` to perform basic calculations.
But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
The following sections work through the most important optional clauses.
Unlike in R, SQL clauses must come in a specific order: `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`.
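For example, a query using all five clauses in the required order might look like this sketch:

``` sql
SELECT "color", AVG("price") AS "avg_price"
FROM "diamonds"
WHERE "carat" > 1
GROUP BY "color"
ORDER BY "avg_price"
```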
To work through these optional clauses, we'll also use the `flights` and `planes` tables that we loaded with `dbplyr::copy_nycflights13()`:

```{r}
flights <- tbl(con, "flights")
planes <- tbl(con, "planes")
```
### SELECT and FROM
The two most important clauses are `FROM`, which determines the source table or tables, and `SELECT` which determines which columns are in the output.
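As a first taste, here's a sketch of the translation for a simple `select()` (using the `diamonds_db` reference created earlier):

```{r}
#| eval: false
diamonds_db |>
  select(carat, cut, price) |>
  show_query()
```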
For example, dbplyr generates a `HAVING` clause when you filter on a summarized variable:

``` sql
SELECT "cut", COUNT(*) AS "n"
FROM "diamonds"
GROUP BY "cut"
HAVING "n" > 10.0
```
### Joins
```{r}
flights |> inner_join(planes, by = "tailnum") |> show_query()
```
## SQL expressions {#sec-sql-expressions}
For more on how dbplyr translates individual functions, see <https://dbplyr.tidyverse.org/articles/translation-function.html>.
Now that you understand the big picture of a SQL query and the equivalence between the SELECT clauses and dplyr verbs, it's time to look more at the details of the conversion of the individual expressions, i.e. what happens when you use `mean(x)` in a `summarize()`?
```{r}
dbplyr::translate_sql(a + 1)
dbplyr::translate_sql(1 + 2 * 3 / 4 ^ 5)
```
- In R strings are surrounded by `"` or `'` and variable names (if needed) use `` ` ``. In SQL, strings only use `'` and most databases use `"` for variable names.
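For example, here's a sketch of how that plays out in a translation (the string gets `'`, while the column name gets the database's identifier quotes):

```{r}
dbplyr::translate_sql(x == "a")
```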
dbplyr also translates common string and date-time manipulation functions.
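For example, a couple of sketches of those translations:

```{r}
dbplyr::translate_sql(paste0(x, "-", y))
dbplyr::translate_sql(substr(x, 1, 2))
```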
### SQL dialects
Note that every database uses a slightly different dialect of SQL.
For the vast majority of simple examples in this chapter, you won't see any differences.