Working on second half

This commit is contained in:
Hadley Wickham 2022-05-25 17:03:37 -05:00
parent cd16ec24fc
commit a44a1a0551
1 changed files with 142 additions and 98 deletions

View File

@ -126,12 +126,11 @@ As you'll see in this chapter, it's really easy to get started with but it can a
### Load some data {#sec-load-data}
Since this is a temporary database, we need to start by adding some data.
Here we'll use the `mpg` and `diamonds` datasets from ggplot2, and all data in the nycflights13 package.
Here we'll use the `mpg` and `diamonds` datasets from ggplot2.
```{r}
dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)
dbplyr::copy_nycflights13(con)
```
If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
@ -206,7 +205,7 @@ There are lots of other functions in DBI that you might find useful if you're ma
Now that you've learned the low-level basics for connecting to a database and running a query, we're going to switch it up a bit and learn a bit about dbplyr.
dbplyr is a dplyr **backend**, which means that you write the dplyr code that you're already familiar with and dbplyr translates it to run in a different way, in this case to SQL.
To use dbplyr you start start by creating a `tbl()`: this creates something that looks like a tibble, but is really a reference to a table in a database[^import-databases-3]:
To use dbplyr, you must first use `tbl()` to create an object that represents a database table[^import-databases-3]:
[^import-databases-3]: If you want to mix SQL and dbplyr, you can also create a tbl from a SQL query with `tbl(con, SQL("SELECT * FROM foo")).`
@ -215,29 +214,29 @@ diamonds_db <- tbl(con, "diamonds")
diamonds_db
```
You can tell it's a database query because it prints the database name at the top, and typically it won't be able to tell you the total number of rows.
This is because finding the total number of rows is often an expensive computation for a database.
You can see the SQL generated by a dbplyr query by called `show_query()`.
We can create the SQL from @sec-dbGetQuery with the following dplyr code:
This object is **lazy**; when you use dplyr verbs on it, dplyr doesn't do any work: it just records the sequence of operations that you want to perform and only performs them when needed.
For example, take the following pipeline:
```{r}
big_diamonds_db <- diamonds_db |>
filter(price > 15000) |>
select(carat:clarity, price)
big_diamonds_db
```
`big_diamonds_db` captures the transformations we want to perform on the data but doesn't actually perform them.
Instead, it translates your dplyr code into SQL, which you can see with `show_query()`:
You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically it won't tell you the number of rows.
This is because finding the total number of rows usually requires executing the complete query, something we're trying to avoid.
You can see the SQL the dbplyr generates by a dbplyr query by calling `show_query()`:
```{r}
big_diamonds_db |>
show_query()
```
To get the data back into R, we call `collect()`.
Behind the scenes, this generates the SQL, calls `dbGetQuery()`, and turns the result back into a tibble:
To get all the data back into R, you call `collect()`.
Behind the scenes, this generates the SQL, calls `dbGetQuery()` to get the data, then turns the result into a tibble:
```{r}
big_diamonds <- big_diamonds_db |>
@ -247,105 +246,167 @@ big_diamonds
## SQL
This SQL is a little different to what you might write by hand: dbplyr quotes every variable name and may include parentheses when they're not absolutely needed.
If you were to write this by hand, you'd probably do:
The rest of the chapter will teach you a little about SQL through the lens of dbplyr.
It's a rather non-traditional introduction to SQL but I hope it will get you quickly up to speed with the basics.
It will hopefully help you understand the parallels between SQL and dplyr but it's not going to give you much practice writing SQL.
For that, I'd recommend [*SQL for Data Scientists*](https://sqlfordatascientists.com)by Renée M. P. Teate.
It's an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organisations.
``` sql
SELECT carat, cut, color, clarity, price
FROM diamonds
WHERE price > 15000
```
### SQL basics
The basic unit of composition in SQL is not a function, but a **statement**.
Common statements include `INSERT` for adding new data, `CREATE` for making new tables, and `UPDATE` for modifying data, and `SELECT` for retrieving data.
Unlike R SQL is (mostly) case insensitive, but by convention, to make them stand out the clauses are usually capitalized like `SELECT`, `FROM`, and `WHERE` above.
We're going to focus on `SELECT` statements because they are almost exclusively what you'll use as a data scientist.
The other statements will be handled by someone else; in the case that you need to update your own database, you can solve most problems with `dbWriteTable()` and/or `dbInsertTable()`.
In fact, as a data scientist in most cases you won't even be able to run these statements because you only have read only access to the database.
This ensures that there's no way for you to accidentally mess things up.
A `SELECT` statement is often called a query, and a query is made up of clauses.
Every query must have two clauses `SELECT` and `FROM`[^import-databases-4].
The simplest query is something like `SELECT * FROM tablename` which will select all columns from `tablename`. Other optional clauses allow you
[^import-databases-4]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculation.
But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
The following sections work through the most important optional clauses.
Unlike in R, SQL clauses must come in a specific order: `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`.
We'll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights dataset, the `flights` and `planes` tibbles.
These are easy to get into our learning database because dbplyr has a function designed for this exact scenario.
```{r}
dbplyr::copy_nycflights13(con)
flights <- tbl(con, "flights")
planes <- tbl(con, "planes")
```
### SELECT and FROM
### SQL basics
The two most important clauses are `FROM`, which determines the source table or tables, and `SELECT` which determines which columns are in the output.
There's no real equivalent to `FROM` in dbplyr; it's just the name of the data frame.
Instead of functions, like R, SQL has **statements**.
Common statements include `CREATE` for defining new tables, `INSERT` for adding data, and `SELECT` for retrieving data.
We're going to focus on `SELECT` statements, aka **queries**, because they are almost exclusively what you'll use as a data scientist.
Your job is usually to analyse existing data, and in most cases you won't even have permission to modify the data.
`SELECT` is the workhorse of SQL queries, and is used for `select()`, `mutate()`, `rename()`, and `relocate()`.
In the next section, you'll see that `SELECT` is *also* used for `summarize()` when paired with `GROUP BY`.
A query is made up of **clauses**.
Every query must have two clauses `SELECT` and `FROM`[^import-databases-4].
The simplest query is uses `SELECT * FROM tablename` to select columns from the specified table.
This is what dplyr generates for an adulterated table:
`select()`, `rename()`, and `relocate()` have very direct translations to `SELECT` --- they just change the number and order of the variables, renaming where necessary with `AS`.
Unlike R, the old name is on the left and the new name is on the right.
[^import-databases-4]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculation.
But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
```{r}
diamonds_db |> select(cut:carat) |> show_query()
diamonds_db |> rename(colour = color) |> show_query()
diamonds_db |> relocate(x:z) |> show_query()
flights |> show_query()
planes |> show_query()
```
There are three other important clauses: `WHERE`, `ORDER BY`, and `GROUP BY`. `WHERE` and `ORDER BY` control which rows are included in the result and how they are ordered:
```{r}
flights |>
filter(dest == "IAH") |>
arrange(dep_delay) |>
show_query()
```
While `GROUP BY` works just like `dplyr::group_by()` causing aggregation to happen by group:
```{r}
flights |>
group_by(dest) |>
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
show_query()
```
There are two important differences between dplyr verbs and SELECT clauses:
- SQL, unlike R, is **case** **insensitive** so you can write `select`, `SELECT`, or even `SeLeCt`. In this book we'll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variables names.
- In SQL, order matters. Unlike dplyr, where you can call the verbs in whatever order makes the most sense to you, SQL clauses must come in a specific order: `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`. Confusingly, this order doesn't match how they are actually evaluated which is `FROM`, `WHERE`, `GROUP BY`, `SELECT`, `ORDER BY`.
The following sections will explore each clause in more detail.
### SELECT
`SELECT` is the workhorse of SQL queries, and is equivalent to `select()`, `mutate()`, `rename()`, `relocate()`, and, as you'll learn in the next section, `summarize()`.
`select()`, `rename()`, and `relocate()` have very direct translations to `SELECT` as they affect where a column appears (if at all) along with its name:
```{r}
flights |>
select(year:day, starts_with("dep")) |>
show_query()
flights |>
rename(tail_num = tailnum) |>
show_query()
flights |>
relocate(hour:minute, .after = day) |>
show_query()
```
This example also shows you have SQL does renaming.
In SQL terminology is called **aliasing** and is done with `AS`; just note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
The translations for `mutate()` are similarly straightforward.
We'll come back to the translation of individual components in @sec-sql-expressions.
```{r}
diamonds_db |> mutate(price_per_carat = price / carat) |> show_query()
```
### WHERE
`filter()` is translated to `WHERE`:
```{r}
diamonds_db |>
filter(carat > 1, colour == "J") |>
mutate(price_per_carat = price / carat) |>
show_query()
```
### GROUP BY
`SELECT` is also used for summaries when pared with `GROUP BY`:
When pared with `group_by()`, `summarise()` is also translated to `SELECT`:
```{r}
diamonds_db |>
group_by(cut) |>
summarise(
n = n(),
avg_price = mean(price)
avg_price = mean(price, na.rm = TRUE)
) |>
show_query()
```
Note the warning: unlike R, missing values (called `NULL` instead of `NA` in SQL) are not infectious in summary statistics.
We'll come back to this challenge a bit later in Section \@ref(sql-expressions).
We'll come back to the translations of `n()` and `mean()` in @sec-sql-expressions.
But it's no coincidence that `group_by()` is translated to `GROUP BY`: the SQL clause inspired the R function name.
###
### WHERE
`filter()` is translated to `WHERE`.
`|` becomes `OR` and `&` becomes `AND:`
```{r}
flights |>
filter(dest == "IAH" | dest == "HOU") |>
show_query()
flights |>
filter(arr_delay > 0 & arr_delay < 20) |>
show_query()
```
Note that SQL uses `=` for comparison, not `==`.
This is super annoying if you're switching between writing R code and SQL!
Another useful SQL function is `IN`, which is very close to R's `%in%`:
```{r}
flights |>
filter(dest %in% c("IAH", "HOU")) |>
show_query()
```
SQL doesn't have `NA`s, but instead has `NULL`s.
They behave very similarly to `NA`s, including their "infectious" properties.
```{r}
flights |>
filter(!is.na(dep_delay)) |>
show_query()
```
This SQL illustrates one of the drawbacks of dbplyr: it doesn't always generate the simplest SQL.
In this case, the parentheses are redundant and you could use the special form `IS NOT NULL` yielding:
``` sql
WHERE "dep_delay" IS NOT NULL
```
### ORDER BY
`arrange()` is translated to `ORDER BY`:
```{r}
diamonds_db |>
arrange(carat, desc(price)) |>
flights |>
arrange(year, month, day, desc(dep_delay)) |>
show_query()
```
And `desc()` becomes `DESC` --- and now you know the inspiration for the function name 😄.
Note that `desc()` becomes `DESC`; this is another R function whose named was directly inspired by SQL.
### Subqueries
@ -376,45 +437,22 @@ diamonds_db |>
```
Sometimes dbplyr uses a subquery where strictly speaking it's not necessary.
For example, take this pipeline that filters on a summary value:
```{r}
diamonds_db |>
group_by(cut) |>
summarise(
n = n(),
avg_price = mean(price)
) |>
filter(n > 10) |>
show_query()
```
In this case it's possible to use the special `HAVING` clause.
This is works the same way as `WHERE` except that it's applied *after* the aggregates have been computed, not before.
``` sql
SELECT "cut", COUNT(*) AS "n", AVG("price") AS "avg_price"
FROM "diamonds"
GROUP BY "cut"
HAVING "n" > 10.0
```
For example, take this pipeline that filters on a summary value.
### Joins
SQL joins are straightforward, but dbplyr's current translation requires spelling out
```{r}
flights |> inner_join(planes, by = "tailnum") |> show_query()
flights |> left_join(planes, by = "tailnum") |> show_query()
flights |> full_join(planes, by = "tailnum") |> show_query()
flights |>
left_join(planes, by = "tailnum") |>
show_query()
```
### Semi and anti-joins
SQL's syntax for semi- and anti-joins are a bit arcane.
I don't remember these and just google if I ever need the syntax outside of SQL.
Instead we'll create some dummy data:
```{r}
flights |> semi_join(planes, by = "tailnum") |> show_query()
flights |> anti_join(planes, by = "tailnum") |> show_query()
```
### Temporary data
@ -433,6 +471,12 @@ Most database will allow you to create temporary tables, even if you don't other
Rather than copying the data to the database, it builds SQL that generates the data inline.
It's useful if you don't have permission to create temporary tables, and is faster than `copy_to()` for small datasets.
### Other statements
in the case that you need to update your own database, you can solve most problems with `dbWriteTable()` and/or `dbInsertTable()`.
In fact, as a data scientist in most cases you won't even be able to run these statements because you only have read only access to the database.
This ensures that there's no way for you to accidentally mess things up.
## SQL expressions {#sec-sql-expressions}
https://dbplyr.tidyverse.org/articles/translation-function.html