Update databases.qmd (#1068)

Fixed a bunch of errors/typos. I am so glad the second version of the book provides such a nicely written chapter on databases.
This commit is contained in:
Y. Yu 2022-08-16 12:37:13 -04:00 committed by GitHub
parent d080f3279c
commit 5ac3dac6bd
1 changed files with 12 additions and 12 deletions


@ -91,7 +91,7 @@ con <- DBI::dbConnect(
)
```
The precise details of the connection varies a lot from DBMS to DBMS so unfortunately we can't cover all the details here.
The precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can't cover all the details here.
This means you'll need to do a little research on your own.
Typically you can ask the other data scientists in your team or talk to your DBA (**d**ata**b**ase **a**dministrator).
The initial setup will often take a little fiddling (and maybe some googling) to get right, but you'll generally only need to do it once.
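For a client-server DBMS, the connection call typically bundles the driver with the details your DBA gives you. A sketch (the driver package shown here is RPostgres, and the hostname and port are placeholders, not real values):

```{r}
#| eval: false
con <- DBI::dbConnect(
  RPostgres::Postgres(),
  # Placeholder connection details — get the real ones from your DBA
  hostname = "databases.mycompany.com",
  port = 1234
)
```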
@ -112,7 +112,7 @@ con <- DBI::dbConnect(duckdb::duckdb())
duckdb is a high-performance database that's designed very much for the needs of a data scientist.
We use it here because it's very easy to get started with, but it's also capable of handling gigabytes of data with great speed.
If you want to use duckdb for a real data analysis project, you'll also need to supply the `dbdir` argument to make a persistent database and tell duckdb where to save it.
Assuming you're using a project (Chapter -@sec-workflow-scripts-projects)), it's reasonable to store it in the `duckdb` directory of the current project:
Assuming you're using a project (@sec-workflow-scripts-projects), it's reasonable to store it in the `duckdb` directory of the current project:
```{r}
#| eval: false
@ -122,7 +122,7 @@ con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
### Load some data {#sec-load-data}
Since this is a new database, we need to start by adding some data.
Here we'll use add `mpg` and `diamonds` datasets from ggplot2 using `DBI::dbWriteTable()`.
Here we'll add `mpg` and `diamonds` datasets from ggplot2 using `DBI::dbWriteTable()`.
The simplest usage of `dbWriteTable()` needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.
```{r}
@ -131,11 +131,11 @@ dbWriteTable(con, "diamonds", ggplot2::diamonds)
```
If you're using duckdb in a real project, we highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it in to R.
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
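For example, `duckdb_read_csv()` lets duckdb parse a CSV file itself rather than routing it through R first. A sketch (the file path is a placeholder):

```{r}
#| eval: false
# Load a CSV straight into a duckdb table, bypassing R entirely
duckdb::duckdb_read_csv(con, "flights", "flights.csv")
```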
## DBI basics
Now that we've connected to a database with some data in it, lets perform some basic operations with DBI.
Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.
### What's there?
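A sketch of the two standard DBI functions you'll reach for first when exploring an unfamiliar database:

```{r}
#| eval: false
# List all tables in the database
dbListTables(con)

# Pull one table back into R as a data frame
dbReadTable(con, "mpg")
```

`dbReadTable()` retrieves the whole table, so for large tables you'll usually prefer a lazy dbplyr reference instead.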
@ -201,8 +201,8 @@ diamonds_db
```
::: callout-note
There are two other common way to a database.
First, many corporate databases are very large so need some hierarchy to keep all the tables organised.
There are two other common ways to interact with a database.
First, many corporate databases are very large so you need some hierarchy to keep all the tables organised.
In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:
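A sketch using dbplyr's `in_schema()` and `in_catalog()` helpers (the catalog, schema, and table names here are made up):

```{r}
#| eval: false
# Table lives inside a schema
diamonds_db <- tbl(con, in_schema("sales", "diamonds"))

# Table lives inside a schema inside a catalog
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
```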
```{r}
@ -233,7 +233,7 @@ big_diamonds_db
You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn't know the number of rows.
This is because finding the total number of rows usually requires executing the complete query, something we're trying to avoid.
You can see the SQL the dbplyr generates by a dbplyr query by calling `show_query()`:
You can see the SQL generated by dbplyr by calling `show_query()`:
```{r}
big_diamonds_db |>
@ -259,7 +259,7 @@ It's a rather non-traditional introduction to SQL but we hope it will get you qu
Luckily, if you understand dplyr you're in a great place to quickly pick up SQL because so many of the concepts are the same.
We'll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: `flights` and `planes`.
These dataset are easy to get into our learning database because dbplyr has a function designed for this exact scenario:
These datasets are easy to get into our learning database because dbplyr has a function designed for this exact scenario:
```{r}
dbplyr::copy_nycflights13(con)
@ -280,7 +280,7 @@ We will focus on `SELECT` statements, also called **queries**, because they a
A query is made up of **clauses**.
There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^databases-4] and `FROM`[^databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
. This is what dplyr generates for an unadulterated table
. This is what dbplyr generates for an unadulterated table
:
[^databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
@ -350,7 +350,7 @@ planes |>
This example also shows you how SQL does renaming.
In SQL terminology renaming is called **aliasing** and is done with `AS`.
Note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
Note that unlike `mutate()`, the old name is on the left and the new name is on the right.
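A sketch of this, assuming `planes` is the lazy table reference created earlier; in dplyr's `rename()` the new name is on the left, but in the SQL it generates the old name appears on the left of `AS`:

```{r}
#| eval: false
planes |>
  rename(year_built = year) |>
  show_query()
```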
::: callout-note
In the examples above note that `"year"` and `"type"` are wrapped in double quotes.
@ -578,7 +578,7 @@ So far we've focused on the big picture of how dplyr verbs are translated to the
Now we're going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use `mean(x)` in a `summarize()`?
To help see what's going on, we'll use a couple of little helper functions that run a `summarise()` or `mutate()` and show the generated SQL.
That will make it a easier to explore a few variations and see how summaries and transformations can differ.
That will make it a little easier to explore a few variations and see how summaries and transformations can differ.
```{r}
summarize_query <- function(df, ...) {