Respond to feedback from twitter
parent 6408e00d93
commit d411ae3780
@@ -161,7 +161,7 @@ con |>
 
 `dbReadTable()` returns a `data.frame` so I use `as_tibble()` to convert it into a tibble so that it prints nicely.
 
-In real life, it's rare that you'll use `dbReadTable()` because the whole reason you're using a database is that there's too much data to fit in a data frame, and you want to use the database to bring back only a subset of the rows and columns.
+In real life, it's rare that you'll use `dbReadTable()` because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.
 
 ### Run a query {#sec-dbGetQuery}
 
@@ -169,13 +169,12 @@ The way you'll usually retrieve data is with `dbGetQuery()`.
 It takes a database connection and some SQL code and returns a data frame:
 
 ```{r}
-con |>
-dbGetQuery("
+sql <- "
 SELECT carat, cut, clarity, color, price
 FROM diamonds
 WHERE price > 15000
-") |>
-as_tibble()
+"
+as_tibble(dbGetQuery(con, sql))
 ```
 
 Don't worry if you've never seen SQL before; you'll learn more about it shortly.
@@ -194,15 +193,32 @@ Now that you've learned the low-level basics for connecting to a database and ru
 dbplyr is a dplyr **backend**, which means that you keep writing dplyr code but the backend executes it differently.
 In this, dbplyr translates to SQL; other backends include [dtplyr](https://dtplyr.tidyverse.org) which translates to [data.table](https://r-datatable.com), and [multidplyr](https://multidplyr.tidyverse.org) which executes your code on multiple cores.
 
-To use dbplyr, you must first use `tbl()` to create an object that represents a database table[^import-databases-4]:
+To use dbplyr, you must first use `tbl()` to create an object that represents a database table:
 
-[^import-databases-4]: If you want to mix SQL and dbplyr, you can also create a tbl from a SQL query with `tbl(con, sql("SELECT * FROM foo")).`
-
 ```{r}
 diamonds_db <- tbl(con, "diamonds")
 diamonds_db
 ```
 
+::: callout-note
+There are two other common ways to interact with a database.
+First, many corporate databases are very large so need some hierarchy to keep all the tables organised.
+In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:
+
+```{r}
+#| eval: false
+diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
+diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
+```
+
+Other times you might want to use your own SQL query as a starting point:
+
+```{r}
+#| eval: false
+diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))
+```
+:::
+
 This object is **lazy**; when you use dplyr verbs on it, dplyr doesn't do any work: it just records the sequence of operations that you want to perform and only performs them when needed.
 For example, take the following pipeline:
 
@@ -233,6 +249,9 @@ big_diamonds <- big_diamonds_db |>
 big_diamonds
 ```
 
+Typically, you'll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below.
+Then, once you're ready to analyse the data with functions that are unique to R, you'll `collect()` the data to get an in-memory tibble, and continue your work with pure R code.
+
 ## SQL
 
 The rest of the chapter will teach you a little SQL through the lens of dbplyr.
@@ -260,14 +279,14 @@ Common statements include `CREATE` for defining new tables, `INSERT` for adding
 We will focus on `SELECT` statements, also called **queries**, because they are almost exclusively what you'll use as a data scientist.
 
 A query is made up of **clauses**.
-There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-5] and `FROM`[^import-databases-6] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
+There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-4] and `FROM`[^import-databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
 . This is what dplyr generates for an unadulterated table
 :
 
-[^import-databases-5]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
+[^import-databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
 To avoid this confusion, we'll generally use query instead of `SELECT` statement.
 
-[^import-databases-6]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
+[^import-databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
 But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
 
 ```{r}
@@ -364,9 +383,9 @@ But only a handful of client packages, like duckdb, know what all the reserved w
 
 ### GROUP BY
 
-`group_by()` is translated to the `GROUP BY`[^import-databases-7] clause and `summarise()` is translated to the `SELECT` clause:
+`group_by()` is translated to the `GROUP BY`[^import-databases-6] clause and `summarise()` is translated to the `SELECT` clause:
 
-[^import-databases-7]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
+[^import-databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
 
 ```{r}
 diamonds_db |>
@@ -482,10 +501,10 @@ As dbplyr improves over time, these cases will get rarer but will probably never
 ### Joins
 
 If you're familiar with dplyr's joins, SQL joins are very similar.
-Unfortunately, dbplyr's current translations are rather verbose[^import-databases-8].
+Unfortunately, dbplyr's current translations are rather verbose[^import-databases-7].
 Here's a simple example:
 
-[^import-databases-8]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃
+[^import-databases-7]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃
 
 ```{r}
 flights |>