<section data-type="chapter" id="chp-databases">
<h1><span id="sec-import-databases" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Databases</span></span></h1>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>A huge amount of data lives in databases, so it’s essential that you know how to access it. Sometimes you can ask someone to download a snapshot into a <code>.csv</code> for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.</p>
<p>In this chapter, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL<span data-type="footnote">SQL is either pronounced “s”-“q”-“l” or “sequel”.</span> query. <strong>SQL</strong>, short for <strong>s</strong>tructured <strong>q</strong>uery <strong>l</strong>anguage, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we’re not going to start with SQL, but instead we’ll teach you dbplyr, which can translate your dplyr code to SQL. We’ll use that as a way to teach you some of the most important features of SQL. You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)
library(dbplyr)
library(tidyverse)</pre>
</div>
</section>
</section>
<section id="database-basics" data-type="sect1">
<h1>
Database basics</h1>
<p>At the simplest level, you can think about a database as a collection of data frames, called <strong>tables</strong> in database terminology. Like a data.frame, a database table is a collection of named columns, where every value in the column is the same type. There are three high-level differences between data frames and database tables:</p>
<ul><li><p>Database tables are stored on disk and can be arbitrarily large. Data frames are stored in memory, and are fundamentally limited (although that limit is still plenty large for many problems).</p></li>
<li><p>Database tables almost always have indexes. Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row. Data frames and tibbles don’t have indexes, but data.tables do, which is one of the reasons that they’re so fast.</p></li>
<li><p>Most classical databases are optimized for rapidly collecting data, not analyzing existing data. These databases are called <strong>row-oriented</strong> because the data is stored row-by-row, rather than column-by-column like R. More recently, there’s been much development of <strong>column-oriented</strong> databases that make analyzing the existing data much faster.</p></li>
</ul><p>Databases are run by database management systems (<strong>DBMS</strong>s for short), which come in three basic forms:</p>
<ul><li>
<strong>Client-server</strong> DBMSs run on a powerful central server, which you connect to from your computer (the client). They are great for sharing data with multiple people in an organisation. Popular client-server DBMSs include PostgreSQL, MariaDB, SQL Server, and Oracle.</li>
<li>
<strong>Cloud</strong> DBMSs, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client-server DBMSs, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.</li>
<li>
<strong>In-process</strong> DBMSs, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user.</li>
</ul></section>
<section id="connecting-to-a-database" data-type="sect1">
<h1>
Connecting to a database</h1>
<p>To connect to the database from R, you’ll use a pair of packages:</p>
<ul><li><p>You’ll always use DBI (<strong>d</strong>ata<strong>b</strong>ase <strong>i</strong>nterface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.</p></li>
<li><p>You’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres for Postgres and RMariaDB for MySQL.</p></li>
</ul><p>If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMSs. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.</p>
<p>Concretely, you create a database connection using <code><a href="https://dbi.r-dbi.org/reference/dbConnect.html">DBI::dbConnect()</a></code>. The first argument selects the DBMS<span data-type="footnote">Typically, this is the only function you’ll use from the client package, so we recommend using <code>::</code> to pull out that one function, rather than loading the complete package with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>.</span>, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con &lt;- DBI::dbConnect(
  RMariaDB::MariaDB(),
  username = "foo"
)
con &lt;- DBI::dbConnect(
  RPostgres::Postgres(),
  host = "databases.mycompany.com",
  port = 1234
)</pre>
</div>
<p>The precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can’t cover all the details here. This means you’ll need to do a little research on your own. Typically you can ask the other data scientists in your team or talk to your DBA (<strong>d</strong>ata<strong>b</strong>ase <strong>a</strong>dministrator). The initial setup will often take a little fiddling (and maybe some googling) to get right, but you’ll generally only need to do it once.</p>
<section id="in-this-book" data-type="sect2">
<h2>
In this book</h2>
<p>Setting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb. Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.</p>
<p>Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con &lt;- DBI::dbConnect(duckdb::duckdb())</pre>
</div>
<p>duckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed. If you want to use duckdb for a real data analysis project, you’ll also need to supply the <code>dbdir</code> argument to make a persistent database and tell duckdb where to save it. Assuming you’re using a project (<a href="#chp-workflow-scripts" data-type="xref">#chp-workflow-scripts</a>), it’s reasonable to store it in the <code>duckdb</code> directory of the current project:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con &lt;- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")</pre>
</div>
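<p>To see what “persistent” means in practice, here is a minimal sketch that writes a table, disconnects, and reconnects to the same file. The temporary file path and the <code>con_file</code> name are assumptions made for this illustration (a separate name keeps the chapter’s <code>con</code> untouched), and <code>shutdown = TRUE</code> is the duckdb-specific argument to <code>dbDisconnect()</code> that also closes the database instance:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

# Hypothetical file location, used only for this illustration
db_path &lt;- tempfile(fileext = ".duckdb")

# First connection: create the database file and store a table in it
con_file &lt;- dbConnect(duckdb::duckdb(), dbdir = db_path)
dbWriteTable(con_file, "mtcars", mtcars)
dbDisconnect(con_file, shutdown = TRUE)

# Second connection: the same file still contains the table
con_file &lt;- dbConnect(duckdb::duckdb(), dbdir = db_path)
table_survived &lt;- dbExistsTable(con_file, "mtcars")
dbDisconnect(con_file, shutdown = TRUE)

table_survived
#&gt; [1] TRUE</pre>
</div>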
</section>
<section id="sec-load-data" data-type="sect2">
<h2>
Load some data</h2>
<p>Since this is a new database, we need to start by adding some data. Here we’ll add the <code>mpg</code> and <code>diamonds</code> datasets from ggplot2 using <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">DBI::dbWriteTable()</a></code>. The simplest usage of <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)</pre>
</div>
<p>If you’re using duckdb in a real project, we highly recommend learning about <code>duckdb_read_csv()</code> and <code>duckdb_register_arrow()</code>. These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.</p>
<p>We’ll also show off a useful technique for loading multiple files into a database in <a href="#sec-save-database" data-type="xref">#sec-save-database</a>.</p>
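<p>As a rough sketch of what this looks like, the following code hands a CSV file to duckdb and lets it create the table. The file is a throwaway copy of <code>mtcars</code> written only so the example is self-contained, and the <code>con_csv</code> connection and <code>cars</code> table name are made up for the illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

# A small temporary CSV that stands in for a real data file
csv_path &lt;- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

con_csv &lt;- dbConnect(duckdb::duckdb())

# Arguments: connection, name of the table to create, path to the file
duckdb::duckdb_read_csv(con_csv, "cars", csv_path)

n_cars &lt;- dbGetQuery(con_csv, "SELECT COUNT(*) AS n FROM cars")
dbDisconnect(con_csv, shutdown = TRUE)</pre>
</div>
<p>Here <code>n_cars$n</code> should match <code>nrow(mtcars)</code>, which is a quick way to check that every row arrived.</p>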
</section>
</section>
<section id="dbi-basics" data-type="sect1">
<h1>
DBI basics</h1>
<p>Now that we’ve connected to a database with some data in it, let’s perform some basic operations with DBI.</p>
<section id="whats-there" data-type="sect2">
<h2>
What’s there?</h2>
<p>The most important database objects for data scientists are tables. DBI provides two useful functions to either list all the tables in the database<span data-type="footnote">At least, all the tables that you have permission to see.</span> or to check if a specific table already exists:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">dbListTables(con)
#&gt; [1] "diamonds" "mpg"
dbExistsTable(con, "foo")
#&gt; [1] FALSE</pre>
</div>
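<p>One common use of <code>dbExistsTable()</code> is to guard a write so that a script can be re-run safely. The sketch below uses its own throwaway connection (<code>con_demo</code>, a name chosen for this example) and also calls <code>dbListFields()</code>, another DBI function, to list the columns of the table once it exists:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

con_demo &lt;- dbConnect(duckdb::duckdb())

# Only create the table if it is not already there
if (!dbExistsTable(con_demo, "mpg")) {
  dbWriteTable(con_demo, "mpg", ggplot2::mpg)
}

# Column names of the table, as stored in the database
fields &lt;- dbListFields(con_demo, "mpg")

dbDisconnect(con_demo, shutdown = TRUE)</pre>
</div>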
</section>
<section id="extract-some-data" data-type="sect2">
<h2>
Extract some data</h2>
<p>Once you’ve determined a table exists, you can retrieve it with <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con |&gt;
  dbReadTable("diamonds") |&gt;
  as_tibble()
#&gt; # A tibble: 53,940 × 10
#&gt;   carat cut       color clarity depth table price     x     y     z
#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt; &lt;fct&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#&gt; 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#&gt; 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#&gt; 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#&gt; 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#&gt; 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#&gt; # … with 53,934 more rows</pre>
</div>
<p><code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code> returns a <code>data.frame</code> so we use <code><a href="https://tibble.tidyverse.org/reference/as_tibble.html">as_tibble()</a></code> to convert it into a tibble so that it prints nicely.</p>
<p>In real life, it’s rare that you’ll use <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code> because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.</p>
</section>
<section id="sec-dbGetQuery" data-type="sect2">
<h2>
Run a query</h2>
<p>The way you’ll usually retrieve data is with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code>. It takes a database connection and some SQL code and returns a data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sql &lt;- "
  SELECT carat, cut, clarity, color, price
  FROM diamonds
  WHERE price &gt; 15000
"
as_tibble(dbGetQuery(con, sql))
#&gt; # A tibble: 1,655 × 5
#&gt;   carat cut       clarity color price
#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt;   &lt;fct&gt; &lt;int&gt;
#&gt; 1  1.54 Premium   VS2     E     15002
#&gt; 2  1.19 Ideal     VVS1    F     15005
#&gt; 3  2.1  Premium   SI1     I     15007
#&gt; 4  1.69 Ideal     SI1     D     15011
#&gt; 5  1.5  Very Good VVS2    G     15013
#&gt; 6  1.73 Very Good VS1     G     15014
#&gt; # … with 1,649 more rows</pre>
</div>
<p>Don’t worry if you’ve never seen SQL before; you’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where <code>price</code> is greater than 15,000.</p>
<p>You’ll need to be a little careful with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> since it can potentially return more data than you have memory. We won’t discuss it further here, but if you’re dealing with very large datasets it’s possible to deal with a “page” of data at a time by using <code><a href="https://dbi.r-dbi.org/reference/dbSendQuery.html">dbSendQuery()</a></code> to get a “result set” which you can page through by calling <code><a href="https://dbi.r-dbi.org/reference/dbFetch.html">dbFetch()</a></code> until <code><a href="https://dbi.r-dbi.org/reference/dbHasCompleted.html">dbHasCompleted()</a></code> returns <code>TRUE</code>.</p>
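<p>For the curious, the paging loop looks roughly like the sketch below. The page size of 500 and the <code>con_demo</code> connection are arbitrary choices for the illustration, and how much memory paging actually saves depends on the driver you’re using. Note the call to <code>dbClearResult()</code>, which releases the result set once you’re done with it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

con_demo &lt;- dbConnect(duckdb::duckdb())
dbWriteTable(con_demo, "diamonds", ggplot2::diamonds)

# Send the query without fetching any rows yet
res &lt;- dbSendQuery(
  con_demo,
  "SELECT carat, price FROM diamonds WHERE price &gt; 15000"
)

n_rows &lt;- 0
while (!dbHasCompleted(res)) {
  page &lt;- dbFetch(res, n = 500) # at most 500 rows per page
  # ... process `page` here; this sketch only counts the rows
  n_rows &lt;- n_rows + nrow(page)
}

dbClearResult(res)
dbDisconnect(con_demo, shutdown = TRUE)</pre>
</div>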
</section>
<section id="other-functions" data-type="sect2">
<h2>
Other functions</h2>
<p>There are lots of other functions in DBI that you might find useful if you’re managing your own data (like <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> which we used in <a href="#sec-load-data" data-type="xref">#sec-load-data</a>), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.</p>
</section>
</section>
<section id="dbplyr-basics" data-type="sect1">
<h1>
dbplyr basics</h1>
<p>Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up a bit and learn about dbplyr. dbplyr is a dplyr <strong>backend</strong>, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates to SQL; other backends include <a href="https://dtplyr.tidyverse.org">dtplyr</a> which translates to <a href="https://r-datatable.com">data.table</a>, and <a href="https://multidplyr.tidyverse.org">multidplyr</a> which executes your code on multiple cores.</p>
<p>To use dbplyr, you must first use <code><a href="https://dplyr.tidyverse.org/reference/tbl.html">tbl()</a></code> to create an object that represents a database table:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, "diamonds")
diamonds_db
#&gt; # Source:   table&lt;diamonds&gt; [?? x 10]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt;   carat cut       color clarity depth table price     x     y     z
#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt; &lt;fct&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#&gt; 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#&gt; 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#&gt; 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#&gt; 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#&gt; 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#&gt; # … with more rows</pre>
</div>
<div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
</div>
<p>This object is <strong>lazy</strong>; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">big_diamonds_db &lt;- diamonds_db |&gt;
  filter(price &gt; 15000) |&gt;
  select(carat:clarity, price)
big_diamonds_db
#&gt; # Source:   SQL [?? x 5]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt;   carat cut       color clarity price
#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt; &lt;fct&gt;   &lt;int&gt;
#&gt; 1  1.54 Premium   E     VS2     15002
#&gt; 2  1.19 Ideal     F     VVS1    15005
#&gt; 3  2.1  Premium   I     SI1     15007
#&gt; 4  1.69 Ideal     D     SI1     15011
#&gt; 5  1.5  Very Good G     VVS2    15013
#&gt; 6  1.73 Very Good G     VS1     15014
#&gt; # … with more rows</pre>
</div>
<p>You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.</p>
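<p>If you do need the row count, one option is to ask the database to compute it, so that only a single number travels back to R. The following sketch uses a separate connection (<code>con_demo</code>, a name made up for this example) so that it can be run on its own:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

con_demo &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con_demo, "diamonds", ggplot2::diamonds)

big_db &lt;- tbl(con_demo, "diamonds") |&gt;
  filter(price &gt; 15000)

# The lazy object does not know how many rows it has
n_lazy &lt;- nrow(big_db)
n_lazy
#&gt; [1] NA

# Let the database do the counting and pull out the single value
n_big &lt;- big_db |&gt;
  summarize(n = n()) |&gt;
  pull(n)

DBI::dbDisconnect(con_demo, shutdown = TRUE)</pre>
</div>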
<p>You can see the SQL code generated by the dbplyr function <code><a href="https://dplyr.tidyverse.org/reference/explain.html">show_query()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">big_diamonds_db |&gt;
  show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT carat, cut, color, clarity, price
#&gt; FROM diamonds
#&gt; WHERE (price &gt; 15000.0)</pre>
</div>
<p>To get all the data back into R, you call <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>. Behind the scenes, this generates the SQL, calls <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> to get the data, then turns the result into a tibble:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">big_diamonds &lt;- big_diamonds_db |&gt;
  collect()
big_diamonds
#&gt; # A tibble: 1,655 × 5
#&gt;   carat cut       color clarity price
#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt; &lt;fct&gt;   &lt;int&gt;
#&gt; 1  1.54 Premium   E     VS2     15002
#&gt; 2  1.19 Ideal     F     VVS1    15005
#&gt; 3  2.1  Premium   I     SI1     15007
#&gt; 4  1.69 Ideal     D     SI1     15011
#&gt; 5  1.5  Very Good G     VVS2    15013
#&gt; 6  1.73 Very Good G     VS1     15014
#&gt; # … with 1,649 more rows</pre>
</div>
<p>Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code> the data to get an in-memory tibble, and continue your work with pure R code.</p>
</section>
<section id="sql" data-type="sect1">
<h1>
SQL</h1>
<p>The rest of the chapter will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.</p>
<p>We’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: <code>flights</code> and <code>planes</code>. These datasets are easy to get into our learning database because dbplyr has a function designed for this exact scenario:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">dbplyr::copy_nycflights13(con)
#&gt; Creating table: airlines
#&gt; Creating table: airports
#&gt; Creating table: flights
#&gt; Creating table: planes
#&gt; Creating table: weather
flights &lt;- tbl(con, "flights")
planes &lt;- tbl(con, "planes")</pre>
</div>
<div class="cell">
</div>
<section id="sql-basics" data-type="sect2">
<h2>
SQL basics</h2>
<p>The top-level components of SQL are called <strong>statements</strong>. Common statements include <code>CREATE</code> for defining new tables, <code>INSERT</code> for adding data, and <code>SELECT</code> for retrieving data. We will focus on <code>SELECT</code> statements, also called <strong>queries</strong>, because they are almost exclusively what you’ll use as a data scientist.</p>
<p>A query is made up of <strong>clauses</strong>. There are five important clauses: <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>ORDER BY</code>, and <code>GROUP BY</code>. Every query must have the <code>SELECT</code><span data-type="footnote">Confusingly, depending on the context, <code>SELECT</code> is either a statement or a clause. To avoid this confusion, we’ll generally use query instead of <code>SELECT</code> statement.</span> and <code>FROM</code><span data-type="footnote">Ok, technically, only the <code>SELECT</code> is required, since you can write queries like <code>SELECT 1+1</code> to perform basic calculations. But if you want to work with data (as you always do!) you’ll also need a <code>FROM</code> clause.</span> clauses and the simplest query is <code>SELECT * FROM table</code>, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
planes |&gt; show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM planes</pre>
</div>
<p><code>WHERE</code> and <code>ORDER BY</code> control which rows are included and how they are ordered:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  filter(dest == "IAH") |&gt;
  arrange(dep_delay) |&gt;
  show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (dest = 'IAH')
#&gt; ORDER BY dep_delay</pre>
</div>
<p><code>GROUP BY</code> converts the query to a summary, causing aggregation to happen:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(dest) |&gt;
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |&gt;
  show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT dest, AVG(dep_delay) AS dep_delay
#&gt; FROM flights
#&gt; GROUP BY dest</pre>
</div>
<p>There are two important differences between dplyr verbs and SELECT clauses:</p>
<ul><li>In SQL, case doesn’t matter: you can write <code>select</code>, <code>SELECT</code>, or even <code>SeLeCt</code>. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.</li>
<li>In SQL, order matters: you must always write the clauses in the order <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>GROUP BY</code>, <code>ORDER BY</code>. Confusingly, this order doesn’t match how the clauses are actually evaluated, which is first <code>FROM</code>, then <code>WHERE</code>, <code>GROUP BY</code>, <code>SELECT</code>, and <code>ORDER BY</code>.</li>
</ul><p>The following sections explore each clause in more detail.</p>
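<p>To make the two orderings concrete, here is a sketch that asks the same question twice: once as hand-written SQL that uses all five clauses in the order SQL requires, and once as a dplyr pipeline, which reads in roughly the order the clauses are evaluated. The <code>con_demo</code> connection and the particular question (average price by cut for diamonds over one carat) are choices made for this illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

con_demo &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con_demo, "diamonds", ggplot2::diamonds)

# Hand-written SQL: SELECT, FROM, WHERE, GROUP BY, ORDER BY
by_hand &lt;- DBI::dbGetQuery(con_demo, "
  SELECT cut, AVG(price) AS avg_price
  FROM diamonds
  WHERE carat &gt; 1
  GROUP BY cut
  ORDER BY avg_price DESC
")

# The same request with dplyr verbs
by_dplyr &lt;- tbl(con_demo, "diamonds") |&gt;
  filter(carat &gt; 1) |&gt;
  group_by(cut) |&gt;
  summarize(avg_price = mean(price, na.rm = TRUE)) |&gt;
  arrange(desc(avg_price)) |&gt;
  collect()

DBI::dbDisconnect(con_demo, shutdown = TRUE)

# Both approaches should give the same averages in the same order
all.equal(by_hand$avg_price, by_dplyr$avg_price)</pre>
</div>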
</section>
<section id="select" data-type="sect2">
<h2>
SELECT</h2>
<p>The <code>SELECT</code> clause is the workhorse of queries and performs the same job as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and, as you’ll learn in the next section, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>.</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> have very direct translations to <code>SELECT</code> as they just affect where a column appears (if at all) along with its name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">planes |&gt;
  select(tailnum, type, manufacturer, model, year) |&gt;
  show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT tailnum, "type", manufacturer, model, "year"
#&gt; FROM planes
planes |&gt;
  select(tailnum, type, manufacturer, model, year) |&gt;
  rename(year_built = year) |&gt;
  show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT tailnum, "type", manufacturer, model, "year" AS year_built
#&gt; FROM planes
planes |&gt;
  select(tailnum, type, manufacturer, model, year) |&gt;
  relocate(manufacturer, model, .before = type) |&gt;
  show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT tailnum, manufacturer, model, "type", "year"
#&gt; FROM planes</pre>
</div>
<p>This example also shows you how SQL does renaming. In SQL terminology renaming is called <strong>aliasing</strong> and is done with <code>AS</code>. Note that unlike <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, the old name is on the left and the new name is on the right.</p>
<div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. Thats because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases youre likely to see every variable name quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre></div>
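<p>You can see these differences between databases without a live connection. The sketch below uses dbplyr’s simulated connections and <code>translate_sql()</code> to translate the same R expression for two backends; the column names <code>x</code> and <code>y</code> are placeholders, and the exact SQL printed may vary with your dbplyr version:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dbplyr)

# Simulated connections: no real database is needed to see a translation.
pg &lt;- translate_sql(paste0(x, y), con = simulate_postgres())
lite &lt;- translate_sql(paste0(x, y), con = simulate_sqlite())

# The same R expression, rendered for two different databases.
pg
lite</pre>
</div>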
<p>The translations for <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
speed = distance / (air_time / 60)
) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *, distance / (air_time / 60.0) AS speed
#&gt; FROM flights</pre>
</div>
<p>We’ll come back to the translation of individual components (like <code>/</code>) in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p>
</section>
<section id="from" data-type="sect2">
<h2>
FROM</h2>
<p>The <code>FROM</code> clause defines the data source. It’s going to be rather uninteresting for a little while, because we’re just using single tables. You’ll see more complex examples once we hit the join functions.</p>
</section>
<section id="group-by" data-type="sect2">
<h2>
GROUP BY</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> is translated to the <code>SELECT</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db |&gt;
group_by(cut) |&gt;
summarize(
n = n(),
avg_price = mean(price, na.rm = TRUE)
) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT cut, COUNT(*) AS n, AVG(price) AS avg_price
#&gt; FROM diamonds
#&gt; GROUP BY cut</pre>
</div>
<p>We’ll come back to what’s happening with the translation of <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p>
</section>
<section id="where" data-type="sect2">
<h2>
WHERE</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is translated to the <code>WHERE</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dest == "IAH" | dest == "HOU") |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (dest = 'IAH' OR dest = 'HOU')
flights |&gt;
filter(arr_delay &gt; 0 &amp; arr_delay &lt; 20) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (arr_delay &gt; 0.0 AND arr_delay &lt; 20.0)</pre>
</div>
<p>There are a few important details to note here:</p>
<ul><li>
<code>|</code> becomes <code>OR</code> and <code>&amp;</code> becomes <code>AND</code>.</li>
<li>SQL uses <code>=</code> for comparison, not <code>==</code>. SQL doesn’t have assignment, so there’s no potential for confusion there.</li>
<li>SQL uses only <code>''</code> for strings, not <code>""</code>. In SQL, <code>""</code> is used to identify variables, like R’s <code>``</code>.</li>
</ul><p>Another useful SQL operator is <code>IN</code>, which is very close to R’s <code>%in%</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dest %in% c("IAH", "HOU")) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (dest IN ('IAH', 'HOU'))</pre>
</div>
<p>SQL uses <code>NULL</code> instead of <code>NA</code>. <code>NULL</code>s behave similarly to <code>NA</code>s. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
summarize(delay = mean(arr_delay))
#&gt; Warning: Missing values are always removed in SQL aggregation functions.
#&gt; Use `na.rm = TRUE` to silence this warning
#&gt; This warning is displayed once every 8 hours.
#&gt; # Source: SQL [?? x 2]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; dest delay
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 ATL 11.3
#&gt; 2 ORD 5.88
#&gt; 3 RDU 10.1
#&gt; 4 IAD 13.9
#&gt; 5 DTW 5.43
#&gt; 6 LAX 0.547
#&gt; # … with more rows</pre>
</div>
<p>If you want to learn more about how <code>NULL</code>s work, you might enjoy “<a href="https://modern-sql.com/concept/three-valued-logic"><em>Three valued logic</em></a>” by Markus Winand.</p>
<p>In general, you can work with <code>NULL</code>s using the functions you’d use for <code>NA</code>s in R:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(!is.na(dep_delay)) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (NOT((dep_delay IS NULL)))</pre>
</div>
<p>This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:</p>
<pre data-type="programlisting" data-code-language="sql">WHERE "dep_delay" IS NOT NULL</pre>
<p>Note that if you <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you created using a summarize, dbplyr will generate a <code>HAVING</code> clause, rather than a <code>WHERE</code> clause. This is one of the idiosyncrasies of SQL: <code>WHERE</code> is evaluated before <code>SELECT</code>, so SQL needs another clause that’s evaluated afterwards.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db |&gt;
group_by(cut) |&gt;
summarize(n = n()) |&gt;
filter(n &gt; 100) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT cut, COUNT(*) AS n
#&gt; FROM diamonds
#&gt; GROUP BY cut
#&gt; HAVING (COUNT(*) &gt; 100.0)</pre>
</div>
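<p>To make the contrast concrete, here is a small sketch that applies <code>filter()</code> once before and once after summarizing. It uses <code>lazy_frame()</code> with a simulated connection, so no database is needed; the two columns are an invented stand-in for <code>diamonds</code>, and whether the second query uses <code>HAVING</code> or a subquery may depend on your dbplyr version:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(dbplyr)

# lazy_frame() describes a table without connecting to a database;
# these columns are a made-up stand-in for diamonds.
diamonds_lf &lt;- lazy_frame(cut = "Ideal", price = 500, con = simulate_postgres())

# Filter before summarizing: the condition is on an existing column.
before &lt;- diamonds_lf |&gt;
  filter(price &gt; 1000) |&gt;
  group_by(cut) |&gt;
  summarize(n = n())

# Filter after summarizing: the condition is on the computed count.
after &lt;- diamonds_lf |&gt;
  group_by(cut) |&gt;
  summarize(n = n()) |&gt;
  filter(n &gt; 100)

before |&gt; show_query()
after |&gt; show_query()</pre>
</div>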
</section>
<section id="order-by" data-type="sect2">
<h2>
ORDER BY</h2>
<p>Ordering rows involves a straightforward translation from <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> to the <code>ORDER BY</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
arrange(year, month, day, desc(dep_delay)) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; ORDER BY "year", "month", "day", dep_delay DESC</pre>
</div>
<p>Notice how <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> is translated to <code>DESC</code>: this is one of the many dplyr functions whose name was directly inspired by SQL.</p>
</section>
<section id="subqueries" data-type="sect2">
<h2>
Subqueries</h2>
<p>Sometimes it’s not possible to translate a dplyr pipeline into a single <code>SELECT</code> statement and you need to use a subquery. A <strong>subquery</strong> is just a query used as a data source in the <code>FROM</code> clause, instead of the usual table.</p>
<p>dbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the <code>SELECT</code> clause can’t refer to columns that were just created. That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes <code>year1</code> and then the second (outer) query can compute <code>year2</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
year1 = year + 1,
year2 = year1 + 1
) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *, year1 + 1.0 AS year2
#&gt; FROM (
#&gt; SELECT *, "year" + 1.0 AS year1
#&gt; FROM flights
#&gt; ) q01</pre>
</div>
<p>You’ll also see this if you attempt to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you just created. Remember, even though <code>WHERE</code> is written after <code>SELECT</code>, it’s evaluated before it, so we need a subquery in this (silly) example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(year1 = year + 1) |&gt;
filter(year1 == 2014) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM (
#&gt; SELECT *, "year" + 1.0 AS year1
#&gt; FROM flights
#&gt; ) q01
#&gt; WHERE (year1 = 2014.0)</pre>
</div>
<p>Sometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.</p>
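<p>If you want to check how many query levels a pipeline produces, one rough approach is to count the <code>SELECT</code> statements in the rendered SQL. The sketch below assumes a one-column stand-in for <code>flights</code> and a simulated connection; <code>count_selects()</code> is a hypothetical helper written for this example, not part of dbplyr:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(dbplyr)

# A one-column stand-in for flights; no database connection is needed.
flights_lf &lt;- lazy_frame(year = 2013, con = simulate_postgres())

# year2 refers to the just-created year1, so two query levels are needed.
two_step &lt;- flights_lf |&gt;
  mutate(year1 = year + 1, year2 = year1 + 1)

# Here both expressions refer only to existing columns.
one_step &lt;- flights_lf |&gt;
  mutate(year1 = year + 1, year2 = year + 2)

# Hypothetical helper: count the SELECT keywords in the rendered SQL.
count_selects &lt;- function(query) {
  sql_text &lt;- as.character(sql_render(query))
  lengths(regmatches(sql_text, gregexpr("SELECT", sql_text)))
}

count_selects(two_step)
count_selects(one_step)</pre>
</div>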
</section>
<section id="joins" data-type="sect2">
<h2>
Joins</h2>
<p>If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
left_join(planes |&gt; rename(year_built = year), by = "tailnum") |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; flights.*,
#&gt; planes."year" AS year_built,
#&gt; "type",
#&gt; manufacturer,
#&gt; model,
#&gt; engines,
#&gt; seats,
#&gt; speed,
#&gt; engine
#&gt; FROM flights
#&gt; LEFT JOIN planes
#&gt; ON (flights.tailnum = planes.tailnum)</pre>
</div>
<p>The main thing to notice here is the syntax: SQL joins use sub-clauses of the <code>FROM</code> clause to bring in additional tables, using <code>ON</code> to define how the tables are related.</p>
<p>dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>:</p>
<pre data-type="programlisting" data-code-language="sql">SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
INNER JOIN planes ON (flights.tailnum = planes.tailnum)
SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
RIGHT JOIN planes ON (flights.tailnum = planes.tailnum)
SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
FULL JOIN planes ON (flights.tailnum = planes.tailnum)</pre>
<p>You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place, and to build a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the <a href="https://cynkra.github.io/dm/">dm package</a>, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.</p>
</section>
<section id="other-verbs" data-type="sect2">
<h2>
Other verbs</h2>
<p>dbplyr also translates other verbs like <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code>slice_*()</code>, and <code><a href="https://generics.r-lib.org/reference/setops.html">intersect()</a></code>, and a growing selection of tidyr functions like <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>. The easiest way to see the full set of what’s currently available is to visit the dbplyr website: <a href="https://dbplyr.tidyverse.org/reference/" class="uri">https://dbplyr.tidyverse.org/reference/</a>.</p>
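<p>You can explore these translations the same way as the verbs above. As a small sketch, here is <code>intersect()</code> applied to two lazy tables; the table contents and the names <code>jan</code> and <code>feb</code> are invented for illustration, and the connection is simulated:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(dbplyr)

# Two made-up lazy tables with the same single column.
jan &lt;- lazy_frame(dest = "IAH", con = simulate_postgres())
feb &lt;- lazy_frame(dest = "HOU", con = simulate_postgres())

# Rows that appear in both tables.
common &lt;- intersect(jan, feb)
common |&gt; show_query()</pre>
</div>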
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What is <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> translated to? How about <code><a href="https://rdrr.io/r/utils/head.html">head()</a></code>?</p></li>
<li>
<p>Explain what each of the following SQL queries does and try to recreate them using dbplyr.</p>
<pre data-type="programlisting" data-code-language="sql">SELECT *
FROM flights
WHERE dep_delay &lt; arr_delay
SELECT *, distance / (air_time / 60) AS speed
FROM flights</pre>
</li>
</ol></section>
</section>
<section id="sec-sql-expressions" data-type="sect1">
<h1>
Function translations</h1>
<p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>?</p>
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">summarize_query &lt;- function(df, ...) {
df |&gt;
summarize(...) |&gt;
show_query()
}
mutate_query &lt;- function(df, ...) {
df |&gt;
mutate(..., .keep = "none") |&gt;
show_query()
}</pre>
</div>
<p>Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>, have a relatively simple translation while others, like <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(year, month, day) |&gt;
summarize_query(
mean = mean(arr_delay, na.rm = TRUE),
median = median(arr_delay, na.rm = TRUE)
)
#&gt; `summarise()` has grouped output by "year" and "month". You can override
#&gt; using the `.groups` argument.
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; "year",
#&gt; "month",
#&gt; "day",
#&gt; AVG(arr_delay) AS mean,
#&gt; PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY arr_delay) AS median
#&gt; FROM flights
#&gt; GROUP BY "year", "month", "day"</pre>
</div>
<p>The translation of summary functions becomes more complicated when you use them inside a <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding <code>OVER</code> after it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(year, month, day) |&gt;
mutate_query(
mean = mean(arr_delay, na.rm = TRUE),
)
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; "year",
#&gt; "month",
#&gt; "day",
#&gt; AVG(arr_delay) OVER (PARTITION BY "year", "month", "day") AS mean
#&gt; FROM flights</pre>
</div>
<p>In SQL, the <code>GROUP BY</code> clause is used exclusively for summary so here you can see that the grouping has moved to the <code>PARTITION BY</code> argument to <code>OVER</code>.</p>
<p>Window functions include all functions that look forward or backwards, like <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lead()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
arrange(time_hour) |&gt;
mutate_query(
lead = lead(arr_delay),
lag = lag(arr_delay)
)
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; dest,
#&gt; LEAD(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lead,
#&gt; LAG(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lag
#&gt; FROM flights
#&gt; ORDER BY time_hour</pre>
</div>
<p>Here it’s important to <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> the data, because SQL tables have no intrinsic order. In fact, if you don’t use <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> you might get the rows back in a different order every time! Notice for window functions, the ordering information is repeated: the <code>ORDER BY</code> clause of the main query doesn’t automatically apply to window functions.</p>
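<p>dbplyr also provides <code>window_order()</code>, which sets the ordering used by window functions. The sketch below computes a running total with <code>cumsum()</code>; it assumes a three-column stand-in for <code>flights</code> and a simulated connection, and the name <code>running_delay</code> is invented for this example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(dbplyr)

# A made-up stand-in for flights; no database connection is needed.
flights_lf &lt;- lazy_frame(
  dest = "IAH", time_hour = 1, arr_delay = 5,
  con = simulate_postgres()
)

# Running total of delays per destination, ordered by time within each group.
running &lt;- flights_lf |&gt;
  group_by(dest) |&gt;
  window_order(time_hour) |&gt;
  mutate(running_delay = cumsum(arr_delay))

running |&gt; show_query()</pre>
</div>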
<p>Another important SQL function is <code>CASE WHEN</code>. It’s used as the translation of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, the dplyr function that it directly inspired. Here are a couple of simple examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate_query(
description = if_else(arr_delay &gt; 0, "delayed", "on-time")
)
#&gt; &lt;SQL&gt;
#&gt; SELECT CASE WHEN (arr_delay &gt; 0.0) THEN 'delayed' WHEN NOT (arr_delay &gt; 0.0) THEN 'on-time' END AS description
#&gt; FROM flights
flights |&gt;
mutate_query(
description =
case_when(
arr_delay &lt; -5 ~ "early",
arr_delay &lt; 5 ~ "on-time",
arr_delay &gt;= 5 ~ "late"
)
)
#&gt; &lt;SQL&gt;
#&gt; SELECT CASE
#&gt; WHEN (arr_delay &lt; -5.0) THEN 'early'
#&gt; WHEN (arr_delay &lt; 5.0) THEN 'on-time'
#&gt; WHEN (arr_delay &gt;= 5.0) THEN 'late'
#&gt; END AS description
#&gt; FROM flights</pre>
</div>
<p><code>CASE WHEN</code> is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate_query(
description = cut(
arr_delay,
breaks = c(-Inf, -5, 5, Inf),
labels = c("early", "on-time", "late")
)
)
#&gt; &lt;SQL&gt;
#&gt; SELECT CASE
#&gt; WHEN (arr_delay &lt;= -5.0) THEN 'early'
#&gt; WHEN (arr_delay &lt;= 5.0) THEN 'on-time'
#&gt; WHEN (arr_delay &gt; 5.0) THEN 'late'
#&gt; END AS description
#&gt; FROM flights</pre>
</div>
<p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="https://dbplyr.tidyverse.org/articles/translation-function.html">vignette("translation-function", package = "dbplyr")</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p>
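<p>To check how a single expression is translated, you can call <code>translate_sql()</code> directly. In this sketch the connection is simulated, <code>carrier</code> is just a column name, and <code>my_custom_fn()</code> is a hypothetical function name chosen to show what happens, under dbplyr’s default settings, with a function it doesn’t recognize:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dbplyr)

con &lt;- simulate_postgres()

# A function dbplyr knows about is mapped to a SQL counterpart.
known &lt;- translate_sql(toupper(carrier), con = con)

# my_custom_fn() is hypothetical: dbplyr has no translation for it, so by
# default the call is passed along for the database to interpret.
unknown &lt;- translate_sql(my_custom_fn(carrier), con = con)

known
unknown</pre>
</div>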
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s <em>the</em> most commonly used language for working with data and knowing some will make it easier for you to communicate with other data folks who don’t use R. If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p>
<ul><li>
<a href="https://sqlfordatascientists.com"><em>SQL for Data Scientists</em></a> by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.</li>
<li>
<a href="https://www.practicalsql.com"><em>Practical SQL</em></a> by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.</li>
</ul><p>In the next chapter, we’ll learn about another dplyr backend for working with large data: arrow. Arrow is designed for working with large files on disk, and is a natural complement to databases.</p>
</section>
</section>