diff --git a/arrow.qmd b/arrow.qmd
index 39087d5..3c02a03 100644
--- a/arrow.qmd
+++ b/arrow.qmd
@@ -141,6 +141,8 @@ This means that:
- Parquet files are "chunked", which makes it possible to work on different parts of the file at the same time, and, if you're lucky, to skip some chunks altogether.
+There's one primary disadvantage to parquet files: they are no longer "human readable", i.e. if you look at a parquet file using `readr::read_file()`, you'll just see a bunch of gibberish.
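+For example, here's a quick sketch (with a hypothetical file name) of what you'd see if you wrote a tiny parquet file and then peeked at it as raw text:
+
+```{r}
+#| eval: false
+df <- data.frame(x = 1:3, y = c("a", "b", "c"))
+arrow::write_parquet(df, "demo.parquet")
+readr::read_file_raw("demo.parquet") |> head(20)  # raw bytes, not human readable
+arrow::read_parquet("demo.parquet")               # the original data frame
+```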
+
### Partitioning
As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it's often useful to split large datasets across many files.
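A sketch of what partitioning looks like with arrow (assuming a data frame `df` with a `year` column; `write_dataset()` creates one directory per partition value):

```{r}
#| eval: false
df |>
  arrow::write_dataset("data/by-year", partitioning = "year")
# data/by-year/year=2021/part-0.parquet
# data/by-year/year=2022/part-0.parquet
# ...
```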
@@ -262,7 +264,7 @@ The \~100x speedup in performance is attributable to two factors: the multi-file
This massive difference in performance is why it pays off to convert large CSVs to parquet!
-### Using dbplyr with arrow
+### Using duckdb with arrow
There's one last advantage of parquet and arrow --- it's very easy to turn an arrow dataset into a DuckDB database (@sec-import-databases) by calling `arrow::to_duckdb()`:
@@ -278,6 +280,12 @@ seattle_pq |>
The neat thing about `to_duckdb()` is that the transfer doesn't involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.
+### Exercises
+
+1. Figure out the most popular book each year.
+2. Which author has the most books in the Seattle library system?
+3. How have checkouts of books vs ebooks changed over the last 10 years?
+
## Summary
In this chapter, you've been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets.
diff --git a/base-R.qmd b/base-R.qmd
index 7837c24..029c353 100644
--- a/base-R.qmd
+++ b/base-R.qmd
@@ -338,7 +338,7 @@ l <- list(
The difference between `[` and `[[` is particularly important for lists because `[[` drills down into the list while `[` returns a new, smaller list.
To help you remember the difference, take a look at the unusual pepper shaker shown in @fig-pepper.
-If this pepper shaker is your list `pepper`, then, `pepper[1]` is a pepper shaker containing a single pepper packet.
-If we suppose this pepper shaker is a list `pepper`, then, `pepper[1]` is a pepper shaker containing a single pepper packet.
+If we suppose this pepper shaker is a list called `pepper`, then `pepper[1]` is a pepper shaker containing a single pepper packet.
`pepper[2]` would look the same, but would contain the second packet.
`pepper[1:2]` would be a pepper shaker containing two pepper packets.
`pepper[[1]]` would extract the pepper packet itself.
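In code, the same analogy looks like this (a small sketch with a hypothetical `pepper` list):

```{r}
pepper <- list("packet 1", "packet 2", "packet 3")
pepper[1]    # a smaller list: still a "pepper shaker", holding one packet
pepper[1:2]  # a list holding two packets
pepper[[1]]  # the packet itself
```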
diff --git a/databases.qmd b/databases.qmd
index aae491c..07bd9ad 100644
--- a/databases.qmd
+++ b/databases.qmd
@@ -134,27 +134,15 @@ If you're using duckdb in a real project, we highly recommend learning about `du
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
We'll also show off a useful technique for loading multiple files into a database in @sec-save-database.
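A minimal sketch of the direct-load approach (the file name here is hypothetical; `duckdb_read_csv()` takes a connection, a table name, and one or more file paths):

```{r}
#| eval: false
con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_read_csv(con, "diamonds", "diamonds.csv")
DBI::dbListTables(con)
```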
-## DBI basics
+### DBI basics
-Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.
-
-### What's there?
-
-The most important database objects for data scientists are tables.
-DBI provides two useful functions to either list all the tables in the database[^databases-3] or to check if a specific table already exists:
+You can check that the data is loaded correctly by using a couple of other DBI functions: `dbListTables()` lists all tables in the database[^databases-3] and `dbReadTable()` retrieves the contents of a table.
[^databases-3]: At least, all the tables that you have permission to see.
```{r}
dbListTables(con)
-dbExistsTable(con, "foo")
-```
-### Extract some data
-
-Once you've determined a table exists, you can retrieve it with `dbReadTable()`:
-
-```{r}
con |>
dbReadTable("diamonds") |>
as_tibble()
@@ -162,12 +150,7 @@ con |>
`dbReadTable()` returns a `data.frame` so we use `as_tibble()` to convert it into a tibble so that it prints nicely.
-In real life, it's rare that you'll use `dbReadTable()` because often database tables are too big to fit in memory, and you want bring back only a subset of the rows and columns.
-
-### Run a query {#sec-dbGetQuery}
-
-The way you'll usually retrieve data is with `dbGetQuery()`.
-It takes a database connection and some SQL code and returns a data frame:
+If you already know SQL, you can use `dbGetQuery()` to get the results of running a query on the database:
```{r}
sql <- "
@@ -178,19 +161,13 @@ sql <- "
as_tibble(dbGetQuery(con, sql))
```
-Don't worry if you've never seen SQL before; you'll learn more about it shortly.
+If you've never seen SQL before, don't worry!
+You'll learn more about it shortly.
But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where `price` is greater than 15,000.
-You'll need to be a little careful with `dbGetQuery()` since it can potentially return more data than you have memory.
-We won't discuss it further here, but if you're dealing with very large datasets it's possible to deal with a "page" of data at a time by using `dbSendQuery()` to get a "result set" which you can page through by calling `dbFetch()` until `dbHasCompleted()` returns `TRUE`.
-
-### Other functions
-
-There are lots of other functions in DBI that you might find useful if you're managing your own data (like `dbWriteTable()` which we used in @sec-load-data), but we're going to skip past them in the interest of staying focused on working with data that already lives in a database.
-
## dbplyr basics
-Now that you've learned the low-level basics for connecting to a database and running a query, we're going to switch it up a bit and learn a bit about dbplyr.
+Now that we've connected to a database and loaded up some data, we can start to learn about dbplyr.
dbplyr is a dplyr **backend**, which means that you keep writing dplyr code but the backend executes it differently.
In this case, dbplyr translates to SQL; other backends include [dtplyr](https://dtplyr.tidyverse.org) which translates to [data.table](https://r-datatable.com), and [multidplyr](https://multidplyr.tidyverse.org) which executes your code on multiple cores.
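Concretely, you create a lazy table with `tbl()` and then use ordinary dplyr verbs; nothing runs on the database until you ask for results. A sketch, assuming the `con` connection from above:

```{r}
#| eval: false
diamonds_db <- dplyr::tbl(con, "diamonds")
diamonds_db |>
  dplyr::filter(price > 15000) |>
  dplyr::select(carat:clarity, price)
```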
@@ -234,7 +211,9 @@ big_diamonds_db
You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn't know the number of rows.
This is because finding the total number of rows usually requires executing the complete query, something we're trying to avoid.
-You can see the SQL code generated by the dbplyr function `show_query()`:
+You can see the SQL code generated by the dbplyr function `show_query()`.
+If you know dplyr, this is a great way to learn SQL!
+Write some dplyr code, get dbplyr to translate it to SQL, and then try to figure out how the two languages match up.
```{r}
big_diamonds_db |>
@@ -260,7 +239,7 @@ It's a rather non-traditional introduction to SQL but we hope it will get you qu
Luckily, if you understand dplyr you're in a great place to quickly pick up SQL because so many of the concepts are the same.
We'll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: `flights` and `planes`.
-These datasets are easy to get into our learning database because dbplyr has a function designed for this exact scenario:
+These datasets are easy to get into our learning database because dbplyr comes with a function that copies the tables from nycflights13 to our database:
```{r}
dbplyr::copy_nycflights13(con)
@@ -464,8 +443,8 @@ In this case, you could drop the parentheses and use a special operator that's e
WHERE "dep_delay" IS NOT NULL
```
-Note that if you `filter()` a variable that you created using a summarize, dbplyr will generate a `HAVING` clause, rather than a `FROM` clause.
-This is a one of the idiosyncracies of SQL created because `WHERE` is evaluated before `SELECT`, so it needs another clause that's evaluated afterwards.
+Note that if you `filter()` a variable that you created using a summarize, dbplyr will generate a `HAVING` clause, rather than a `WHERE` clause.
+This is one of the idiosyncrasies of SQL: `WHERE` is evaluated before `SELECT` and `GROUP BY`, so SQL needs another clause that's evaluated afterwards.
```{r}
diamonds_db |>
@@ -607,7 +586,7 @@ flights |>
)
```
-The translation of summary functions becomes more complicated when you use them inside a `mutate()` because they have to turn into a window function.
+The translation of summary functions becomes more complicated when you use them inside a `mutate()` because they have to turn into so-called **window** functions.
In SQL, you turn an ordinary aggregation function into a window function by adding `OVER` after it:
```{r}
@@ -618,9 +597,9 @@ flights |>
)
```
-In SQL, the `GROUP BY` clause is used exclusively for summary so here you can see that the grouping has moved to the `PARTITION BY` argument to `OVER`.
+In SQL, the `GROUP BY` clause is used exclusively for summaries so here you can see that the grouping has moved to the `PARTITION BY` argument to `OVER`.
-Window functions include all functions that look forward or backwards, like `lead()` and `lag()`:
+Window functions include all functions that look forwards or backwards, like `lead()` and `lag()` which look at the "next" or "previous" value respectively:
```{r}
flights |>
@@ -637,7 +616,7 @@ In fact, if you don't use `arrange()` you might get the rows back in a different
Notice for window functions, the ordering information is repeated: the `ORDER BY` clause of the main query doesn't automatically apply to window functions.
Another important SQL function is `CASE WHEN`. It's used as the translation of `if_else()` and `case_when()`, the dplyr function that it directly inspired.
-Here's a couple of simple examples:
+Here are a couple of simple examples:
```{r}
flights |>
diff --git a/rectangling.qmd b/rectangling.qmd
index 75edd61..a92c8f1 100644
--- a/rectangling.qmd
+++ b/rectangling.qmd
@@ -164,7 +164,7 @@ In this chapter, we'll focus on unnesting list-columns out into regular variable
The default print method just displays a rough summary of the contents.
The list column could be arbitrarily complex, so there's no good way to print it.
-If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you've learned above, like `df |> pull(z) |> str()` or `df |> pull(z) |> View()`.
+If you want to see it, you'll need to pull out just the one list-column and apply one of the techniques that you've learned above, like `df |> pull(z) |> str()` or `df |> pull(z) |> View()`.
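+For example, with a small hypothetical data frame containing a list-column `z`:
+
+```{r}
+library(tidyverse)
+df <- tibble(x = 1:2, z = list(1:3, "a"))
+df |> pull(z) |> str()
+```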
::: callout-note
## Base R
@@ -240,8 +240,6 @@ df1 |>
unnest_wider(y, names_sep = "_")
```
-You'll notice that `unnest_wider()`, much like `pivot_wider()`, turns implicit missing values in to explicit missing values.
-
### `unnest_longer()`
When each row contains an unnamed list, it's most natural to put each element into its own row with `unnest_longer()`:
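The behavior can be sketched with a tiny hypothetical example:

```{r}
library(tidyverse)
df <- tibble(x = 1:2, y = list(1:3, 4:5))
df |> unnest_longer(y)
# x is repeated once for each element of y, giving 5 rows
```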
@@ -302,7 +300,6 @@ tidyr has a few other useful rectangling functions that we're not going to cover
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book, but you might encounter if you use the [tidymodels](https://www.tmwr.org/base-r.html#combining-base-r-models-and-the-tidyverse) ecosystem.
-- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
These functions are good to know about as you might encounter them when reading other people's code or tackling rarer rectangling challenges yourself.
@@ -310,6 +307,7 @@ These functions are good to know about as you might encounter them when reading
1. What happens when you use `unnest_wider()` with unnamed list-columns like `df2`?
What argument is now necessary?
+ What happens to missing values?
2. What happens when you use `unnest_longer()` with named list-columns like `df1`?
What additional information do you get in the output?
@@ -555,8 +553,7 @@ locations |>
Note how we unnest two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.
-This is where `hoist()`, mentioned earlier in the chapter, can be useful.
-Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
+Once you've discovered the path to get to the components you're interested in, you can extract them directly using another tidyr function, `hoist()`:
```{r}
#| results: false
@@ -619,7 +616,7 @@ JSON is a simple format designed to be easily read and written by machines, not
It has six key data types.
Four of them are scalars:
-- The simplest type is a null (`null`) which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
+- The simplest type is a null (`null`) which plays the same role as `NA` in R. It represents the absence of data.
- A **string** is much like a string in R, but must always use double quotes.
- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support `Inf`, `-Inf`, or `NaN`.
- A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.
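To see how these scalars map onto R, here's a small sketch using jsonlite (which this chapter uses for reading JSON); the values are hypothetical:

```{r}
library(jsonlite)
str(parse_json('{"name": null, "count": 5, "ok": true, "title": "a string"}'))
```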
diff --git a/spreadsheets.qmd b/spreadsheets.qmd
index 434efe7..cbc9cc7 100644
--- a/spreadsheets.qmd
+++ b/spreadsheets.qmd
@@ -10,9 +10,8 @@ status("complete")
## Introduction
-So far, you have learned about importing data from plain text files, e.g. `.csv` and `.tsv` files.
-Sometimes you need to analyze data that lives in a spreadsheet.
-This chapter will introduce you to tools for working with data in Excel spreadsheets and Google Sheets.
+In @sec-data-import you learned about importing data from plain text files like `.csv` and `.tsv`.
+Now it's time to learn how to get data out of a spreadsheet, either an Excel spreadsheet or a Google Sheet.
This will build on much of what you've learned in @sec-data-import, but we will also discuss additional considerations and complexities when working with data from spreadsheets.
If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper "Data Organization in Spreadsheets" by Karl Broman and Kara Woo.
diff --git a/webscraping.qmd b/webscraping.qmd
--- a/webscraping.qmd
+++ b/webscraping.qmd
-The `<p>` element above has one child, the `<b>` element.
+The **children** are the elements it contains, so the `<p>` element above has one child, the `<b>` element.
The `<b>` element has no children, but it does have contents (the text "name").
### Attributes
@@ -158,9 +160,9 @@ Attributes are also used to record the destination of links (the `href` attribut
To get started scraping, you'll need the URL of the page you want to scrape, which you can usually copy from your web browser.
You'll then need to read the HTML for that page into R with `read_html()`.
-This returns a `xml_document`[^webscraping-5] object which you'll then manipulate using rvest functions:
+This returns an `xml_document`[^webscraping-6] object which you'll then manipulate using rvest functions:
-[^webscraping-5]: This class comes from the [xml2](https://xml2.r-lib.org) package.
+[^webscraping-6]: This class comes from the [xml2](https://xml2.r-lib.org) package.
xml2 is a low-level package that rvest builds on top of.
```{r}
@@ -218,7 +220,7 @@ html |> html_elements(".important")
html |> html_elements("#first")
```
-Another important function is `html_element()` which always the number of outputs as inputs.
+Another important function is `html_element()` which always returns the same number of outputs as inputs.
If you apply it to a whole document it'll give you the first match:
```{r}
@@ -244,9 +246,9 @@ Here we have an unordered list (`<ul>`) where each list item (`<li>`)
")
```
@@ -265,13 +267,15 @@ characters |> html_element("b")
```
The distinction between `html_element()` and `html_elements()` isn't important for name, but it is important for weight.
-We want to try and get the weight for each character
+We want to get one weight for each character, even if there's no weight `<span>`.
+That's what `html_element()` does:
```{r}
characters |> html_element(".weight")
```
-If we instead used `html_elements()`, we lose the connection between names and weights:
+`html_elements()` finds all weight `<span>`s that are children of `characters`.
+There are only three of these, so we lose the connection between names and weights:
```{r}
characters |> html_elements(".weight")
@@ -281,25 +285,21 @@ Now that you've selected the elements of interest, you'll need to extract the da
### Text and attributes
-`html_text2()`[^webscraping-6] extracts the plain text contents of an HTML element:
+`html_text2()`[^webscraping-7] extracts the plain text contents of an HTML element:
-[^webscraping-6]: rvest also provides `html_text()` but you should almost always use `html_text2()` since it does a better job of converting nested HTML to text.
+[^webscraping-7]: rvest also provides `html_text()` but you should almost always use `html_text2()` since it does a better job of converting nested HTML to text.
```{r}
-html <- minimal_html("
-
-
-")
-html |>
- html_element("ol") |>
- html_elements("li") |>
+characters |>
+ html_element("b") |>
+ html_text2()
+
+characters |>
+ html_element(".weight") |>
html_text2()
```
-Note that the escaped ampersand is automatically converted to `&`; you'll only ever see HTML escapes in the source HTML, not in the data returned by rvest.
+Note that any escapes will be automatically handled; you'll only ever see HTML escapes in the source HTML, not in the data returned by rvest.
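+For example, a small sketch using `minimal_html()`:
+
+```{r}
+library(rvest)
+minimal_html("<p>apple &amp; pear</p>") |>
+  html_element("p") |>
+  html_text2()
+```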
`html_attr()` extracts data from attributes:
@@ -411,7 +411,7 @@ section <- html |> html_elements("section")
section
```
-The retrieves seven nodes matching the seven movies found on that page, suggesting that using `section` as a selector is good.
+This retrieves seven elements matching the seven movies found on that page, suggesting that using `section` as a selector is good.
Extracting the individual elements is straightforward since the data is always found in the text.
It's just a matter of finding the right selector:
@@ -425,14 +425,20 @@ Once we've done that for each component, we can wrap all the results up into a t
```{r}
tibble(
- title = section |> html_element("h2") |> html_text2(),
+ title = section |>
+ html_element("h2") |>
+ html_text2(),
released = section |>
html_element("p") |>
html_text2() |>
str_remove("Released: ") |>
parse_date(),
- director = section |> html_element(".director") |> html_text2(),
- intro = section |> html_element(".crawl") |> html_text2()
+ director = section |>
+ html_element(".director") |>
+ html_text2(),
+ intro = section |>
+ html_element(".crawl") |>
+ html_text2()
)
```
@@ -473,22 +479,22 @@ This includes a few empty columns, but overall does a good job of capturing the
However, we need to do some more processing to make it easier to use.
First, we'll rename the columns to be easier to work with, and remove the extraneous whitespace in rank and title.
We will do this with `select()` (instead of `rename()`) to do the renaming and selecting of just these two columns in one step.
-Then, we'll apply `separate_wider_regex()` (from @sec-extract-variables) to pull out the title, year, and rank into their own variables.
+Then we'll remove the newlines and extra spaces, and apply `separate_wider_regex()` (from @sec-extract-variables) to pull out the title, year, and rank into their own variables.
```{r}
-ratings <- table |>
+ratings <- table |>
select(
rank_title_year = `Rank & Title`,
rating = `IMDb Rating`
) |>
mutate(
- rank_title_year = str_squish(rank_title_year)
+ rank_title_year = str_replace_all(rank_title_year, "\n +", " ")
) |>
separate_wider_regex(
rank_title_year,
patterns = c(
rank = "\\d+", "\\. ",
- title = ".+", " \\(",
+ title = ".+", " +\\(",
year = "\\d+", "\\)"
)
)
@@ -533,7 +539,7 @@ In many cases, that's because you're trying to scrape a website that dynamically
This doesn't currently work with rvest, because rvest downloads the raw HTML and doesn't run any javascript.
It's still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser including running all javascript.
-This functionality is not available at the time of writing, but it's something we're actively working on and should be available by the time you read this.
+This functionality is not available at the time of writing, but it's something we're actively working on and might be available by the time you read this.
It uses the [chromote package](https://rstudio.github.io/chromote/index.html) which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons.
Check out the rvest website for more details.