Technical review comments for import (#1345)

Includes fix for #1342
Hadley Wickham 2023-03-08 07:30:03 -06:00 committed by GitHub
parent 424665c929
commit 08c3cdf6f2
6 changed files with 102 additions and 150 deletions

View File

@ -141,6 +141,8 @@ This means that:
- Parquet files are "chunked", which makes it possible to work on different parts of the file at the same time, and, if you're lucky, to skip some chunks altogether.
There's one primary disadvantage to parquet files: they are no longer "human readable", i.e. if you look at a parquet file using `readr::read_file()`, you'll just see a bunch of gibberish.
### Partitioning
As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it's often useful to split large datasets across many files.
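For example, arrow's `write_dataset()` can split a dataset into one directory of parquet files per value of a partitioning column. A rough sketch (`seattle_csv` and `CheckoutYear` stand in for the chapter's dataset object and partitioning column; the output path is made up):
```{r}
#| eval: false
seattle_csv |>
  write_dataset(
    path = "data/seattle-library-checkouts-partitioned",
    format = "parquet",
    partitioning = "CheckoutYear"
  )
```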
@ -262,7 +264,7 @@ The \~100x speedup in performance is attributable to two factors: the multi-file
This massive difference in performance is why it pays off to convert large CSVs to parquet!
### Using duckdb with arrow
There's one last advantage of parquet and arrow --- it's very easy to turn an arrow dataset into a DuckDB database (@sec-import-databases) by calling `arrow::to_duckdb()`:
@ -278,6 +280,12 @@ seattle_pq |>
The neat thing about `to_duckdb()` is that the transfer doesn't involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.
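For instance, a rough sketch of the kind of pipeline this enables (the `seattle_pq` object and column names follow the chapter's example):
```{r}
#| eval: false
seattle_pq |>
  to_duckdb() |>                                  # hand the data to duckdb without copying
  filter(CheckoutYear >= 2018) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>   # let duckdb do the heavy lifting
  collect()
```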
### Exercises
1. Figure out the most popular book each year.
2. Which author has the most books in the Seattle library system?
3. How have checkouts of books vs ebooks changed over the last 10 years?
## Summary
In this chapter, you've been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets.

View File

@ -338,7 +338,7 @@ l <- list(
The difference between `[` and `[[` is particularly important for lists because `[[` drills down into the list while `[` returns a new, smaller list.
To help you remember the difference, take a look at the unusual pepper shaker shown in @fig-pepper.
If we suppose this pepper shaker is a list called `pepper`, then `pepper[1]` is a pepper shaker containing a single pepper packet.
`pepper[2]` would look the same, but would contain the second packet.
`pepper[1:2]` would be a pepper shaker containing two pepper packets.
`pepper[[1]]` would extract the pepper packet itself.
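Translating the analogy into a tiny sketch (the packet values are made up):
```{r}
pepper <- list("packet 1", "packet 2", "packet 3")

pepper[1]    # a smaller pepper shaker: a list holding one packet
pepper[1:2]  # still a pepper shaker, now holding two packets
pepper[[1]]  # the packet itself
```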

View File

@ -134,27 +134,15 @@ If you're using duckdb in a real project, we highly recommend learning about `du
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
We'll also show off a useful technique for loading multiple files into a database in @sec-save-database.
### DBI basics
Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.
You can check that the data is loaded correctly by using a couple of other DBI functions: `dbListTables()` lists all tables in the database[^databases-3] and `dbReadTable()` retrieves the contents of a table.
[^databases-3]: At least, all the tables that you have permission to see.
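For example, a quick sketch of that first check (assuming the chapter's connection object `con`):
```{r}
#| eval: false
dbListTables(con)  # a character vector of table names
```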
```{r}
con |>
dbReadTable("diamonds") |>
as_tibble()
@ -162,12 +150,7 @@ con |>
`dbReadTable()` returns a `data.frame` so we use `as_tibble()` to convert it into a tibble so that it prints nicely.
In real life, it's rare that you'll use `dbReadTable()` because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.
If you already know SQL, you can use `dbGetQuery()` to get the results of running a query on the database:
```{r}
sql <- "
@ -178,19 +161,13 @@ sql <- "
as_tibble(dbGetQuery(con, sql))
```
If you've never seen SQL before, don't worry!
You'll learn more about it shortly.
But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where `price` is greater than 15,000.
You'll need to be a little careful with `dbGetQuery()` since it can potentially return more data than you have memory for.
We won't discuss it further here, but if you're dealing with very large datasets it's possible to retrieve a "page" of data at a time by using `dbSendQuery()` to get a "result set" which you can page through by calling `dbFetch()` until `dbHasCompleted()` returns `TRUE`.
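A rough sketch of that pattern, assuming the `con` and `sql` objects from above:
```{r}
#| eval: false
res <- dbSendQuery(con, sql)
while (!dbHasCompleted(res)) {
  page <- dbFetch(res, n = 1000)  # up to 1000 rows at a time
  # ... do something with `page` ...
}
dbClearResult(res)
```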
There are lots of other functions in DBI that you might find useful if you're managing your own data (like `dbWriteTable()` which we used in @sec-load-data), but we're going to skip past them in the interest of staying focused on working with data that already lives in a database.
## dbplyr basics
Now that we've connected to a database and loaded up some data, we can start to learn about dbplyr.
dbplyr is a dplyr **backend**, which means that you keep writing dplyr code but the backend executes it differently.
In this case, dbplyr translates to SQL; other backends include [dtplyr](https://dtplyr.tidyverse.org) which translates to [data.table](https://r-datatable.com), and [multidplyr](https://multidplyr.tidyverse.org) which executes your code on multiple cores.
@ -234,7 +211,9 @@ big_diamonds_db
You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn't know the number of rows.
This is because finding the total number of rows usually requires executing the complete query, something we're trying to avoid.
You can see the SQL code generated by the dbplyr function `show_query()`.
If you know dplyr, this is a great way to learn SQL!
Write some dplyr code, get dbplyr to translate it to SQL, and then try to figure out how the two languages match up.
```{r}
big_diamonds_db |>
@ -260,7 +239,7 @@ It's a rather non-traditional introduction to SQL but we hope it will get you qu
Luckily, if you understand dplyr you're in a great place to quickly pick up SQL because so many of the concepts are the same.
We'll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: `flights` and `planes`.
These datasets are easy to get into our learning database because dbplyr comes with a function that copies the tables from nycflights13 to our database:
```{r}
dbplyr::copy_nycflights13(con)
@ -464,8 +443,8 @@ In this case, you could drop the parentheses and use a special operator that's e
WHERE "dep_delay" IS NOT NULL
```
Note that if you `filter()` a variable that you created using `summarize()`, dbplyr will generate a `HAVING` clause, rather than a `WHERE` clause.
This is one of the idiosyncrasies of SQL: `WHERE` is evaluated before `SELECT` and `GROUP BY`, so SQL needs another clause that's evaluated afterwards.
```{r}
diamonds_db |>
@ -607,7 +586,7 @@ flights |>
)
```
The translation of summary functions becomes more complicated when you use them inside a `mutate()` because they have to turn into so-called **window** functions.
In SQL, you turn an ordinary aggregation function into a window function by adding `OVER` after it:
```{r}
@ -618,9 +597,9 @@ flights |>
)
```
In SQL, the `GROUP BY` clause is used exclusively for summaries so here you can see that the grouping has moved to the `PARTITION BY` argument to `OVER`.
Window functions include all functions that look forward or backwards, like `lead()` and `lag()`, which look at the "next" or "previous" value respectively:
```{r}
flights |>
@ -637,7 +616,7 @@ In fact, if you don't use `arrange()` you might get the rows back in a different
Notice for window functions, the ordering information is repeated: the `ORDER BY` clause of the main query doesn't automatically apply to window functions.
Another important SQL function is `CASE WHEN`. It's used as the translation of `if_else()` and `case_when()`, the dplyr function that it directly inspired.
Here are a couple of simple examples:
```{r}
flights |>

View File

@ -164,7 +164,7 @@ In this chapter, we'll focus on unnesting list-columns out into regular variable
The default print method just displays a rough summary of the contents.
The list column could be arbitrarily complex, so there's no good way to print it.
If you want to see it, you'll need to pull out just the one list-column and apply one of the techniques that you've learned above, like `df |> pull(z) |> str()` or `df |> pull(z) |> View()`.
::: callout-note
## Base R
@ -240,8 +240,6 @@ df1 |>
unnest_wider(y, names_sep = "_")
```
### `unnest_longer()`
When each row contains an unnamed list, it's most natural to put each element into its own row with `unnest_longer()`:
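For instance, a minimal sketch (this `df2` is made up, standing in for the chapter's example, with the tidyverse loaded as usual):
```{r}
df2 <- tibble(
  x = 1:3,
  y = list(c(11, 12, 13), 21, c(31, 32))
)

df2 |>
  unnest_longer(y)
```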
@ -302,7 +300,6 @@ tidyr has a few other useful rectangling functions that we're not going to cover
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book, but you might encounter if you use the [tidymodels](https://www.tmwr.org/base-r.html#combining-base-r-models-and-the-tidyverse) ecosystem.
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
These functions are good to know about as you might encounter them when reading other people's code or tackling rarer rectangling challenges yourself.
@ -310,6 +307,7 @@ These functions are good to know about as you might encounter them when reading
1. What happens when you use `unnest_wider()` with unnamed list-columns like `df2`?
What argument is now necessary?
What happens to missing values?
2. What happens when you use `unnest_longer()` with named list-columns like `df1`?
What additional information do you get in the output?
@ -555,8 +553,7 @@ locations |>
Note how we unnest two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.
Once you've discovered the path to get to the components you're interested in, you can extract them directly using another tidyr function, `hoist()`:
```{r}
#| results: false
@ -619,7 +616,7 @@ JSON is a simple format designed to be easily read and written by machines, not
It has six key data types.
Four of them are scalars:
- The simplest type is a null (`null`) which plays the same role as `NA` in R. It represents the absence of data.
- A **string** is much like a string in R, but must always use double quotes.
- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support `Inf`, `-Inf`, or `NaN`.
- A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.
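To see how these scalars come into R, here is a small sketch using `jsonlite::parse_json()` (the JSON string is made up):
```{r}
library(jsonlite)

str(parse_json('{"missing": null, "name": "ten", "x": 1.23e3, "flag": true}'))
```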

View File

@ -10,9 +10,8 @@ status("complete")
## Introduction
In @sec-data-import you learned about importing data from plain text files like `.csv` and `.tsv`.
Now it's time to learn how to get data out of a spreadsheet, either an Excel spreadsheet or a Google Sheet.
This will build on much of what you've learned in @sec-data-import, but we will also discuss additional considerations and complexities when working with data from spreadsheets.
If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper "Data Organization in Spreadsheets" by Karl Broman and Kara Woo: <https://doi.org/10.1080/00031305.2017.1375989>.
@ -24,18 +23,16 @@ The best practices presented in this paper will save you much headache when you
In this section, you'll learn how to load data from Excel spreadsheets in R with the **readxl** package.
This package is non-core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package.
Later, we'll also use the writexl package, which allows us to create Excel spreadsheets.
```{r}
#| message: false
library(readxl)
library(tidyverse)
library(writexl)
```
### Getting started
@ -201,6 +198,7 @@ knitr::include_graphics("screenshots/import-spreadsheets-penguins-islands.png")
```
You can read a single worksheet from a spreadsheet with the `sheet` argument in `read_excel()`.
The default, which we've been relying on up until now, is the first sheet.
```{r}
read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
@ -280,45 +278,19 @@ deaths
```
The top three rows and the bottom four rows are not part of the data frame.
It's possible to eliminate these extraneous rows using the `skip` and `n_max` arguments, but we recommend using cell ranges.
In Excel, the top left cell is `A1`.
As you move across columns to the right, the cell label moves down the alphabet, i.e.
`B1`, `C1`, etc.
And as you move down a column, the number in the cell label increases, i.e.
`A2`, `A3`, etc.
Here the data we want to read in starts in cell `A5` and ends in cell `F15`.
In spreadsheet notation, this is `A5:F15`, which we supply to the `range` argument:
```{r}
read_excel(deaths_path, range = "A5:F15")
```
### Data types
@ -326,17 +298,17 @@ In CSV files, all values are strings.
This is not particularly true to the data, but it is simple: everything is a string.
The underlying data in Excel spreadsheets is more complex.
A cell can be one of four things:
- A boolean, like `TRUE`, `FALSE`, or `NA`.
- A number, like "10" or "10.5".
- A datetime, which can also include time like "11/1/21" or "11/1/21 3:00 PM".
- A text string, like "ten".
When working with spreadsheet data, it's important to keep in mind that the underlying data can be very different than what you see in the cell.
For example, Excel has no notion of an integer.
All numbers are stored as floating points, but you can choose to display the data with a customizable number of decimal points.
Similarly, dates are actually stored as numbers, specifically the number of days since January 1, 1900.
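As a quick sketch of what that means in practice (the serial number is made up; readxl normally does this conversion for you):
```{r}
# Excel counts days, with day 1 being January 1, 1900 (plus a phantom
# February 29, 1900), which is why the conventional origin is 1899-12-30.
as.Date(44561, origin = "1899-12-30")  # 2021-12-31
```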
@ -353,8 +325,8 @@ In these cases you can set the type for this column to `"list"`, which will load
### Data not in cell values
Sometimes data is stored in more exotic ways, like the color of the cell background, or whether or not the text is bold.
In such cases, you might find the [tidyxl package](https://nacnudus.github.io/tidyxl/) useful.
See <https://nacnudus.github.io/spreadsheet-munging-strategies/> for more on strategies for working with non-tabular data from Excel.
### Writing to Excel {#sec-writing-to-excel}
@ -371,12 +343,11 @@ bake_sale <- tibble(
bake_sale
```
You can write data back to disk as an Excel file using `write_xlsx()` from the [writexl package](https://docs.ropensci.org/writexl/):
```{r}
#| eval: false
write_xlsx(bake_sale, path = "data/bake-sale.xlsx")
```
@ -406,7 +377,7 @@ read_excel("data/bake-sale.xlsx")
### Formatted output
The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you're interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the [openxlsx package](https://ycphs.github.io/openxlsx).
We won't go into the details of using this package here, but we recommend reading <https://ycphs.github.io/openxlsx/articles/Formatting.html> for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.
Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar.
@ -578,8 +549,7 @@ This is the same dataset as in @fig-students-excel, except it's stored in a Goog
knitr::include_graphics("screenshots/import-googlesheets-students.png")
```
The first argument to `read_sheet()` is the URL of the file to read, and it returns a tibble:
```{r}
#| include: false
@ -590,11 +560,6 @@ gs4_deauth()
```{r}
students_url <- "https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
students <- read_sheet(students_url)
students
```
@ -606,12 +571,8 @@ students <- read_sheet(
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
na = c("", "N/A"),
col_types = "dcccc"
)
students
```
@ -620,7 +581,7 @@ Note that we defined column types a bit differently here, using short codes.
For example, "dcccc" stands for "double, character, character, character, character".
It's also possible to read individual sheets from Google Sheets.
Let's read the "Torgersen Island" sheet from the [penguins Google Sheet](https://pos.it/r4ds-penguins):
```{r}
penguins_url <- "https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY"
@ -644,7 +605,8 @@ deaths
### Write sheets
You can write from R to Google Sheets with `write_sheet()`.
The first argument is the data frame to write, and the second argument is the name (or other identifier) of the Google Sheet to write to:
```{r}
#| eval: false

View File

@ -12,10 +12,12 @@ status("complete")
This vignette introduces you to the basics of web scraping with [rvest](https://rvest.tidyverse.org).
Web scraping is a very useful tool for extracting data from web pages.
Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from @sec-rectangling.
Where possible, you should use the API[^webscraping-1], because typically it will give you more reliable data.
Unfortunately, however, programming with web APIs is out of scope for this book.
Instead, we are teaching scraping, a technique that works whether or not a site provides an API.
[^webscraping-1]: And many popular APIs already have CRAN packages that wrap them, so start with a little research first!
In this chapter, we'll first discuss the ethics and legalities of scraping before we dive into the basics of HTML.
You'll then learn the basics of CSS selectors to locate specific elements on the page, and how to use rvest functions to get data from text and attributes out of HTML and into R.
We'll then discuss some techniques to figure out what CSS selector you need for the page you're scraping, before finishing up with a couple of case studies, and a brief discussion of dynamic websites.
@ -40,10 +42,10 @@ Before we get started discussing the code you'll need to perform web scraping, w
Overall, the situation is complicated with regards to both of these.
Legalities depend a lot on where you live.
However, as a general principle, if the data is public, non-personal, and factual, you're likely to be ok[^webscraping-2].
These three factors are important because they're connected to the site's terms and conditions, personally identifiable information, and copyright, as we'll discuss below.
[^webscraping-2]: Obviously we're not lawyers, and this is not legal advice.
But this is the best summary we can give having read a bunch about this topic.
If the data isn't public, non-personal, or factual or you're scraping the data specifically to make money with it, you'll need to talk to a lawyer.
@ -58,12 +60,12 @@ If you look closely, you'll find many websites include a "terms and conditions"
These pages tend to be a legal land grab where companies make very broad claims.
It's polite to respect these terms of service where possible, but take any claims with a grain of salt.
US courts[^webscraping-3] have generally found that simply putting the terms of service in the footer of the website isn't sufficient for you to be bound by them.
Generally, to be bound to the terms of service, you must have taken some explicit action like creating an account or checking a box.
This is why whether or not the data is **public** is important; if you don't need an account to access them, it is unlikely that you are bound to the terms of service.
Note, however, the situation is rather different in Europe where courts have found that terms of service are enforceable even if you don't explicitly agree to them.
[^webscraping-3]: e.g. <https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn>
### Personally identifiable information
@ -71,9 +73,9 @@ Even if the data is public, you should be extremely careful about scraping perso
Europe has particularly strict laws about the collection and storage of such data (GDPR), and regardless of where you live you're likely to be entering an ethical quagmire.
For example, in 2016, a group of researchers scraped public profile information (e.g. usernames, age, gender, location, etc.) about 70,000 people on the dating site OkCupid and they publicly released these data without any attempts at anonymization.
While the researchers felt that there was nothing wrong with this since the data were already public, this work was widely condemned due to ethics concerns around identifiability of users whose information was released in the dataset.
If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.[^webscraping-4]
[^webscraping-4]: One example of an article on the OkCupid study was published by [Wired](https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/).
### Copyright
@ -111,9 +113,9 @@ HTML stands for **H**yper**T**ext **M**arkup **L**anguage and looks something li
<!--# MCR: Is there a reason why you're using single quotes for HTML stuff? Any objection to changing those to double quotes? -->
HTML has a hierarchical structure formed by **elements** which consist of a start tag (e.g. `<tag>`), optional **attributes** (`id='first'`), an end tag[^webscraping-5] (like `</tag>`), and **contents** (everything in between the start and end tag).
[^webscraping-5]: A number of tags (including `<p>` and `<li>`) don't require end tags, but we think it's best to include them because it makes seeing the structure of the HTML a little easier.
Since `<` and `>` are used for start and end tags, you can't write them directly.
Instead you have to use the HTML **escapes** `&gt;` (greater than) and `&lt;` (less than).
@ -144,7 +146,7 @@ For example, the following HTML contains paragraph of text, with one word in bol
Hi! My <b>name</b> is Hadley.
</p>
The **children** are the elements it contains, so the `<p>` element above has one child, the `<b>` element.
The `<b>` element has no children, but it does have contents (the text "name").
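A small sketch of how this looks from R, using rvest's `minimal_html()` and `html_children()`:
```{r}
library(rvest)

html <- minimal_html("<p>Hi! My <b>name</b> is Hadley.</p>")
p <- html |> html_element("p")

html_children(p)  # one child: the <b> element
html_text2(p)     # the full contents: "Hi! My name is Hadley."
```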
### Attributes
@ -158,9 +160,9 @@ Attributes are also used to record the destination of links (the `href` attribut
To get started scraping, you'll need the URL of the page you want to scrape, which you can usually copy from your web browser.
You'll then need to read the HTML for that page into R with `read_html()`.
This returns an `xml_document`[^webscraping-6] object which you'll then manipulate using rvest functions:
[^webscraping-6]: This class comes from the [xml2](https://xml2.r-lib.org) package.
xml2 is a low-level package that rvest builds on top of.
```{r}
@ -218,7 +220,7 @@ html |> html_elements(".important")
html |> html_elements("#first")
```
Another important function is `html_element()` which always returns the same number of outputs as inputs.
If you apply it to a whole document it'll give you the first match:
```{r}
@ -244,9 +246,9 @@ Here we have an unordered list (`<ul>)` where each list item (`<li>`) contains s
html <- minimal_html("
<ul>
<li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
<li><b>R4-P17</b> is a <i>droid</i></li>
<li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
<li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
</ul>
")
```
@ -265,13 +267,15 @@ characters |> html_element("b")
```
The distinction between `html_element()` and `html_elements()` isn't important for name, but it is important for weight.
We want to get one weight for each character, even if there's no weight `<span>`.
That's what `html_element()` does:
```{r}
characters |> html_element(".weight")
```
`html_elements()` finds all weight `<span>`s that are children of `characters`.
There are only three of these, so we lose the connection between names and weights:
```{r}
characters |> html_elements(".weight")
@ -281,25 +285,21 @@ Now that you've selected the elements of interest, you'll need to extract the da
### Text and attributes
`html_text2()`[^webscraping-7] extracts the plain text contents of an HTML element:
[^webscraping-7]: rvest also provides `html_text()` but you should almost always use `html_text2()` since it does a better job of converting nested HTML to text.
```{r}
characters |>
html_element("b") |>
html_text2()
characters |>
html_element(".weight") |>
html_text2()
```
Note that any escapes will be automatically handled; you'll only ever see HTML escapes in the source HTML, not in the data returned by rvest.
`html_attr()` extracts data from attributes:
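For example, a minimal sketch with a made-up link (rvest loaded as above):
```{r}
html <- minimal_html("<p><a href='https://en.wikipedia.org/wiki/Cat'>cats</a></p>")

html |>
  html_elements("a") |>
  html_attr("href")
```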
@ -411,7 +411,7 @@ section <- html |> html_elements("section")
section
```
This retrieves seven elements matching the seven movies found on that page, suggesting that using `section` as a selector is good.
Extracting the individual elements is straightforward since the data is always found in the text.
It's just a matter of finding the right selector:
@ -425,14 +425,20 @@ Once we've done that for each component, we can wrap all the results up into a t
```{r}
tibble(
title = section |>
html_element("h2") |>
html_text2(),
released = section |>
html_element("p") |>
html_text2() |>
str_remove("Released: ") |>
parse_date(),
director = section |>
html_element(".director") |>
html_text2(),
intro = section |>
html_element(".crawl") |>
html_text2()
)
```
@ -473,22 +479,22 @@ This includes a few empty columns, but overall does a good job of capturing the
However, we need to do some more processing to make it easier to use.
First, we'll rename the columns to be easier to work with, and remove the extraneous whitespace in rank and title.
We'll do this with `select()` (instead of `rename()`) so that we can rename and select just these two columns in a single step.
Then we'll remove the new lines and extra spaces, and then apply `separate_wider_regex()` (from @sec-extract-variables) to pull out the title, year, and rank into their own variables.
```{r}
ratings <- table |>
select(
rank_title_year = `Rank & Title`,
rating = `IMDb Rating`
) |>
mutate(
rank_title_year = str_replace_all(rank_title_year, "\n +", " ")
) |>
separate_wider_regex(
rank_title_year,
patterns = c(
rank = "\\d+", "\\. ",
title = ".+", " +\\(",
year = "\\d+", "\\)"
)
)
@ -533,7 +539,7 @@ In many cases, that's because you're trying to scrape a website that dynamically
This doesn't currently work with rvest, because rvest downloads the raw HTML and doesn't run any javascript.
It's still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser including running all javascript.
This functionality is not available at the time of writing, but it's something we're actively working on and might be available by the time you read this.
It uses the [chromote package](https://rstudio.github.io/chromote/index.html) which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons.
Check out the rvest website for more details.