Webscraping editor comments

mine-cetinkaya-rundel 2023-03-11 11:25:45 -05:00
parent d58d313b5e
commit 04c0d1907b
1 changed file with 8 additions and 9 deletions

@@ -9,7 +9,7 @@ status("complete")
## Introduction
-This vignette introduces you to the basics of web scraping with [rvest](https://rvest.tidyverse.org).
+This chapter introduces you to the basics of web scraping with [rvest](https://rvest.tidyverse.org).
Web scraping is a very useful tool for extracting data from web pages.
Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from @sec-rectangling.
Where possible, you should use the API[^webscraping-1], because typically it will give you more reliable data.
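Reviewer's sketch (not part of this commit): the point about APIs returning JSON can be illustrated without touching the network. The JSON string below is made up and stands in for an API response; the use of the jsonlite package is an assumption, since the chapter itself only points to @sec-rectangling for handling JSON.

```r
library(jsonlite)

# Hypothetical API response body: a JSON array of two records.
response_body <- '[
  {"title": "Movie A", "rating": 4.5},
  {"title": "Movie B", "rating": 3.8}
]'

# An array of flat records is simplified to a data frame,
# with one row per record and one column per field.
movies <- fromJSON(response_body)
movies
```

Because the structure is already explicit in the JSON, no selectors or text cleaning are needed, which is one reason an API tends to be more reliable than scraping.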
@@ -70,10 +70,10 @@ Note, however, the situation is rather different in Europe where courts have fou
### Personally identifiable information
Even if the data is public, you should be extremely careful about scraping personally identifiable information like names, email addresses, phone numbers, dates of birth, etc.
-Europe has particularly strict laws about the collection of storage of such data (GDPR), and regardless of where you live you're likely to be entering an ethical quagmire.
+Europe has particularly strict laws about the collection of storage of such data ([GDPR](https://gdpr-info.eu/)), and regardless of where you live you're likely to be entering an ethical quagmire.
For example, in 2016, a group of researchers scraped public profile information (e.g. usernames, age, gender, location, etc.) about 70,000 people on the dating site OkCupid and they publicly released these data without any attempts for anonymization.
While the researchers felt that there was nothing wrong with this since the data were already public, this work was widely condemned due to ethics concerns around identifiability of users whose information was released in the dataset.
-If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.[^webscraping-4]
+If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study[^webscraping-4] as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.
[^webscraping-4]: One example of an article on the OkCupid study was published by the [https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science](https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/).
@@ -111,8 +111,6 @@ HTML stands for **H**yper**T**ext **M**arkup **L**anguage and looks something li
</body>
```
-<!--# MCR: Is there a reason why you're using single quotes for HTML stuff? Any objection to changing those to double quotes? -->
HTML has a hierarchical structure formed by **elements** which consist of a start tag (e.g. `<tag>`), optional **attributes** (`id='first'`), an end tag[^webscraping-5] (like `</tag>`), and **contents** (everything in between the start and end tag).
[^webscraping-5]: A number of tags (including `<p>` and `<li>)` don't require end tags, but we think it's best to include them because it makes seeing the structure of the HTML a little easier.
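Reviewer's sketch (not part of this commit): the anatomy of an element described in this hunk (start tag, attributes, end tag, contents) can be inspected with rvest itself. The HTML fragment and its `id` value are made up for illustration; `minimal_html()` is rvest's helper for parsing small inline fragments.

```r
library(rvest)

# A small, made-up fragment: one <p> element with an id attribute.
html <- minimal_html("<p id='first'>This is a <b>bold</b> paragraph.</p>")

# Select the element by its tag name.
p <- html_element(html, "p")

html_name(p)        # the tag name of the element
html_attr(p, "id")  # the value of the id attribute
html_text2(p)       # the text contents between the start and end tags
```

Note that the nested `<b>` element is itself part of the contents of `<p>`, which is what gives HTML its hierarchical structure.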
@@ -126,7 +124,7 @@ Web scraping is possible because most pages that contain data that you want to s
### Elements
-All up, there are over 100 HTML elements.
+There are over 100 HTML elements.
Some of the most important are:
- Every HTML page must be in an `<html>` element, and it must have two children: `<head>`, which contains document metadata like the page title, and `<body>`, which contains the content you see in the browser.
@@ -534,14 +532,15 @@ ratings |>
## Dynamic sites
-From time-to-time, you'll hit a site where `html_elements()` and friends don't return anything like what you see in the browser.
+So far we have focused on websites where `html_elements()` returns what you see in the browser and discussed how to parse what it returns and how to organize that information in tidy data frames.
+From time-to-time, however, you'll hit a site where `html_elements()` and friends don't return anything like what you see in the browser.
In many cases, that's because you're trying to scrape a website that dynamically generates the content of the page with javascript.
This doesn't currently work with rvest, because rvest downloads the raw HTML and doesn't run any javascript.
It's still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser including running all javascript.
This functionality is not available at the time of writing, but it's something we're actively working on and might be available by the time you read this.
It uses the [chromote package](https://rstudio.github.io/chromote/index.html) which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons.
-Check out the rvest website for more details.
+Check out the [rvest website](http://rvest.tidyverse.org/) for more details.
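Reviewer's sketch (not part of this commit): the claim that rvest reads the raw HTML and does not run JavaScript can be seen on a tiny, made-up page. The page below is hypothetical; its list is empty in the source and would only be filled in by the script when a browser executes it.

```r
library(rvest)

# Hypothetical page: the <ul> is empty in the raw HTML. The script would
# add one <li> to it, but only if something actually runs the script.
html <- minimal_html("
  <ul id='items'></ul>
  <script>
    var li = document.createElement('li');
    li.textContent = 'Added by JavaScript';
    document.getElementById('items').appendChild(li);
  </script>
")

# rvest parses the raw HTML and never executes the script,
# so the list has no <li> children to find.
items <- html_elements(html, "#items li")
length(items)
```

A browser would show one list item for the same page, which is the mismatch the paragraph above describes.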
## Summary
@@ -552,5 +551,5 @@ We then demonstrated web scraping with two case studies: a simpler scenario on s
Technical details of scraping data off the web can be complex, particularly when dealing with sites, however legal and ethical considerations can be even more complex.
It's important for you to educate yourself about both of these before setting out to scrape data.
-This brings us to the end of the wrangling part of the book where you've learned techniques to get data from where it lives (spreadsheets, databases, JSON files, and web sites) into a tidy form in R.
+This brings us to the end of the import part of the book where you've learned techniques to get data from where it lives (spreadsheets, databases, JSON files, and web sites) into a tidy form in R.
Now it's time to turn our sights to a new topic: making the most of R as a programming language.