Anchor link instead of footnote
parent
c8b6ec6d96
commit
77c95b7c0c
|
@@ -60,22 +60,20 @@ If you look closely, you'll find many websites include a "terms and conditions"
|
|||
These pages tend to be a legal land grab where companies make very broad claims.
|
||||
It's polite to respect these terms of service where possible, but take any claims with a grain of salt.
|
||||
|
||||
US courts[^webscraping-3] have generally found that simply putting the terms of service in the footer of the website isn't sufficient for you to be bound by them.
|
||||
US courts have generally found that simply putting the terms of service in the footer of the website isn't sufficient for you to be bound by them, e.g., [HiQ Labs v. LinkedIn](https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn).
|
||||
Generally, to be bound to the terms of service, you must have taken some explicit action like creating an account or checking a box.
|
||||
This is why whether or not the data is **public** is important; if you don't need an account to access them, it is unlikely that you are bound to the terms of service.
|
||||
Note, however, the situation is rather different in Europe where courts have found that terms of service are enforceable even if you don't explicitly agree to them.
|
||||
|
||||
[^webscraping-3]: e.g., <https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn>
|
||||
|
||||
### Personally identifiable information
|
||||
|
||||
Even if the data is public, you should be extremely careful about scraping personally identifiable information like names, email addresses, phone numbers, dates of birth, etc.
|
||||
Europe has particularly strict laws about the collection or storage of such data ([GDPR](https://gdpr-info.eu/)), and regardless of where you live you're likely to be entering an ethical quagmire.
|
||||
For example, in 2016, a group of researchers scraped public profile information (e.g., usernames, age, gender, location, etc.) about 70,000 people on the dating site OkCupid and publicly released these data without any attempt at anonymization.
|
||||
While the researchers felt that there was nothing wrong with this since the data were already public, this work was widely condemned due to ethics concerns around identifiability of users whose information was released in the dataset.
|
||||
If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study[^webscraping-4] as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.
|
||||
If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study[^webscraping-3] as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.
|
||||
|
||||
[^webscraping-4]: One example of an article on the OkCupid study was published by Wired, <https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science>.
|
||||
[^webscraping-3]: One example of an article on the OkCupid study was published by Wired, <https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science>.
|
||||
|
||||
### Copyright
|
||||
|
||||
|
@@ -111,9 +109,9 @@ HTML stands for **H**yper**T**ext **M**arkup **L**anguage and looks something li
|
|||
</body>
|
||||
```
|
||||
|
||||
HTML has a hierarchical structure formed by **elements** which consist of a start tag (e.g., `<tag>`), optional **attributes** (`id='first'`), an end tag[^webscraping-5] (like `</tag>`), and **contents** (everything in between the start and end tag).
|
||||
HTML has a hierarchical structure formed by **elements** which consist of a start tag (e.g., `<tag>`), optional **attributes** (`id='first'`), an end tag[^webscraping-4] (like `</tag>`), and **contents** (everything in between the start and end tag).
|
||||
|
||||
[^webscraping-5]: A number of tags (including `<p>` and `<li>`) don't require end tags, but we think it's best to include them because it makes seeing the structure of the HTML a little easier.
|
||||
[^webscraping-4]: A number of tags (including `<p>` and `<li>`) don't require end tags, but we think it's best to include them because it makes seeing the structure of the HTML a little easier.
|
||||
|
||||
Since `<` and `>` are used for start and end tags, you can't write them directly.
|
||||
Instead you have to use the HTML **escapes** `&gt;` (greater than) and `&lt;` (less than).
|
||||
|
@@ -160,9 +158,9 @@ Attributes are also used to record the destination of links (the `href` attribut
|
|||
|
||||
To get started scraping, you'll need the URL of the page you want to scrape, which you can usually copy from your web browser.
|
||||
You'll then need to read the HTML for that page into R with `read_html()`.
|
||||
This returns an `xml_document`[^webscraping-6] object which you'll then manipulate using rvest functions:
|
||||
This returns an `xml_document`[^webscraping-5] object which you'll then manipulate using rvest functions:
|
||||
|
||||
[^webscraping-6]: This class comes from the [xml2](https://xml2.r-lib.org) package.
|
||||
[^webscraping-5]: This class comes from the [xml2](https://xml2.r-lib.org) package.
|
||||
xml2 is a low-level package that rvest builds on top of.
|
||||
|
||||
```{r}
|
||||
|
@@ -285,9 +283,9 @@ Now that you've selected the elements of interest, you'll need to extract the da
|
|||
|
||||
### Text and attributes
|
||||
|
||||
`html_text2()`[^webscraping-7] extracts the plain text contents of an HTML element:
|
||||
`html_text2()`[^webscraping-6] extracts the plain text contents of an HTML element:
|
||||
|
||||
[^webscraping-7]: rvest also provides `html_text()` but you should almost always use `html_text2()` since it does a better job of converting nested HTML to text.
|
||||
[^webscraping-6]: rvest also provides `html_text()` but you should almost always use `html_text2()` since it does a better job of converting nested HTML to text.
|
||||
|
||||
```{r}
|
||||
characters |>
|
||||
|
|