Small typos on webscraping chapter (#1381)

alberto-agudo 2023-03-22 00:26:28 +01:00 committed by GitHub
parent e119132cb4
commit ab0c1c44ac
1 changed file with 2 additions and 2 deletions

@@ -70,7 +70,7 @@ Note, however, the situation is rather different in Europe where courts have fou
### Personally identifiable information
Even if the data is public, you should be extremely careful about scraping personally identifiable information like names, email addresses, phone numbers, dates of birth, etc.
-Europe has particularly strict laws about the collection of storage of such data ([GDPR](https://gdpr-info.eu/)), and regardless of where you live you're likely to be entering an ethical quagmire.
+Europe has particularly strict laws about the collection or storage of such data ([GDPR](https://gdpr-info.eu/)), and regardless of where you live you're likely to be entering an ethical quagmire.
For example, in 2016, a group of researchers scraped public profile information (e.g. usernames, age, gender, location, etc.) about 70,000 people on the dating site OkCupid and they publicly released these data without any attempts for anonymization.
While the researchers felt that there was nothing wrong with this since the data were already public, this work was widely condemned due to ethics concerns around identifiability of users whose information was released in the dataset.
If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study[^webscraping-4] as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.
@@ -81,7 +81,7 @@ If your work involves scraping personally identifiable information, we strongly
Finally, you also need to worry about copyright law.
Copyright law is complicated, but it's worth taking a look at the [US law](https://www.law.cornell.edu/uscode/text/17/102) which describes exactly what's protected: "\[...\] original works of authorship fixed in any tangible medium of expression, \[...\]".
-It then goes on to describe specific categories that it applies like literary works, musical works, motions pictures and more.
+It then goes on to describe specific categories that it applies like literary works, musical works, motion pictures and more.
Notably absent from copyright protection are data.
This means that as long as you limit your scraping to facts, copyright protection does not apply.
(But note that Europe has a separate "[sui generis](https://en.wikipedia.org/wiki/Database_right)" right that protects databases.)