Mild import/wrangling reorg

Hadley Wickham 2022-06-20 10:40:11 -05:00
parent 23bfba6809
commit 8f7748dcb1
12 changed files with 25 additions and 72 deletions

.gitignore

@@ -14,5 +14,5 @@ libs
_main.*
tmp-pdfcrop-*
figures
/.quarto/
site_libs


@@ -81,7 +81,7 @@ To make the discussion easier, let's define some terms:
Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.
So far, all of the data that you've seen has been tidy.
In real life, most data isn't tidy, so we'll come back to these ideas again in [Chapter -@sec-list-columns] and [Chapter -@sec-rectangle-data].
In real life, most data isn't tidy, so we'll come back to these ideas again in @sec-rectangling.
## Variation


@@ -65,14 +65,13 @@ book:
- missing-values.qmd
- column-wise.qmd
- part: import.qmd
- part: wrangle.qmd
chapters:
- import-rectangular.qmd
- import-spreadsheets.qmd
- import-databases.qmd
- rectangle.qmd
- import-webscrape.qmd
- import-other.qmd
- parsing.qmd
- spreadsheets.qmd
- databases.qmd
- rectangling.qmd
- webscraping.qmd
- part: program.qmd
chapters:


@@ -11,8 +11,7 @@ status("polishing")
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
In this chapter, you'll learn how to read plain-text rectangular files into R.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data.
We'll finish with a few pointers to packages that are useful for other types of data.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we'll come back to in @sec-wrangle.
### Prerequisites
@@ -320,33 +319,10 @@ There are two alternatives:
```
Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in [Chapter -@sec-list-columns]); feather currently does not.
RDS supports list-columns (which you'll learn about in @sec-rectangling); feather currently does not.
```{r}
#| include: false
file.remove("students-2.csv")
file.remove("students.rds")
```
## Other types of data
To get other types of data into R, we recommend starting with the tidyverse packages listed below.
They're certainly not perfect, but they are a good place to start.
For rectangular data:
- **readxl** reads Excel files (both `.xls` and `.xlsx`).
See [Chapter -@sec-import-spreadsheets] for more on working with data stored in Excel spreadsheets.
- **googlesheets4** reads Google Sheets.
Also see [Chapter -@sec-import-spreadsheets] for more on working with data stored in Google Sheets.
- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame.
See [Chapter -@sec-import-databases] for more on working with databases.
- **haven** reads SPSS, Stata, and SAS files.
For hierarchical data: use **jsonlite** (by Jeroen Ooms) for json, and **xml2** for XML.
Jenny Bryan has some excellent worked examples at <https://jennybc.github.io/purrr-tutorial/>.
For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.


@@ -557,7 +557,7 @@ df <- tribble(
)
```
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in [Chapter -@sec-list-columns]:
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in @sec-rectangling:
```{r}
df |> pivot_wider(


@@ -1,4 +1,4 @@
# Rectangular data {#sec-import-rectangular}
# Parsing {#sec-import-rectangular}
```{r}
#| results: "asis"


@@ -1,4 +1,4 @@
# Data rectangling {#sec-rectangle-data}
# Data rectangling {#sec-rectangling}
```{r}
#| results: "asis"
@@ -86,10 +86,10 @@ x5 <- list(1, list(2, list(3, list(4, list(5)))))
str(x5)
```
As lists get even larger and more complex, even `str()` starts to fail, and you'll need to switch to `View()`[^rectangle-1].
As lists get even larger and more complex, even `str()` starts to fail, and you'll need to switch to `View()`[^rectangling-1].
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
[^rectangle-1]: This is an RStudio feature.
[^rectangling-1]: This is an RStudio feature.
```{r}
#| label: fig-view-collapsed


@@ -1,28 +0,0 @@
# Tidy {#sec-tidy-intro .unnumbered}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
In this part of the book, you'll learn about data tidying, the art of getting your data into R in a useful form for visualization and modelling.
Data wrangling is very important: without it you can't work with your own data!
There are three main parts to data wrangling:
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("diagrams/data-science-wrangle.png")
```
<!--# TO DO: Redo the diagram without highlighting import. -->
This part of the book proceeds as follows:
- [Chapter -@sec-list-columns] will give you tools for working with list columns --- data stored in columns of a tibble as lists.
- In [Chapter -@sec-rectangle-data], you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting.
<!--# TO DO: Revisit bullet points about new chapters. -->


@@ -1,4 +1,4 @@
# Wrangle {#sec-import-intro .unnumbered}
# Wrangle {#sec-wrangle .unnumbered}
```{r}
#| results: "asis"
@@ -14,14 +14,20 @@ But in more complex cases it encompasses both tidying and transformation as the
This part of the book proceeds as follows:
- In @sec-import-rectangular, you'll learn how to get plain-text data in rectangular formats from disk and into R.
- In @sec-rectangling, you'll learn how to get plain-text data in rectangular formats from disk and into R.
- In @sec-import-spreadsheets, you'll learn how to get data from Excel spreadsheets and Google Sheets into R.
- In @sec-import-databases, you'll learn about getting data into R from databases.
- In @sec-rectangle-data, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created when your raw data is in JSON.
- In @sec-rectangling, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created when your raw data is in JSON.
- In @sec-import-webscrape, you'll learn about harvesting data off the web and getting it into R.
- We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in @sec-import-other.
Some other types of data are not covered in this book:
- **haven** reads SPSS, Stata, and SAS files.
- **xml2** reads XML.
For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.