Mild import/wrangling reorg
parent 23bfba6809
commit 8f7748dcb1
|
@ -14,5 +14,5 @@ libs
|
||||||
_main.*
|
_main.*
|
||||||
tmp-pdfcrop-*
|
tmp-pdfcrop-*
|
||||||
figures
|
figures
|
||||||
|
|
||||||
/.quarto/
|
/.quarto/
|
||||||
|
site_libs
|
||||||
|
|
2
EDA.qmd
|
@ -81,7 +81,7 @@ To make the discussion easier, let's define some terms:
|
||||||
Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.
|
Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.
|
||||||
|
|
||||||
So far, all of the data that you've seen has been tidy.
|
So far, all of the data that you've seen has been tidy.
|
||||||
In real-life, most data isn't tidy, so we'll come back to these ideas again in [Chapter -@sec-list-columns] and [Chapter -@sec-rectangle-data].
|
In real-life, most data isn't tidy, so we'll come back to these ideas again in @sec-rectangling.
|
||||||
|
|
||||||
## Variation
|
## Variation
|
||||||
|
|
||||||
|
|
13
_quarto.yml
|
@ -65,14 +65,13 @@ book:
|
||||||
- missing-values.qmd
|
- missing-values.qmd
|
||||||
- column-wise.qmd
|
- column-wise.qmd
|
||||||
|
|
||||||
- part: import.qmd
|
- part: wrangle.qmd
|
||||||
chapters:
|
chapters:
|
||||||
- import-rectangular.qmd
|
- parsing.qmd
|
||||||
- import-spreadsheets.qmd
|
- spreadsheets.qmd
|
||||||
- import-databases.qmd
|
- databases.qmd
|
||||||
- rectangle.qmd
|
- rectangling.qmd
|
||||||
- import-webscrape.qmd
|
- webscraping.qmd
|
||||||
- import-other.qmd
|
|
||||||
|
|
||||||
- part: program.qmd
|
- part: program.qmd
|
||||||
chapters:
|
chapters:
|
||||||
|
|
|
@ -11,8 +11,7 @@ status("polishing")
|
||||||
|
|
||||||
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
|
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
|
||||||
In this chapter, you'll learn how to read plain-text rectangular files into R.
|
In this chapter, you'll learn how to read plain-text rectangular files into R.
|
||||||
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data.
|
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we'll come back to in @sec-wrangle.
|
||||||
We'll finish with a few pointers to packages that are useful for other types of data.
|
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
|
@ -320,33 +319,10 @@ There are two alternatives:
|
||||||
```
|
```
|
||||||
|
|
||||||
Feather tends to be faster than RDS and is usable outside of R.
|
Feather tends to be faster than RDS and is usable outside of R.
|
||||||
RDS supports list-columns (which you'll learn about in [Chapter -@sec-list-columns]; feather currently does not.
|
RDS supports list-columns (which you'll learn about in @sec-rectangling; feather currently does not.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| include: false
|
#| include: false
|
||||||
|
|
||||||
file.remove("students-2.csv")
|
file.remove("students-2.csv")
|
||||||
file.remove("students.rds")
|
file.remove("students.rds")
|
||||||
```
|
```
|
||||||
|
|
||||||
## Other types of data
|
|
||||||
|
|
||||||
To get other types of data into R, we recommend starting with the tidyverse packages listed below.
|
|
||||||
They're certainly not perfect, but they are a good place to start.
|
|
||||||
For rectangular data:
|
|
||||||
|
|
||||||
- **readxl** reads Excel files (both `.xls` and `.xlsx`).
|
|
||||||
See [Chapter -@sec-import-spreadsheets] for more on working with data stored in Excel spreadsheets.
|
|
||||||
|
|
||||||
- **googlesheets4** reads Google Sheets.
|
|
||||||
Also see [Chapter -@sec-import-spreadsheets] for more on working with data stored in Google Sheets.
|
|
||||||
|
|
||||||
- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame.
|
|
||||||
See [Chapter -@sec-import-databases] for more on working with databases .
|
|
||||||
|
|
||||||
- **haven** reads SPSS, Stata, and SAS files.
|
|
||||||
|
|
||||||
For hierarchical data: use **jsonlite** (by Jeroen Ooms) for json, and **xml2** for XML.
|
|
||||||
Jenny Bryan has some excellent worked examples at <https://jennybc.github.io/purrr-tutorial/>.
|
|
||||||
|
|
||||||
For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.
|
|
||||||
|
|
|
@ -557,7 +557,7 @@ df <- tribble(
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in [Chapter -@sec-list-columns]:
|
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in @sec-rectangling:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
df |> pivot_wider(
|
df |> pivot_wider(
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Rectangular data {#sec-import-rectangular}
|
# Parsing {#sec-import-rectangular}
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| results: "asis"
|
#| results: "asis"
|
|
@ -1,4 +1,4 @@
|
||||||
# Data rectangling {#sec-rectangle-data}
|
# Data rectangling {#sec-rectangling}
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| results: "asis"
|
#| results: "asis"
|
||||||
|
@ -86,10 +86,10 @@ x5 <- list(1, list(2, list(3, list(4, list(5)))))
|
||||||
str(x5)
|
str(x5)
|
||||||
```
|
```
|
||||||
|
|
||||||
As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangle-1].
|
As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1].
|
||||||
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
|
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
|
||||||
|
|
||||||
[^rectangle-1]: This is an RStudio feature.
|
[^rectangling-1]: This is an RStudio feature.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| label: fig-view-collapsed
|
#| label: fig-view-collapsed
|
28
tidy.qmd
|
@ -1,28 +0,0 @@
|
||||||
# Tidy {#sec-tidy-intro .unnumbered}
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| results: "asis"
|
|
||||||
#| echo: false
|
|
||||||
source("_common.R")
|
|
||||||
```
|
|
||||||
|
|
||||||
In this part of the book, you'll learn about data tidying, the art of getting your data into R in a useful form for visualization and modelling.
|
|
||||||
Data wrangling is very important: without it you can't work with your own data!
|
|
||||||
There are three main parts to data wrangling:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| echo: false
|
|
||||||
#| out-width: "75%"
|
|
||||||
|
|
||||||
knitr::include_graphics("diagrams/data-science-wrangle.png")
|
|
||||||
```
|
|
||||||
|
|
||||||
<!--# TO DO: Redo the diagram without highlighting import. -->
|
|
||||||
|
|
||||||
This part of the book proceeds as follows:
|
|
||||||
|
|
||||||
- [Chapter -@sec-list-columns] will give you tools for working with list columns --- data stored in columns of a tibble as lists.
|
|
||||||
|
|
||||||
- In [Chapter -@sec-rectangle-data], you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting.
|
|
||||||
|
|
||||||
<!--# TO DO: Revisit bullet points about new chapters. -->
|
|
|
@ -1,4 +1,4 @@
|
||||||
# Wrangle {#sec-import-intro .unnumbered}
|
# Wrangle {#sec-wrangle .unnumbered}
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| results: "asis"
|
#| results: "asis"
|
||||||
|
@ -14,14 +14,20 @@ But in more complex cases it encompasses both tidying and transformation as the
|
||||||
|
|
||||||
This part of the book proceeds as follows:
|
This part of the book proceeds as follows:
|
||||||
|
|
||||||
- In @sec-import-rectangular, you'll learn how to get plain-text data in rectangular formats from disk and into R.
|
- In @sec-rectangling, you'll learn how to get plain-text data in rectangular formats from disk and into R.
|
||||||
|
|
||||||
- In @sec-import-spreadsheets, you'll learn how to get data from Excel spreadsheets and Google Sheets into R.
|
- In @sec-import-spreadsheets, you'll learn how to get data from Excel spreadsheets and Google Sheets into R.
|
||||||
|
|
||||||
- In @sec-import-databases, you'll learn about getting data into R from databases.
|
- In @sec-import-databases, you'll learn about getting data into R from databases.
|
||||||
|
|
||||||
- In @sec-rectangle-data, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created we your raw data is in JSON.
|
- In @sec-rectangling, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created we your raw data is in JSON.
|
||||||
|
|
||||||
- In @sec-import-webscrape, you'll learn about harvesting data off the web and getting it into R.
|
- In @sec-import-webscrape, you'll learn about harvesting data off the web and getting it into R.
|
||||||
|
|
||||||
- We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in @sec-import-other.
|
Some other types of data are not covered in this book:
|
||||||
|
|
||||||
|
- **haven** reads SPSS, Stata, and SAS files.
|
||||||
|
|
||||||
|
- xml2 for **xml2** for XML
|
||||||
|
|
||||||
|
For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.
|