From 8f7748dcb1bdac8b16dc7c8a8e4967a36a0ee0d7 Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Mon, 20 Jun 2022 10:40:11 -0500 Subject: [PATCH] Mild import/wrangling reorg --- .gitignore | 2 +- EDA.qmd | 2 +- _quarto.yml | 13 +++++----- data-import.qmd | 28 ++------------------- data-tidy.qmd | 2 +- import-databases.qmd => databases.qmd | 0 import-rectangular.qmd => parsing.qmd | 2 +- rectangle.qmd => rectangling.qmd | 6 ++--- import-spreadsheets.qmd => spreadsheets.qmd | 0 tidy.qmd | 28 --------------------- import-webscrape.qmd => webscraping.qmd | 0 import.qmd => wrangle.qmd | 14 ++++++++--- 12 files changed, 25 insertions(+), 72 deletions(-) rename import-databases.qmd => databases.qmd (100%) rename import-rectangular.qmd => parsing.qmd (99%) rename rectangle.qmd => rectangling.qmd (99%) rename import-spreadsheets.qmd => spreadsheets.qmd (100%) delete mode 100644 tidy.qmd rename import-webscrape.qmd => webscraping.qmd (100%) rename import.qmd => wrangle.qmd (57%) diff --git a/.gitignore b/.gitignore index c868bbd..657351e 100644 --- a/.gitignore +++ b/.gitignore @@ -14,5 +14,5 @@ libs _main.* tmp-pdfcrop-* figures - /.quarto/ +site_libs diff --git a/EDA.qmd b/EDA.qmd index 0f3be1d..74aa0e0 100644 --- a/EDA.qmd +++ b/EDA.qmd @@ -81,7 +81,7 @@ To make the discussion easier, let's define some terms: Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row. So far, all of the data that you've seen has been tidy. -In real-life, most data isn't tidy, so we'll come back to these ideas again in [Chapter -@sec-list-columns] and [Chapter -@sec-rectangle-data]. +In real-life, most data isn't tidy, so we'll come back to these ideas again in @sec-rectangling. 
## Variation diff --git a/_quarto.yml b/_quarto.yml index 44d2ca7..c5e7b6c 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -65,14 +65,13 @@ book: - missing-values.qmd - column-wise.qmd - - part: import.qmd + - part: wrangle.qmd chapters: - - import-rectangular.qmd - - import-spreadsheets.qmd - - import-databases.qmd - - rectangle.qmd - - import-webscrape.qmd - - import-other.qmd + - parsing.qmd + - spreadsheets.qmd + - databases.qmd + - rectangling.qmd + - webscraping.qmd - part: program.qmd chapters: diff --git a/data-import.qmd b/data-import.qmd index 7c7b905..f729a14 100644 --- a/data-import.qmd +++ b/data-import.qmd @@ -11,8 +11,7 @@ status("polishing") Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you'll learn how to read plain-text rectangular files into R. -Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data. -We'll finish with a few pointers to packages that are useful for other types of data. +Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data, which we'll come back to in @sec-wrangle. ### Prerequisites @@ -320,33 +319,10 @@ There are two alternatives: ``` Feather tends to be faster than RDS and is usable outside of R. -RDS supports list-columns (which you'll learn about in [Chapter -@sec-list-columns]; feather currently does not. +RDS supports list-columns (which you'll learn about in @sec-rectangling; feather currently does not. ```{r} #| include: false - file.remove("students-2.csv") file.remove("students.rds") ``` - -## Other types of data - -To get other types of data into R, we recommend starting with the tidyverse packages listed below. -They're certainly not perfect, but they are a good place to start. 
-For rectangular data: - -- **readxl** reads Excel files (both `.xls` and `.xlsx`). - See [Chapter -@sec-import-spreadsheets] for more on working with data stored in Excel spreadsheets. - -- **googlesheets4** reads Google Sheets. - Also see [Chapter -@sec-import-spreadsheets] for more on working with data stored in Google Sheets. - -- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame. - See [Chapter -@sec-import-databases] for more on working with databases . - -- **haven** reads SPSS, Stata, and SAS files. - -For hierarchical data: use **jsonlite** (by Jeroen Ooms) for json, and **xml2** for XML. -Jenny Bryan has some excellent worked examples at . - -For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package. diff --git a/data-tidy.qmd b/data-tidy.qmd index 21e4076..116be8e 100644 --- a/data-tidy.qmd +++ b/data-tidy.qmd @@ -557,7 +557,7 @@ df <- tribble( ) ``` -If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in [Chapter -@sec-list-columns]: +If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in @sec-rectangling: ```{r} df |> pivot_wider( diff --git a/import-databases.qmd b/databases.qmd similarity index 100% rename from import-databases.qmd rename to databases.qmd diff --git a/import-rectangular.qmd b/parsing.qmd similarity index 99% rename from import-rectangular.qmd rename to parsing.qmd index 3bf9ef1..65d13b5 100644 --- a/import-rectangular.qmd +++ b/parsing.qmd @@ -1,4 +1,4 @@ -# Rectangular data {#sec-import-rectangular} +# Parsing {#sec-import-rectangular} ```{r} #| results: "asis" diff --git a/rectangle.qmd b/rectangling.qmd similarity index 99% rename from rectangle.qmd rename to rectangling.qmd index 
abf3740..662e36e 100644 --- a/rectangle.qmd +++ b/rectangling.qmd @@ -1,4 +1,4 @@ -# Data rectangling {#sec-rectangle-data} +# Data rectangling {#sec-rectangling} ```{r} #| results: "asis" @@ -86,10 +86,10 @@ x5 <- list(1, list(2, list(3, list(4, list(5))))) str(x5) ``` -As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangle-1]. +As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1]. @fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting. -[^rectangle-1]: This is an RStudio feature. +[^rectangling-1]: This is an RStudio feature. ```{r} #| label: fig-view-collapsed diff --git a/import-spreadsheets.qmd b/spreadsheets.qmd similarity index 100% rename from import-spreadsheets.qmd rename to spreadsheets.qmd diff --git a/tidy.qmd b/tidy.qmd deleted file mode 100644 index c83bf4d..0000000 --- a/tidy.qmd +++ /dev/null @@ -1,28 +0,0 @@ -# Tidy {#sec-tidy-intro .unnumbered} - -```{r} -#| results: "asis" -#| echo: false -source("_common.R") -``` - -In this part of the book, you'll learn about data tidying, the art of getting your data into R in a useful form for visualization and modelling. -Data wrangling is very important: without it you can't work with your own data! -There are three main parts to data wrangling: - -```{r} -#| echo: false -#| out-width: "75%" - -knitr::include_graphics("diagrams/data-science-wrangle.png") -``` - - - -This part of the book proceeds as follows: - -- [Chapter -@sec-list-columns] will give you tools for working with list columns --- data stored in columns of a tibble as lists. 
- -- In [Chapter -@sec-rectangle-data], you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting. - - diff --git a/import-webscrape.qmd b/webscraping.qmd similarity index 100% rename from import-webscrape.qmd rename to webscraping.qmd diff --git a/import.qmd b/wrangle.qmd similarity index 57% rename from import.qmd rename to wrangle.qmd index c72ef7f..c06210f 100644 --- a/import.qmd +++ b/wrangle.qmd @@ -1,4 +1,4 @@ -# Wrangle {#sec-import-intro .unnumbered} +# Wrangle {#sec-wrangle .unnumbered} ```{r} #| results: "asis" @@ -14,14 +14,20 @@ But in more complex cases it encompasses both tidying and transformation as the This part of the book proceeds as follows: -- In @sec-import-rectangular, you'll learn how to get plain-text data in rectangular formats from disk and into R. +- In @sec-rectangling, you'll learn how to get plain-text data in rectangular formats from disk and into R. - In @sec-import-spreadsheets, you'll learn how to get data from Excel spreadsheets and Google Sheets into R. - In @sec-import-databases, you'll learn about getting data into R from databases. -- In @sec-rectangle-data, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created we your raw data is in JSON. +- In @sec-rectangling, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created when your raw data is in JSON. - In @sec-import-webscrape, you'll learn about harvesting data off the web and getting it into R. -- We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in @sec-import-other. +Some other types of data are not covered in this book: + +- **haven** reads SPSS, Stata, and SAS files. + +- **xml2** reads XML. + +For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.