Fleshing out each file section

This commit is contained in:
Hadley Wickham 2022-09-17 10:58:18 -05:00
parent d84c4a3731
commit c5a81b92ba
15 changed files with 125 additions and 19 deletions


@ -54,6 +54,7 @@ Remotes:
tidyverse/dbplyr,
tidyverse/stringr,
tidyverse/tidyr,
tidyverse/purrr,
jennybc/repurrrsive
Encoding: UTF-8
License: CC NC ND 3.0

20
data/gapminder.R Normal file

@ -0,0 +1,20 @@
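library(tidyverse)

# Check how many rows there are per year
repurrrsive::gap_simple |>
  count(year)

# One group per year
by_year <- repurrrsive::gap_simple |>
  group_by(year)

# One output path per year, in the same order as the groups
paths <- by_year |>
  group_keys() |>
  mutate(path = str_glue("data/gapminder/{year}.xlsx")) |>
  pull(path)
paths

# One data frame per year; drop year because it is encoded in the file name
years <- by_year |>
  group_split() |>
  map(\(df) select(df, -year))

dir.create("data/gapminder")
walk2(years, paths, writexl::write_xlsx)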

BIN
data/gapminder/1952.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/1957.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/1962.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/1967.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/1972.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/1977.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/1982.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/1987.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/1992.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/1997.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/2002.xlsx Normal file

Binary file not shown.

BIN
data/gapminder/2007.xlsx Normal file

Binary file not shown.


@ -49,8 +49,6 @@ library(tidyverse)
## Modifying multiple columns
### Motivation
Imagine you have this simple tibble:
```{r}
@ -292,7 +290,7 @@ If needed, you could `pivot_wider()` this back to the original form.
## Reading multiple files
Imagine you have a directory full of Excel spreadsheets[^iteration-2] you want to read in.
Imagine you have a directory full of Excel spreadsheets[^iteration-2] you want to read.
You could do it with copy and paste:
[^iteration-2]: If you instead had a directory of csv files with the same format, you can use `read_csv()` directly: `read_csv(c("data/y2019.csv", "data/y2020.csv", "data/y2021.csv", "data/y2022.csv"))`.
@ -314,9 +312,8 @@ data <- bind_rows(data2019, data2020, data2021, data2022)
But you can imagine that this would get tedious quickly, since often you won't have four files, but more like 400.
In this section you'll first learn a little bit about the base `dir()` function which allows you to list all the files in a directory.
Then you'll learn about `map()`, which lets you repeatedly apply a function to each element of a vector, allowing you to read many files in one step.
`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
Then you'll learn about `purrr::map()`, which lets you repeatedly apply a function to each element of a vector, allowing you to read many files in one step.
Finally, we'll finish up with `purrr::list_rbind()`, which takes a list of data frames and combines them into a single data frame.
### Listing files in a directory
@ -324,40 +321,128 @@ And then about `map()` which lets you repeatedly apply a function to each elemen
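Use `pattern`, a regular expression, to filter files.
Always use `full.names = TRUE`, so that each result includes the directory and can be passed directly to a reading function.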
Let's make this problem real with a folder of 12 Excel spreadsheets containing data from the gapminder package, which records information about multiple countries over time:
```{r}
#| eval: false
paths <- dir("data", pattern = "\\.xls$", full.names = TRUE)
paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
paths
```
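As a minimal sketch of how `pattern` and `full.names` interact, the following builds a throwaway directory and lists it two ways. The directory and file names are made up for illustration and are not part of the gapminder data.

```{r}
# Hypothetical scratch directory with made-up file names, for illustration only
tmp <- file.path(tempdir(), "dir-demo")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(tmp, c("1952.xlsx", "1957.xlsx", "notes.txt"))))

# Without full.names you get bare file names, which only resolve
# if `tmp` also happens to be the working directory
bare <- dir(tmp, pattern = "\\.xlsx$")

# With full.names = TRUE each result is prefixed with the directory,
# so it can be passed straight to a reading function
full <- dir(tmp, pattern = "\\.xlsx$", full.names = TRUE)

bare
full
```

The pattern `"\\.xlsx$"` escapes the dot and anchors the match at the end of the name, so `notes.txt` is skipped.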
### Basic pattern
Two steps --- read every file into a list.
Then join the pieces back into a data frame.
Overall this framework is sometimes called split-apply-combine.
You split the problem up into pieces (here, the paths), apply a function to each piece (here, `read_excel()`), and then combine the pieces back together.
Now that we have the paths, we want to call `read_excel()` with each path.
Since in general we won't know how many elements there are, instead of putting each individual data frame in its own variable, we'll save them all into a list:
```{r}
#| eval: false
list(
readxl::read_excel("data/gapminder/1952.xlsx"),
readxl::read_excel("data/gapminder/1957.xlsx"),
readxl::read_excel("data/gapminder/1962.xlsx"),
...,
readxl::read_excel("data/gapminder/2007.xlsx")
)
```
The shortcut for this is the `map()` function.
`map(x, f)` is shorthand for:
```{r}
#| eval: false
list(
f(x[[1]]),
f(x[[2]]),
...,
f(x[[n]])
)
```
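To make the equivalence concrete, here is a small sketch that uses `sqrt()` as a stand-in for `f`. The input values are arbitrary and chosen only for illustration.

```{r}
library(purrr)

x <- c(1, 4, 9)

# Spelled out by hand: one call per element, collected in a list
by_hand <- list(sqrt(x[[1]]), sqrt(x[[2]]), sqrt(x[[3]]))

# The same thing with map(); this works for any length of x
with_map <- map(x, sqrt)

identical(by_hand, with_map)
```

The hand-written version has to change whenever the length of `x` changes, while the `map()` call does not.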
`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
We can use `map()` to get a list of data frames in one step with:
```{r}
files <- map(paths, readxl::read_excel)
length(files)
files[[1]]
```
(This is another data structure that doesn't display particularly compactly with `str()`, so you might want to load it into RStudio and inspect it with `View()`.)
Now we can use `purrr::list_rbind()` to combine that list of data frames into a single data frame:
```{r}
list_rbind(files)
```
Or we could combine in a single pipeline like this:
```{r}
#| results: false
paths |>
map(\(path) readxl::read_excel(path)) |>
map(readxl::read_excel) |>
list_rbind()
```
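To see what the combine step does without needing any files on disk, here is a minimal sketch that uses two made-up data frames in place of the spreadsheets. The column names and values are hypothetical.

```{r}
library(purrr)

# Two made-up data frames standing in for the contents of two files
pieces <- list(
  data.frame(country = c("A", "B"), pop = c(10, 20)),
  data.frame(country = c("A", "B"), pop = c(11, 22))
)

# Stack the pieces row-wise into one data frame
combined <- list_rbind(pieces)

nrow(combined)
combined
```

Note that nothing in `combined` records which piece each row came from.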
What if we want to pass in extra arguments to `read_excel()`?
We use the same trick that we used with `across()`: wrap the call in an anonymous function.
For example, it's often useful to peek at just the first few rows of the data:
```{r}
paths |>
map(\(path) readxl::read_excel(path, n_max = 1)) |>
list_rbind()
```
This really hammers in something that you might've noticed earlier: each individual sheet doesn't contain the year.
That's only recorded in the path.
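The anonymous-function pattern applies to any function and any extra argument, not just `read_excel()` and `n_max`. As a sketch, assuming two made-up data frames and base R's `head()`:

```{r}
library(purrr)

# Made-up data frames, for illustration only
pieces <- list(
  data.frame(x = 1:3),
  data.frame(x = 4:6)
)

# Passing the function by name calls it with its default arguments
all_rows <- map(pieces, head)

# Wrapping it in an anonymous function lets you supply extra arguments
first_rows <- map(pieces, \(df) head(df, n = 1))
first_rows
```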
### Data in the path
If the file name itself contains data, try:
Sometimes the name of the file is itself data.
In this example, the file name contains the year, which is not otherwise recorded in the individual data frames.
To get that column into the final data frame, we need to do two things.
First, we name the elements of the paths vector.
The easiest way to do this is with the `set_names()` function, which can optionally take a function.
Here we use `basename()` to extract just the file name from the full path:
```{r}
paths <- paths |> set_names(basename)
paths
```
Those names are automatically carried along by all the map functions, so the list of data frames will have those same names:
```{r}
#| eval: false
paths |>
set_names(basename) |>
map(\(path) readxl::read_excel) |>
list_rbind(.id = "path")
map(readxl::read_excel) |>
names()
```
You can then use `tidyr::separate()` and friends to turn into useful columns.
Then we use the `names_to` argument to `list_rbind()` to tell it which column to save the names to:
You can use `set_names(basename)` to just use the file name.
```{r}
paths |>
set_names(basename) |>
map(readxl::read_excel) |>
list_rbind(names_to = "year") |>
mutate(year = parse_number(year))
```
Here I used `readr::parse_number()` to turn year into a proper number.
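The flow of names from the paths into a column can be traced without reading any files. In this sketch, `fake_read()` is a hypothetical stand-in for `readxl::read_excel()` that returns a tiny made-up data frame, and the paths are only used as strings.

```{r}
library(purrr)

# Hypothetical paths; no files are read in this sketch
paths <- c("data/gapminder/1952.xlsx", "data/gapminder/1957.xlsx")

# Hypothetical stand-in for readxl::read_excel()
fake_read <- function(path) data.frame(country = c("A", "B"))

# Names set on the input are carried through map() to the output list
named <- paths |>
  set_names(basename) |>
  map(fake_read)
names(named)

# names_to stores each list element's name in a new column
combined <- list_rbind(named, names_to = "year")
combined$year <- readr::parse_number(combined$year)
combined
```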
If the path contains more data, use `paths <- paths |> set_names()` to set the names to the full path, and then use `tidyr::separate()` and friends to turn them into useful columns.
```{r}
paths |>
set_names() |>
map(readxl::read_excel) |>
list_rbind(names_to = "year") |>
separate(year, into = c(NA, "directory", "file", "ext"), sep = "[/.]")
```
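To see how the `sep` and `into` arguments divide up a path, here is a minimal sketch that applies the same `separate()` call to a single hypothetical path held in a one-row data frame:

```{r}
library(tidyr)

# A one-row data frame holding a hypothetical full path
df <- data.frame(path = "data/gapminder/1952.xlsx")

# Split on "/" and "."; the NA in `into` drops the first piece ("data")
parts <- separate(
  df, path,
  into = c(NA, "directory", "file", "ext"),
  sep = "[/.]"
)
parts
```

Because `sep` is a regular expression, this approach assumes that directory and file names contain no extra dots; a path with more or fewer pieces than `into` expects would trigger a warning.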
### Get to a single data frame as quickly as possible