diff --git a/DESCRIPTION b/DESCRIPTION
index 8b2ab43..6219bad 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -54,6 +54,7 @@ Remotes:
     tidyverse/dbplyr,
     tidyverse/stringr,
     tidyverse/tidyr,
+    tidyverse/purrr,
     jennybc/repurrrsive
 Encoding: UTF-8
 License: CC NC ND 3.0
diff --git a/data/gapminder.R b/data/gapminder.R
new file mode 100644
index 0000000..c2c2979
--- /dev/null
+++ b/data/gapminder.R
@@ -0,0 +1,20 @@
+library(tidyverse)
+
+repurrrsive::gap_simple |>
+  count(year)
+
+by_year <- repurrrsive::gap_simple |>
+  group_by(year)
+paths <- by_year |>
+  group_keys() |>
+  mutate(path = str_glue("data/gapminder/{year}.xlsx")) |>
+  pull()
+paths
+
+years <- by_year |>
+  group_split() |>
+  map(\(df) select(df, -year))
+
+dir.create("data/gapminder")
+
+walk2(years, paths, writexl::write_xlsx)
diff --git a/data/gapminder/1952.xlsx b/data/gapminder/1952.xlsx
new file mode 100644
index 0000000..7ce82a5
Binary files /dev/null and b/data/gapminder/1952.xlsx differ
diff --git a/data/gapminder/1957.xlsx b/data/gapminder/1957.xlsx
new file mode 100644
index 0000000..c909acd
Binary files /dev/null and b/data/gapminder/1957.xlsx differ
diff --git a/data/gapminder/1962.xlsx b/data/gapminder/1962.xlsx
new file mode 100644
index 0000000..621e4c6
Binary files /dev/null and b/data/gapminder/1962.xlsx differ
diff --git a/data/gapminder/1967.xlsx b/data/gapminder/1967.xlsx
new file mode 100644
index 0000000..337a45d
Binary files /dev/null and b/data/gapminder/1967.xlsx differ
diff --git a/data/gapminder/1972.xlsx b/data/gapminder/1972.xlsx
new file mode 100644
index 0000000..21f9de8
Binary files /dev/null and b/data/gapminder/1972.xlsx differ
diff --git a/data/gapminder/1977.xlsx b/data/gapminder/1977.xlsx
new file mode 100644
index 0000000..f71a9f5
Binary files /dev/null and b/data/gapminder/1977.xlsx differ
diff --git a/data/gapminder/1982.xlsx b/data/gapminder/1982.xlsx
new file mode 100644
index 0000000..0ff0eae
Binary files /dev/null and b/data/gapminder/1982.xlsx differ
diff --git a/data/gapminder/1987.xlsx b/data/gapminder/1987.xlsx
new file mode 100644
index 0000000..a0b10ce
Binary files /dev/null and b/data/gapminder/1987.xlsx differ
diff --git a/data/gapminder/1992.xlsx b/data/gapminder/1992.xlsx
new file mode 100644
index 0000000..6ae0e56
Binary files /dev/null and b/data/gapminder/1992.xlsx differ
diff --git a/data/gapminder/1997.xlsx b/data/gapminder/1997.xlsx
new file mode 100644
index 0000000..fe65170
Binary files /dev/null and b/data/gapminder/1997.xlsx differ
diff --git a/data/gapminder/2002.xlsx b/data/gapminder/2002.xlsx
new file mode 100644
index 0000000..f794a28
Binary files /dev/null and b/data/gapminder/2002.xlsx differ
diff --git a/data/gapminder/2007.xlsx b/data/gapminder/2007.xlsx
new file mode 100644
index 0000000..0601ec5
Binary files /dev/null and b/data/gapminder/2007.xlsx differ
diff --git a/iteration.qmd b/iteration.qmd
index 2d4d06c..5242585 100644
--- a/iteration.qmd
+++ b/iteration.qmd
@@ -49,8 +49,6 @@ library(tidyverse)
 
 ## Modifying multiple columns
 
-### Motivation
-
 Imagine you have this simple tibble:
 
 ```{r}
@@ -292,7 +290,7 @@ If needed, you could `pivot_wider()` this back to the original form.
 
 ## Reading multiple files
 
-Imagine you have a directory full of excel spreadsheets[^iteration-2] you want to read in.
+Imagine you have a directory full of Excel spreadsheets[^iteration-2] you want to read.
 You could do it with copy and paste:
 
 [^iteration-2]: If you instead had a directory of csv files with the same format, you can use `read_csv()` directly: `read_csv(c("data/y2019.xls", "data/y2020.xls", "data/y2021.xls", "data/y2020.xls").`
@@ -314,9 +312,8 @@ data <- bind_rows(data2019, data2020, data2021, data2022)
 
 But you can imagine that this would get tedious quickly, since often you won't have four files, but more like 400.
 In this section you'll first learn a little bit about the base `dir()` function which allows you to list all the files in a directory.
-And then about `map()` which lets you repeatedly apply a function to each element of a vector, allowing you to read many files in one step.
-
-`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
+Then you'll learn about `purrr::map()`, which lets you repeatedly apply a function to each element of a vector, allowing you to read many files in one step.
+Finally, we'll finish up with `purrr::list_rbind()`, which takes a list of data frames and combines them into a single data frame.
 
 ### Listing files in a directory
 
@@ -324,40 +321,128 @@ And then about `map()` which lets you repeatedly apply a function to each elemen
 Use `pattern`, a regular expression, to filter files.
 Always use `full.name`.
 
+Let's make this problem real with a folder of 12 Excel spreadsheets containing data from the gapminder package, which records information about multiple countries over time:
+
 ```{r}
-#| eval: false
-paths <- dir("data", pattern = "\\.xls$", full.names = TRUE)
+paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
+paths
 ```
 
 ### Basic pattern
 
-Two steps --- read every file into a list.
-Then join the pieces back into a data frame.
-Overall this framework is sometimes called split-apply-combine.
-You split the problem up into pieces (here paths), apply a function to each piece (read_csv), and then combine the pieces back together.
+Now that we have the paths, we want to call `read_excel()` with each path.
+Since in general we won't know how many files there are, instead of putting each individual data frame in its own variable, we'll save them all into a list:
 
 ```{r}
 #| eval: false
+list(
+  readxl::read_excel("data/gapminder/1952.xlsx"),
+  readxl::read_excel("data/gapminder/1957.xlsx"),
+  readxl::read_excel("data/gapminder/1962.xlsx"),
+  ...,
+  readxl::read_excel("data/gapminder/2007.xlsx")
+)
+```
+
+The shortcut for this is the `map()` function.
+`map(x, f)` is shorthand for:
+
+```{r}
+#| eval: false
+list(
+  f(x[[1]]),
+  f(x[[2]]),
+  ...,
+  f(x[[n]])
+)
+```
+
+`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
+
+We can use `map()` to get a list of data frames in one step with:
+
+```{r}
+files <- map(paths, readxl::read_excel)
+length(files)
+
+files[[1]]
+```
+
+(This is another data structure that doesn't display particularly compactly with `str()`, so you might want to load it into RStudio and inspect it with `View()`.)
+
+Now we can use `purrr::list_rbind()` to combine that list of data frames into a single data frame:
+
+```{r}
+list_rbind(files)
+```
+
+Or we could combine both steps in a single pipeline like this:
+
+```{r}
+#| results: false
 paths |>
-  map(\(path) readxl::read_excel(path)) |>
+  map(readxl::read_excel) |>
   list_rbind()
 ```
 
+What if we want to pass in extra arguments to `read_excel()`?
+We use the same trick that we used with `across()`.
+For example, it's often useful to peek at just the first few rows of the data:
+
+```{r}
+paths |>
+  map(\(path) readxl::read_excel(path, n_max = 1)) |>
+  list_rbind()
+```
+
+This really hammers home something that you might've noticed earlier: each individual sheet doesn't contain the year.
+That's only recorded in the path.
+
 ### Data in the path
 
-If the file name itself contains data, try:
+Sometimes the name of the file is itself data.
+In this example, the file name contains the year, which is not otherwise recorded in the individual data frames.
+To get that column into the final data frame, we need to do two things.
+
+First, we give the vector of paths names.
+The easiest way to do this is with the `set_names()` function, which can optionally take a function.
+Here we use `basename()` to extract just the file name from the full path:
+
+```{r}
+paths <- paths |> set_names(basename)
+paths
+```
+
+Those names are automatically carried along by all the map functions, so the list of data frames will have those same names:
 
 ```{r}
 #| eval: false
 paths |>
-  set_names(basename) |>
-  map(\(path) readxl::read_excel) |>
-  list_rbind(.id = "path")
+  map(readxl::read_excel) |>
+  names()
 ```
 
-You can then use `tidyr::separate_by()` and friends to turn into useful columns.
+Then we use the `names_to` argument to `list_rbind()` to tell it which column to save the names to:
 
-You can use `set_names(basename)` to just use the file name.
+```{r}
+paths |>
+  set_names(basename) |>
+  map(readxl::read_excel) |>
+  list_rbind(names_to = "year") |>
+  mutate(year = parse_number(year))
+```
+
+Here I used `readr::parse_number()` to turn year into a proper number.
+
+If the path contains more data, call `set_names()` without a function to set the names to the full path, and then use `tidyr::separate()` and friends to turn them into useful columns.
+
+```{r}
+paths |>
+  set_names() |>
+  map(readxl::read_excel) |>
+  list_rbind(names_to = "year") |>
+  separate(year, into = c(NA, "directory", "file", "ext"), sep = "[/.]")
+```
 
 ### Get to a single data frame as quickly as possible
 