From 281005a31ce1bc5ccac12cb2e9e14b2986c69885 Mon Sep 17 00:00:00 2001 From: Danielle Navarro Date: Thu, 8 Dec 2022 12:43:11 +1100 Subject: [PATCH] Adds chapter about arrow (#1137) Co-authored-by: Neal Richardson Co-authored-by: Hadley Wickham Co-authored-by: Mine Cetinkaya-Rundel --- .gitignore | 4 + DESCRIPTION | 1 + _freeze/arrow/execute-results/html.json | 14 ++ _quarto.yml | 1 + arrow.qmd | 291 ++++++++++++++++++++++++ data-import.qmd | 7 +- databases.qmd | 8 +- wrangle.qmd | 13 +- 8 files changed, 328 insertions(+), 11 deletions(-) create mode 100644 _freeze/arrow/execute-results/html.json create mode 100644 arrow.qmd diff --git a/.gitignore b/.gitignore index fd1091b..ab63e0c 100644 --- a/.gitignore +++ b/.gitignore @@ -16,3 +16,7 @@ tmp-pdfcrop-* figures /.quarto/ site_libs +/data/seattle-library-checkouts.csv +/data/seattle-library-checkouts.parquet +/data/seattle-library-checkouts + diff --git a/DESCRIPTION b/DESCRIPTION index 747f378..424d95f 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -10,6 +10,7 @@ URL: https://github.com/hadley/r4ds Depends: R (>= 3.1.0) Imports: + arrow, babynames, dbplyr, dplyr, diff --git a/_freeze/arrow/execute-results/html.json b/_freeze/arrow/execute-results/html.json new file mode 100644 index 0000000..5c2ee21 --- /dev/null +++ b/_freeze/arrow/execute-results/html.json @@ -0,0 +1,14 @@ +{ + "hash": "1095f33fdacab861f9d700db0157b5a7", + "result": { + "markdown": "---\nfreeze: true\n---\n\n\n# Arrow {#sec-arrow}\n\n\n\n:::: status\n::: callout-note \nYou are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at .\n:::\n::::\n\n\n## Introduction\n\nCSV files are designed to be easily read by humans.\nThey're a good interchange format because they're very simple and they can be read by every tool under the sun.\nBut CSV files aren't very efficient: you have to do quite a lot of work to read the data into R.\nIn this chapter, you'll learn about a powerful alternative: the [parquet format](https://parquet.apache.org/), an open standards-based format widely used by big data systems.\n\nWe'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large data sets.\nWe'll use Apache Arrow via the the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.\nAs an additional benefit, arrow is extremely fast: you'll see some examples later in the chapter.\n\nBoth arrow and dbplyr provide dplyr backends, so you might wonder when to use each.\nIn many cases, the choice is made for you, as in the data is already in a database or in parquet files, and you'll want to work with it as is.\nBut if you're starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet.\nIn general, it's hard to know what will work best, so in the early stages of your analysis we'd encourage you to try both and pick the one that works the best for you.\n\n### Prerequisites\n\nIn this chapter, we'll continue to use the tidyverse, particularly dplyr, but we'll pair it with the arrow package which is designed specifically for working with large data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(arrow)\n```\n:::\n\n\nLater in the chapter, we'll also see some connections between arrow and duckdb, so we'll 
also need dbplyr and duckdb.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dbplyr, warn.conflicts = FALSE)\nlibrary(duckdb)\n#> Loading required package: DBI\n```\n:::\n\n\n## Getting the data\n\nWe begin by getting a dataset worthy of these tools: a data set of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).\nThis dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2015 to October 2022.\n\nThe following code will get you a cached copy of the data.\nThe data is a 9GB CSV file, so it will take some time to download: simply getting the data is often the first challenge!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndir.create(\"data\", showWarnings = FALSE)\nurl <- \"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv\"\n\n# Default timeout is 60s; bump it up to an hour\noptions(timeout = 60 * 60)\ndownload.file(url, \"data/seattle-library-checkouts.csv\")\n```\n:::\n\n\n## Opening a dataset\n\nLet's start by taking a look at the data.\nAt 9GB, this file is large enough that we probably don't want to load the whole thing into memory.\nA good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16 Gb.\nThis means we want to avoid `read_csv()` and instead use the `arrow::open_dataset()`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# partial schema for ISBN column only\nopts <- CsvConvertOptions$create(col_types = schema(ISBN = string()))\n\nseattle_csv <- open_dataset(\n sources = \"data/seattle-library-checkouts.csv\", \n format = \"csv\",\n convert_options = opts\n)\n```\n:::\n\n\n(Here we've had to use some relatively advanced code to parse the ISBN variable correctly: this is because the first \\~83,000 rows don't contain any data so arrow guesses the wrong types. 
The arrow team is aware of this problem and there will hopefully be a better approach by the time you read this chapter.)\n\nWhat happens when this code is run?\n`open_dataset()` will scan a few thousand rows to figure out the structure of the data set.\nThen it records what it's found and stops; it will only read further rows as you specifically request them.\nThis metadata is what we see if we print `seattle_csv`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv\n#> FileSystemDataset with 1 csv file\n#> UsageClass: string\n#> CheckoutType: string\n#> MaterialType: string\n#> CheckoutYear: int64\n#> CheckoutMonth: int64\n#> Checkouts: int64\n#> Title: string\n#> ISBN: string\n#> Creator: string\n#> Subjects: string\n#> Publisher: string\n#> PublicationYear: string\n```\n:::\n\n\nThe first line in the output tells you that `seattle_csv` is stored locally on-disk as a single CSV file; it will only be loaded into memory as needed.\nThe remainder of the output tells you the column type that arrow has imputed for each column.\n\nWe can see what's actually in with `glimpse()`.\nThis reveals that there are \\~41 million rows and 12 columns, and shows us a few values.\n\n\n::: {.cell hash='arrow_cache/html/glimpse-data_07c924738790eb185ebdd8973443e90d'}\n\n```{.r .cell-code}\nseattle_csv |> glimpse()\n#> FileSystemDataset with 1 csv file\n#> 41,389,465 rows x 12 columns\n#> $ UsageClass \"Physical\", \"Physical\", \"Digital\", \"Physical\", \"Ph…\n#> $ CheckoutType \"Horizon\", \"Horizon\", \"OverDrive\", \"Horizon\", \"Hor…\n#> $ MaterialType \"BOOK\", \"BOOK\", \"EBOOK\", \"BOOK\", \"SOUNDDISC\", \"BOO…\n#> $ CheckoutYear 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…\n#> $ CheckoutMonth 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…\n#> $ Checkouts 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…\n#> $ Title \"Super rich : a guide to having it all / Russell S…\n#> $ ISBN \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\"…\n#> $ Creator \"Simmons, Russell\", \"Barclay, James, 1965-\", \"Tim …\n#> $ Subjects \"Self realization, Conduct of life, Attitude Psych…\n#> $ Publisher \"Gotham Books,\", \"Pyr,\", \"Random House, Inc.\", \"Di…\n#> $ PublicationYear \"c2011.\", \"2010.\", \"2015\", \"2005.\", \"c2004.\", \"c20…\n```\n:::\n\n\nWe can start to use this dataset with dplyr verbs, using `collect()` to force arrow to perform the computation and return some data.\nFor example, this code tells us the total number of checkouts per year:\n\n\n::: {.cell hash='arrow_cache/html/unnamed-chunk-5_7a5e1ce0bed4d69e849dff75d0c0d8d3'}\n\n```{.r .cell-code}\nseattle_csv |> \n count(CheckoutYear, wt = Checkouts) |> \n arrange(CheckoutYear) |> \n collect()\n#> # A tibble: 18 × 2\n#> CheckoutYear n\n#> \n#> 1 2005 3798685\n#> 2 2006 6599318\n#> 3 2007 7126627\n#> 4 2008 8438486\n#> 5 2009 9135167\n#> 6 2010 8608966\n#> # … with 12 more rows\n```\n:::\n\n\nThanks to arrow, this code will work regardless of how large the underlying dataset is.\nBut it's currently rather slow: on Hadley's computer, it took \\~10s to run.\nThat's not terrible given how much data we have, but we can make it much faster by switching to a better format.\n\n## The parquet format\n\nTo make this data easier to work with, lets switch to the parquet file format and split it up into multiple files.\nThe following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.\n\n### Advantages of parquet\n\nLike CSV, parquet is used for 
rectangular data, but instead of being a text format that you can read with any file editor, it's a custom binary format designed specifically for the needs of big data.\nThis means that:\n\n- Parquet files are usually smaller the equivalent CSV file.\n Parquet relies on [efficient encodings](https://parquet.apache.org/docs/file-format/data-pages/encodings/) to keep file size down, and supports file compression.\n This helps make parquet files fast because there's less data to move from disk to memory.\n\n- Parquet files have a rich type system.\n As we talked about in @sec-col-types, a CSV file does not provide any information about column types.\n For example, a CSV reader has to guess whether `\"08-10-2022\"` should be parsed as a string or a date.\n In contrast, parquet files store data in a way that records the type along with the data.\n\n- Parquet files are \"column-oriented\".\n This means that they're organised column-by-column, much like R's data frame.\n This typically leads to better performance for data analysis tasks compared to CSV files, which are organised row-by-row.\n\n- Parquet files are \"chunked\", which makes it possible to work on different parts of the file at the same time, and, if you're lucky, to skip some chunks all together.\n\n### Partitioning\n\nAs datasets get larger and larger, storing all the data in a single file gets increasingly painful and it's often useful to split large datasets across many files.\nWhen this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.\n\nThere are no hard and fast rules about how to partition your data set: the results will depend on your data, access patterns, and the systems that read the data.\nYou're likely to need to do some experimentation before you find the ideal partitioning for your situation.\nAs a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files.\nYou should also try to partition by variables that you filter by; as you'll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.\n\n### Rewriting the Seattle library data\n\nLet's apply these ideas to the Seattle library data to see how they play out in practice.\nWe're going to partition by `CheckoutYear`, since it's likely some analyses will only want to look at recent data and partitioning by year yields 18 chunks of a reasonable size.\n\nTo rewrite the data we define the partition using `dplyr::group_by()` and then save the partitions to a directory with `arrow::write_dataset()`.\n`write_dataset()` has two important arguments: a directory where we'll create the files and the format we'll use.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npq_path <- \"data/seattle-library-checkouts\"\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv |>\n group_by(CheckoutYear) |>\n write_dataset(path = pq_path, format = \"parquet\")\n```\n:::\n\n\nThis takes about a minute to run; as we'll see shortly this is an initial investment that pays off by making future operations much much faster.\n\nLet's take a look at what we just produced:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntibble(\n files = list.files(pq_path, recursive = TRUE),\n size_MB = file.size(file.path(pq_path, files)) / 1024^2\n)\n#> # A tibble: 18 × 2\n#> files size_MB\n#> \n#> 1 CheckoutYear=2005/part-0.parquet 109.\n#> 2 CheckoutYear=2006/part-0.parquet 164.\n#> 3 
CheckoutYear=2007/part-0.parquet 178.\n#> 4 CheckoutYear=2008/part-0.parquet 195.\n#> 5 CheckoutYear=2009/part-0.parquet 214.\n#> 6 CheckoutYear=2010/part-0.parquet 222.\n#> # … with 12 more rows\n```\n:::\n\n\nOur single 9GB CSV file has been rewritten into 18 parquet files.\nThe file names use a \"self-describing\" convention used by the [Apache Hive](https://hive.apache.org) project.\nHive-style partitions name folders with a \"key=value\" convention, so as you might guess, the `CheckoutYear=2005` directory contains all the data where `CheckoutYear` is 2005.\nEach file is between 100 and 300 MB and the total size is now around 4 GB, a little over half the size of the original CSV file.\nThis is as we expect since parquet is a much more efficient format.\n\n## Using dplyr with arrow\n\nNow we've created these parquet files, we'll need to read them in again.\nWe use `open_dataset()` again, but this time we give it a directory:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_pq <- open_dataset(pq_path)\n```\n:::\n\n\nNow we can write our dplyr pipeline.\nFor example, we could count the total number of books checked out in each month for the last five years:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquery <- seattle_pq |> \n filter(CheckoutYear >= 2018, MaterialType == \"BOOK\") |>\n group_by(CheckoutYear, CheckoutMonth) |>\n summarise(TotalCheckouts = sum(Checkouts)) |>\n arrange(CheckoutYear, CheckoutMonth)\n```\n:::\n\n\nWriting dplyr code for arrow data is conceptually similar to dbplyr, @sec-import-databases: you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, which is then executed when you call `collect()`.\nIf we print out the `query` object we can see a little information about what we expect Arrow to return when the execution takes place:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquery\n#> FileSystemDataset (query)\n#> CheckoutYear: int32\n#> CheckoutMonth: int64\n#> TotalCheckouts: int64\n#> \n#> * Grouped by CheckoutYear\n#> * Sorted by CheckoutYear [asc], CheckoutMonth [asc]\n#> See $.data for the source Arrow object\n```\n:::\n\n\nAnd we can get the results by calling `collect()`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nquery |> collect()\n#> # A tibble: 58 × 3\n#> # Groups: CheckoutYear [5]\n#> CheckoutYear CheckoutMonth TotalCheckouts\n#> \n#> 1 2018 1 355101\n#> 2 2018 2 309813\n#> 3 2018 3 344487\n#> 4 2018 4 330988\n#> 5 2018 5 318049\n#> 6 2018 6 341825\n#> # … with 52 more rows\n```\n:::\n\n\nLike dbplyr, arrow only understands some R expressions, so you may not be able to write exactly the same code you usually would.\nHowever, the list of operations and functions supported is fairly extensive and continues to grow; find a complete list of currently supported functions in `?acero`.\n\n### Performance {#sec-parquet-fast}\n\nLet's take a quick look at the performance impact of switching from CSV to parquet.\nFirst, let's time how long it takes to calculate the number of books checked out in each month of 2021, when the data is stored as a single large csv:\n\n\n::: {.cell hash='arrow_cache/html/dataset-performance-csv_4d24d09e336fc39a348b5ad94f60f527'}\n\n```{.r .cell-code}\nseattle_csv |> \n filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |>\n group_by(CheckoutMonth) |>\n summarise(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutMonth)) |>\n collect() |> \n system.time()\n#> user system elapsed \n#> 11.980 0.924 11.350\n```\n:::\n\n\nNow let's use our new version of the data set in which the 
Seattle library checkout data has been partitioned into 18 smaller parquet files:\n\n\n::: {.cell hash='arrow_cache/html/dataset-performance-multiple-parquet_ad546f5d817df3ad4bdb238240b808d3'}\n\n```{.r .cell-code}\nseattle_pq |> \n filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |>\n group_by(CheckoutMonth) |>\n summarise(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutMonth)) |>\n collect() |> \n system.time()\n#> user system elapsed \n#> 0.273 0.045 0.055\n```\n:::\n\n\nThe \\~100x speedup in performance is attributable to two factors: the multi-file partitioning, and the format of individual files:\n\n- Partitioning improves performance because this query uses `CheckoutYear == 2021` to filter the data, and arrow is smart enough to recognize that it only needs to read 1 of the 18 parquet files.\n- The parquet format improves performance by storing data in a binary format that can be read more directly into memory. The column-wise format and rich metadata means that arrow only needs to read the four columns actually used in the query (`CheckoutYear`, `MaterialType`, `CheckoutMonth`, and `Checkouts`).\n\nThis massive difference in performance is why it pays off to convert large CSVs to parquet!\n\n### Using dbplyr with arrow\n\nThere's one last advantage of parquet and arrow --- it's very easy to turn an arrow dataset into a duckdb datasource by calling `arrow::to_duckdb()`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_pq |> \n to_duckdb() |>\n filter(CheckoutYear >= 2018, MaterialType == \"BOOK\") |>\n group_by(CheckoutYear) |>\n summarise(TotalCheckouts = sum(Checkouts)) |>\n arrange(desc(CheckoutYear)) |>\n collect()\n#> Warning: Missing values are always removed in SQL aggregation functions.\n#> Use `na.rm = TRUE` to silence this warning\n#> This warning is displayed once every 8 hours.\n#> # A tibble: 5 × 2\n#> CheckoutYear TotalCheckouts\n#> \n#> 1 2022 2431502\n#> 2 2021 2266438\n#> 3 2020 1241999\n#> 4 2019 3931688\n#> 5 2018 3987569\n```\n:::\n\n\nThe neat thing about `to_duckdb()` is that the transfer doesn't involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.\n\n## Summary\n\nIn this chapter, you've been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets.\nIt can work with CSV files, its much much faster if you convert your data to parquet.\nParquet is a binary data format that's designed specifically for data analysis on modern computers.\nFar fewer tools can work with parquet files compared to CSV, but it's partitioned, compressed, and columnar structure makes it much more efficient to analyze.\n\nNext up you'll learn about your first non-rectangular data source, which you'll handle using tools provided by the tidyr package.\nWe'll focus on data that comes from JSON files, but the general principles apply to tree-like data regardless of its source.\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_quarto.yml b/_quarto.yml index 3aaa96e..203b852 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -57,6 +57,7 @@ book: chapters: - spreadsheets.qmd - databases.qmd + - arrow.qmd - rectangling.qmd - webscraping.qmd diff --git a/arrow.qmd b/arrow.qmd new file mode 100644 index 0000000..bb7fdcb --- /dev/null +++ b/arrow.qmd @@ -0,0 +1,291 @@ +--- +freeze: true +--- 
+
+# Arrow {#sec-arrow}
+
+```{r}
+#| results: "asis"
+#| echo: false
+source("_common.R")
+status("polishing")
+```
+
+## Introduction
+
+CSV files are designed to be easily read by humans.
+They're a good interchange format because they're very simple and they can be read by every tool under the sun.
+But CSV files aren't very efficient: you have to do quite a lot of work to read the data into R.
+In this chapter, you'll learn about a powerful alternative: the [parquet format](https://parquet.apache.org/), an open standards-based format widely used by big data systems.
+
+We'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large data sets.
+We'll use Apache Arrow via the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.
+As an additional benefit, arrow is extremely fast: you'll see some examples later in the chapter.
+
+Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each.
+In many cases, the choice is made for you: perhaps the data is already in a database or in parquet files, and you'll want to work with it as is.
+But if you're starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet.
+In general, it's hard to know what will work best, so in the early stages of your analysis we'd encourage you to try both and pick the one that works best for you.
+
+### Prerequisites
+
+In this chapter, we'll continue to use the tidyverse, particularly dplyr, but we'll pair it with the arrow package, which is designed specifically for working with large data.
+
+```{r setup}
+#| message: false
+#| warning: false
+library(tidyverse)
+library(arrow)
+```
+
+Later in the chapter, we'll also see some connections between arrow and duckdb, so we'll also need dbplyr and duckdb.
+
+```{r}
+library(dbplyr, warn.conflicts = FALSE)
+library(duckdb)
+```
+
+## Getting the data
+
+We begin by getting a dataset worthy of these tools: a dataset of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).
+This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.
+
+The following code will get you a cached copy of the data.
+The data is a 9 GB CSV file, so it will take some time to download: simply getting the data is often the first challenge!
+
+```{r}
+#| eval: false
+dir.create("data", showWarnings = FALSE)
+url <- "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv"
+
+# Default timeout is 60s; bump it up to an hour
+options(timeout = 60 * 60)
+download.file(url, "data/seattle-library-checkouts.csv")
+```
+
+## Opening a dataset
+
+Let's start by taking a look at the data.
+At 9 GB, this file is large enough that we probably don't want to load the whole thing into memory.
+A good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16 GB.
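+
+If you want to double-check how big the file actually is before deciding how to read it, you can ask R for its size on disk.
+This is just a sketch; it assumes the download above finished and used the same path:
+
+```{r}
+#| eval: false
+# Size of the downloaded CSV in gigabytes (about 9 GB at the time of writing)
+file.size("data/seattle-library-checkouts.csv") / 1024^3
+```
+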
+This means we want to avoid `read_csv()` and instead use `arrow::open_dataset()`:
+
+```{r open-dataset}
+# partial schema for ISBN column only
+opts <- CsvConvertOptions$create(col_types = schema(ISBN = string()))
+
+seattle_csv <- open_dataset(
+  sources = "data/seattle-library-checkouts.csv",
+  format = "csv",
+  convert_options = opts
+)
+```
+
+(Here we've had to use some relatively advanced code to parse the ISBN variable correctly: this is because the first \~83,000 rows don't contain any data so arrow guesses the wrong types. The arrow team is aware of this problem and there will hopefully be a better approach by the time you read this chapter.)
+
+What happens when this code is run?
+`open_dataset()` will scan a few thousand rows to figure out the structure of the data set.
+Then it records what it's found and stops; it will only read further rows as you specifically request them.
+This metadata is what we see if we print `seattle_csv`:
+
+```{r}
+seattle_csv
+```
+
+The first line in the output tells you that `seattle_csv` is stored locally on-disk as a single CSV file; it will only be loaded into memory as needed.
+The remainder of the output tells you the column type that arrow has imputed for each column.
+
+We can see what's actually in it with `glimpse()`.
+This reveals that there are \~41 million rows and 12 columns, and shows us a few values.
+
+```{r glimpse-data}
+#| cache: true
+seattle_csv |> glimpse()
+```
+
+We can start to use this dataset with dplyr verbs, using `collect()` to force arrow to perform the computation and return some data.
+For example, this code tells us the total number of checkouts per year:
+
+```{r}
+#| cache: true
+seattle_csv |>
+  count(CheckoutYear, wt = Checkouts) |>
+  arrange(CheckoutYear) |>
+  collect()
+```
+
+Thanks to arrow, this code will work regardless of how large the underlying dataset is.
+But it's currently rather slow: on Hadley's computer, it took \~10s to run.
+That's not terrible given how much data we have, but we can make it much faster by switching to a better format.
+
+## The parquet format
+
+To make this data easier to work with, let's switch to the parquet file format and split it up into multiple files.
+The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.
+
+### Advantages of parquet
+
+Like CSV, parquet is used for rectangular data, but instead of being a text format that you can read with any file editor, it's a custom binary format designed specifically for the needs of big data.
+This means that:
+
+- Parquet files are usually smaller than the equivalent CSV file.
+  Parquet relies on [efficient encodings](https://parquet.apache.org/docs/file-format/data-pages/encodings/) to keep file size down, and supports file compression.
+  This helps make parquet files fast because there's less data to move from disk to memory; the sketch after this list shows the difference on a small example.
+
+- Parquet files have a rich type system.
+  As we talked about in @sec-col-types, a CSV file does not provide any information about column types.
+  For example, a CSV reader has to guess whether `"08-10-2022"` should be parsed as a string or a date.
+  In contrast, parquet files store data in a way that records the type along with the data.
+
+- Parquet files are "column-oriented".
+  This means that they're organised column-by-column, much like R's data frames.
+  This typically leads to better performance for data analysis tasks compared to CSV files, which are organised row-by-row.
+
+- Parquet files are "chunked", which makes it possible to work on different parts of the file at the same time, and, if you're lucky, to skip some chunks altogether.
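+
+To get a feel for the size difference, here is a minimal sketch that writes the same small data frame to both formats and compares the sizes on disk.
+It uses `ggplot2::diamonds` (which comes along with the tidyverse) purely for illustration, and it's only a sketch: the exact numbers will depend on your arrow version and compression settings.
+
+```{r}
+#| eval: false
+csv_path <- tempfile(fileext = ".csv")
+parquet_path <- tempfile(fileext = ".parquet")
+
+write_csv(diamonds, csv_path)
+write_parquet(diamonds, parquet_path)
+
+# File sizes in MB; parquet is typically several times smaller than CSV
+c(csv = file.size(csv_path), parquet = file.size(parquet_path)) / 1024^2
+```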
+
+### Partitioning
+
+As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it's often useful to split large datasets across many files.
+When this is done intelligently, partitioning can lead to significant improvements in performance because many analyses will only require a subset of the files.
+
+There are no hard and fast rules about how to partition your data set: the results will depend on your data, access patterns, and the systems that read the data.
+You're likely to need to do some experimentation before you find the ideal partitioning for your situation.
+As a rough guide, arrow suggests that you avoid files smaller than 20 MB or larger than 2 GB, and avoid partitions that produce more than 10,000 files.
+You should also try to partition by variables that you filter by; as you'll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.
+
+### Rewriting the Seattle library data
+
+Let's apply these ideas to the Seattle library data to see how they play out in practice.
+We're going to partition by `CheckoutYear`, since it's likely some analyses will only want to look at recent data and partitioning by year yields 18 chunks of a reasonable size.
+
+To rewrite the data, we define the partition using `dplyr::group_by()` and then save the partitions to a directory with `arrow::write_dataset()`.
+`write_dataset()` has two important arguments: a directory where we'll create the files and the format we'll use.
+
+```{r}
+pq_path <- "data/seattle-library-checkouts"
+```
+
+```{r write-dataset}
+#| eval: !expr "!file.exists(pq_path)"
+
+seattle_csv |>
+  group_by(CheckoutYear) |>
+  write_dataset(path = pq_path, format = "parquet")
+```
+
+This takes about a minute to run; as we'll see shortly, this is an initial investment that pays off by making future operations much, much faster.
+
+Let's take a look at what we just produced:
+
+```{r show-parquet-files}
+tibble(
+  files = list.files(pq_path, recursive = TRUE),
+  size_MB = file.size(file.path(pq_path, files)) / 1024^2
+)
+```
+
+Our single 9 GB CSV file has been rewritten into 18 parquet files.
+The file names use a "self-describing" convention from the [Apache Hive](https://hive.apache.org) project.
+Hive-style partitions name folders with a "key=value" convention, so as you might guess, the `CheckoutYear=2005` directory contains all the data where `CheckoutYear` is 2005.
+Each file is between 100 and 300 MB and the total size is now around 4 GB, a little over half the size of the original CSV file.
+This is as we expect since parquet is a much more efficient format.
+
+## Using dplyr with arrow
+
+Now that we've created these parquet files, we need to read them in again.
+We use `open_dataset()` again, but this time we give it a directory:
+
+```{r}
+seattle_pq <- open_dataset(pq_path)
+```
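+
+Before building a bigger pipeline, it can be reassuring to peek at a few rows and check that the partitioned dataset looks right.
+Here's a minimal sketch; only the handful of rows you ask for are actually read into memory:
+
+```{r}
+#| eval: false
+seattle_pq |>
+  head(3) |>
+  collect()
+```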
+
+Now we can write our dplyr pipeline.
+For example, we could count the total number of books checked out in each month for the last five years:
+
+```{r books-by-year-query}
+query <- seattle_pq |>
+  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
+  group_by(CheckoutYear, CheckoutMonth) |>
+  summarise(TotalCheckouts = sum(Checkouts)) |>
+  arrange(CheckoutYear, CheckoutMonth)
+```
+
+Writing dplyr code for arrow data is conceptually similar to dbplyr (@sec-import-databases): you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, which is then executed when you call `collect()`.
+If we print out the `query` object, we can see a little information about what we expect Arrow to return when the execution takes place:
+
+```{r}
+query
+```
+
+And we can get the results by calling `collect()`:
+
+```{r books-by-year}
+query |> collect()
+```
+
+Like dbplyr, arrow only understands some R expressions, so you may not be able to write exactly the same code you usually would.
+However, the list of operations and functions supported is fairly extensive and continues to grow; find a complete list of currently supported functions in `?acero`.
+
+### Performance {#sec-parquet-fast}
+
+Let's take a quick look at the performance impact of switching from CSV to parquet.
+First, let's time how long it takes to calculate the number of books checked out in each month of 2021, when the data is stored as a single large CSV:
+
+```{r dataset-performance-csv}
+#| cache: true
+
+seattle_csv |>
+  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
+  group_by(CheckoutMonth) |>
+  summarise(TotalCheckouts = sum(Checkouts)) |>
+  arrange(desc(CheckoutMonth)) |>
+  collect() |>
+  system.time()
+```
+
+Now let's use our new version of the data set in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:
+
+```{r dataset-performance-multiple-parquet}
+#| cache: true
+
+seattle_pq |>
+  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
+  group_by(CheckoutMonth) |>
+  summarise(TotalCheckouts = sum(Checkouts)) |>
+  arrange(desc(CheckoutMonth)) |>
+  collect() |>
+  system.time()
+```
+
+The \~100x speedup in performance is attributable to two factors: the multi-file partitioning and the format of the individual files.
+
+- Partitioning improves performance because this query uses `CheckoutYear == 2021` to filter the data, and arrow is smart enough to recognize that it only needs to read 1 of the 18 parquet files.
+- The parquet format improves performance by storing data in a binary format that can be read more directly into memory. The column-wise format and rich metadata mean that arrow only needs to read the four columns actually used in the query (`CheckoutYear`, `MaterialType`, `CheckoutMonth`, and `Checkouts`).
+
+This massive difference in performance is why it pays off to convert large CSVs to parquet!
+
+### Using dbplyr with arrow
+
+There's one last advantage of parquet and arrow --- it's very easy to turn an arrow dataset into a duckdb data source by calling `arrow::to_duckdb()`:
+
+```{r use-duckdb}
+seattle_pq |>
+  to_duckdb() |>
+  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
+  group_by(CheckoutYear) |>
+  summarise(TotalCheckouts = sum(Checkouts)) |>
+  arrange(desc(CheckoutYear)) |>
+  collect()
+```
+
+The neat thing about `to_duckdb()` is that the transfer doesn't involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.
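+
+If you're curious what DuckDB is actually asked to run, you can use dbplyr's `show_query()` to print the SQL generated for the pipeline above.
+This is only a sketch; the exact SQL will vary with your dbplyr and duckdb versions:
+
+```{r}
+#| eval: false
+seattle_pq |>
+  to_duckdb() |>
+  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
+  group_by(CheckoutYear) |>
+  summarise(TotalCheckouts = sum(Checkouts)) |>
+  arrange(desc(CheckoutYear)) |>
+  show_query()
+```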
+
+## Summary
+
+In this chapter, you've been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets.
+It can work with CSV files, but it's much, much faster if you convert your data to parquet.
+Parquet is a binary data format that's designed specifically for data analysis on modern computers.
+Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.
+
+Next up, you'll learn about your first non-rectangular data source, which you'll handle using tools provided by the tidyr package.
+We'll focus on data that comes from JSON files, but the general principles apply to tree-like data regardless of its source.
diff --git a/data-import.qmd b/data-import.qmd
index ef2543d..0fa14e6 100644
--- a/data-import.qmd
+++ b/data-import.qmd
@@ -413,7 +413,7 @@ read_csv("students-2.csv")
 ```
 
 This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load in.
-There are two main options:
+There are two main alternatives:
 
 1.  `write_rds()` and `read_rds()` are uniform wrappers around the base functions `readRDS()` and `saveRDS()`.
     These store data in R's custom binary format called RDS:
 
@@ -423,7 +423,8 @@ There are two main options:
     read_rds("students.rds")
     ```
 
-2.  The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages:
+2.  The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages.
+    We'll come back to arrow in more depth in @sec-arrow.
 
     ```{r}
    #| eval: false
@@ -442,7 +443,7 @@ There are two main options:
    #> 6 6 Güvenç Attila Ice cream Lunch only 6
    ```
 
-Parquet tends to be much faster than RDS and is usable outside of R, but does require you install the arrow package.
+Parquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.
 
 ```{r}
 #| include: false
diff --git a/databases.qmd b/databases.qmd
index e67dfe8..cc86fc0 100644
--- a/databases.qmd
+++ b/databases.qmd
@@ -673,10 +673,16 @@ flights |>
 dbplyr also translates common string and date-time manipulation functions, which you can learn about in `vignette("translation-function", package = "dbplyr")`.
 dbplyr's translations are certainly not perfect, and there are many R functions that aren't translated yet, but dbplyr does a surprisingly good job covering the functions that you'll use most of the time.
 
-### Learning more
+## Summary
+In this chapter, you learned how to access data from databases.
+We focused on dbplyr, a dplyr "backend" that allows you to write the dplyr code you're familiar with, and have it be automatically translated to SQL.
+We used that translation to teach you a little SQL; it's important to learn some SQL because it's *the* most commonly used language for working with data and knowing some will make it easier for you to communicate with other data folks who don't use R.
 
 If you've finished this chapter and would like to learn more about SQL, we have two recommendations:
 
 -   [*SQL for Data Scientists*](https://sqlfordatascientists.com) by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organisations.
 
 -   [*Practical SQL*](https://www.practicalsql.com) by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.
+
+In the next chapter, we'll learn about another dplyr backend for working with large data: arrow.
+Arrow is designed for working with large files on disk, and is a natural complement to databases.
diff --git a/wrangle.qmd b/wrangle.qmd
index f5895e3..9b8ff1a 100644
--- a/wrangle.qmd
+++ b/wrangle.qmd
@@ -32,14 +32,13 @@ This part of the book proceeds as follows:
 
 -   In @sec-import-databases, you'll learn about getting data into R from databases.
 
+-   In @sec-arrow, you'll learn about Arrow, a powerful tool for working with large on-disk files.
+
 -   In @sec-rectangling, you'll learn how to work with hierarchical data that includes deeply nested lists, as is often created when your raw data is in JSON.
 
 -   In @sec-scraping, you'll learn about harvesting data off the web and getting it into R.
 
-Some other types of data are not covered in this book:
-
--   **haven** reads SPSS, Stata, and SAS files.
-
--   xml2 for **xml2** for XML
-
-For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [**rio**](https://github.com/leeper/rio) package.
+There are two important tidyverse packages that we don't discuss here: haven and xml2.
+If you're working with data from SPSS, Stata, and SAS files, check out the **haven** package.
+If you're working with XML, check out the **xml2** package.
+Otherwise, you'll need to do some research to figure out which package you'll need to use; Google is your friend here 😃.