r4ds/_freeze/arrow/execute-results/html.json

{
"hash": "1095f33fdacab861f9d700db0157b5a7",
"result": {
"markdown": "---\nfreeze: true\n---\n\n\n# Arrow {#sec-arrow}\n\n\n\n:::: status\n::: callout-note \nYou are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <https://r4ds.had.co.nz>.\n:::\n::::\n\n\n## Introduction\n\nCSV files are designed to be easily read by humans.\nThey're a good interchange format because they're very simple and they can be read by every tool under the sun.\nBut CSV files aren't very efficient: you have to do quite a lot of work to read the data into R.\nIn this chapter, you'll learn about a powerful alternative: the [parquet format](https://parquet.apache.org/), an open standards-based format widely used by big data systems.\n\nWe'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large data sets.\nWe'll use Apache Arrow via the the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.\nAs an additional benefit, arrow is extremely fast: you'll see some examples later in the chapter.\n\nBoth arrow and dbplyr provide dplyr backends, so you might wonder when to use each.\nIn many cases, the choice is made for you, as in the data is already in a database or in parquet files, and you'll want to work with it as is.\nBut if you're starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet.\nIn general, it's hard to know what will work best, so in the early stages of your analysis we'd encourage you to try both and pick the one that works the best for you.\n\n### Prerequisites\n\nIn this chapter, we'll continue to use the tidyverse, particularly dplyr, but we'll pair it with the arrow package which is designed specifically for working with large data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(arrow)\n```\n:::\n\n\nLater in the chapter, we'll also see some connections between arrow and duckdb, so we'll also need dbplyr and duckdb.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dbplyr, warn.conflicts = FALSE)\nlibrary(duckdb)\n#> Loading required package: DBI\n```\n:::\n\n\n## Getting the data\n\nWe begin by getting a dataset worthy of these tools: a data set of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).\nThis dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2015 to October 2022.\n\nThe following code will get you a cached copy of the data.\nThe data is a 9GB CSV file, so it will take some time to download: simply getting the data is often the first challenge!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndir.create(\"data\", showWarnings = FALSE)\nurl <- \"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv\"\n\n# Default timeout is 60s; bump it up to an hour\noptions(timeout = 60 * 60)\ndownload.file(url, \"data/seattle-library-checkouts.csv\")\n```\n:::\n\n\n## Opening a dataset\n\nLet's start by taking a look at the data.\nAt 9GB, this file is large enough that we probably don't want to load the whole thing into memory.\nA good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top 
## Opening a dataset\n\nLet's start by taking a look at the data.\nAt 9 GB, this file is large enough that we probably don't want to load the whole thing into memory.\nA good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16 GB.\nThis means we want to avoid `read_csv()` and instead use `arrow::open_dataset()`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# partial schema for ISBN column only\nopts <- CsvConvertOptions$create(col_types = schema(ISBN = string()))\n\nseattle_csv <- open_dataset(\n  sources = \"data/seattle-library-checkouts.csv\",\n  format = \"csv\",\n  convert_options = opts\n)\n```\n:::\n\n\n(Here we've had to use some relatively advanced code to parse the ISBN variable correctly: this is because the first \\~83,000 rows don't contain any data, so arrow guesses the wrong type.)\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}