r4ds/_freeze/arrow/execute-results/html.json

{
"hash": "8638b22ba6d178b3e118f2d38845fc2e",
"result": {
"engine": "knitr",
"markdown": "---\nfreeze: true\n---\n\n\n# Arrow {#sec-arrow}\n\n\n::: {.cell}\n\n:::\n\n\n## Introduction\n\nCSV files are designed to be easily read by humans.\nThey're a good interchange format because they're very simple and they can be read by every tool under the sun.\nBut CSV files aren't very efficient: you have to do quite a lot of work to read the data into R.\nIn this chapter, you'll learn about a powerful alternative: the [parquet format](https://parquet.apache.org/), an open standards-based format widely used by big data systems.\n\nWe'll pair parquet files with [Apache Arrow](https://arrow.apache.org), a multi-language toolbox designed for efficient analysis and transport of large datasets.\nWe'll use Apache Arrow via the [arrow package](https://arrow.apache.org/docs/r/), which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax.\nAs an additional benefit, arrow is extremely fast: you'll see some examples later in the chapter.\n\nBoth arrow and dbplyr provide dplyr backends, so you might wonder when to use each.\nIn many cases, the choice is made for you, as the data is already in a database or in parquet files, and you'll want to work with it as is.\nBut if you're starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet.\nIn general, it's hard to know what will work best, so in the early stages of your analysis we'd encourage you to try both and pick the one that works the best for you.\n\n(A big thanks to Danielle Navarro who contributed the initial version of this chapter.)\n\n### Prerequisites\n\nIn this chapter, we'll continue to use the tidyverse, particularly dplyr, but we'll pair it with the arrow package which is designed specifically for working with large data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidyverse)\nlibrary(arrow)\n```\n:::\n\n\nLater in the chapter, we'll also see some connections between arrow and duckdb, so we'll also need dbplyr and duckdb.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(dbplyr, warn.conflicts = FALSE)\nlibrary(duckdb)\n#> Loading required package: DBI\n```\n:::\n\n\n## Getting the data\n\nWe begin by getting a dataset worthy of these tools: a dataset of item checkouts from Seattle public libraries, available online at [data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6).\nThis dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.\n\nThe following code will get you a cached copy of the data.\nThe data is a 9GB CSV file, so it will take some time to download.\nI highly recommend using `curl::multi_download()` to get very large files as it's built for exactly this purpose: it gives you a progress bar and it can resume the download if its interrupted.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndir.create(\"data\", showWarnings = FALSE)\n\ncurl::multi_download(\n \"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv\",\n \"data/seattle-library-checkouts.csv\",\n resume = TRUE\n)\n#> # A tibble: 1 × 10\n#> success status_code resumefrom url destfile error\n#> <lgl> <int> <dbl> <chr> <chr> <chr>\n#> 1 TRUE 200 0 https://r4ds.s3.us-we… data/seattle-l… <NA> \n#> # 4 more variables: type <chr>, modified <dttm>, time <dbl>,\n#> # headers <list>\n```\n:::\n\n\n## Opening a dataset\n\nLet's start by taking a look at the data.\nAt 9 GB, this file is large enough that we 
probably don't want to load the whole thing into memory.\nA good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16 GB.\nThis means we want to avoid `read_csv()` and instead use the `arrow::open_dataset()`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nseattle_csv <- open_dataset(\n sources = \"data/seattle-library-checkouts.csv\", \n col_types = schema(ISBN =
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}