From c1e1437fd8efa2698ec8bdc7fad1e142ce28185e Mon Sep 17 00:00:00 2001
From: Nic Crane
Date: Sun, 16 Jul 2023 13:29:21 +0100
Subject: [PATCH] Update arrow chapter code to avoid errors (#1517)

* Add in `col_types` to specify schema

* Just use open_dataset()
---
 arrow.qmd | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arrow.qmd b/arrow.qmd
index 30666cf..36f3e21 100644
--- a/arrow.qmd
+++ b/arrow.qmd
@@ -76,13 +76,15 @@ This means we want to avoid `read_csv()` and instead use the `arrow::open_datase
 ```{r open-dataset}
 seattle_csv <- open_dataset(
   sources = "data/seattle-library-checkouts.csv",
+  col_types = schema(ISBN = string()),
   format = "csv"
 )
 ```
 
 What happens when this code is run?
 `open_dataset()` will scan a few thousand rows to figure out the structure of the dataset.
-Then it records what it's found and stops; it will only read further rows as you specifically request them.
+The `ISBN` column contains blank values for the first 80,000 rows, so we have to specify the column type to help arrow work out the data structure.
+Once the data has been scanned by `open_dataset()`, it records what it's found and stops; it will only read further rows as you specifically request them.
 This metadata is what we see if we print `seattle_csv`:
 
 ```{r}