Update arrow chapter code to avoid errors (#1517)

* Add in `col_types` to specify schema

* Just use open_dataset()
Nic Crane 2023-07-16 13:29:21 +01:00 committed by GitHub
parent 2674b870ae
commit c1e1437fd8
1 changed file with 3 additions and 1 deletion

@@ -76,13 +76,15 @@ This means we want to avoid `read_csv()` and instead use the `arrow::open_datase
```{r open-dataset}
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
+ col_types = schema(ISBN = string()),
  format = "csv"
)
```
What happens when this code is run?
`open_dataset()` will scan a few thousand rows to figure out the structure of the dataset.
-Then it records what it's found and stops; it will only read further rows as you specifically request them.
+The `ISBN` column contains blank values for the first 80,000 rows, so we have to specify the column type to help arrow work out the data structure.
+Once the data has been scanned by `open_dataset()`, it records what it's found and stops; it will only read further rows as you specifically request them.
This metadata is what we see if we print `seattle_csv`:
```{r}