Try simpler code with latest arrow (#1334)

Hadley Wickham 2023-03-07 08:05:43 -06:00 committed by GitHub
parent c6edfb977e
commit 810b9f6a3c
2 changed files with 3 additions and 9 deletions

@@ -75,18 +75,12 @@ A good rule of thumb is that you usually want at least twice as much memory as t
 This means we want to avoid `read_csv()` and instead use the `arrow::open_dataset()`:
 
 ```{r open-dataset}
-# partial schema for ISBN column only
-opts <- CsvConvertOptions$create(col_types = schema(ISBN = string()))
 seattle_csv <- open_dataset(
   sources = "data/seattle-library-checkouts.csv",
-  format = "csv",
-  convert_options = opts
+  format = "csv"
 )
 ```
 
-(Here we've had to use some relatively advanced code to parse the ISBN variable correctly: this is because the first \~83,000 rows don't contain any data so arrow guesses the wrong types.
-The arrow team is aware of this problem and there will hopefully be a better approach by the time you read this chapter.)
-
 What happens when this code is run?
 `open_dataset()` will scan a few thousand rows to figure out the structure of the dataset.
 Then it records what it's found and stops; it will only read further rows as you specifically request them.
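
For readers following along, here's a minimal sketch of that lazy behavior (assuming the `seattle_csv` object created in the chunk above, and that the CSV has a `CheckoutYear` column, as in the Seattle library checkouts data):

```{r}
# Assumes seattle_csv was created with open_dataset() as in the diff above.
library(arrow)
library(dplyr)

# Printing the dataset only reports the schema arrow inferred from its
# initial scan; no rows have been read into memory yet.
seattle_csv

# Rows are only read when a query forces them, e.g. via collect():
seattle_csv |>
  count(CheckoutYear) |>
  collect()
```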