You can't apply any of the tools you've applied so far to your own work, unless you can get your own data into R. In this chapter, you'll learn how to import:
* Flat files (like csv) with readr.
* Database queries with DBI.
* Data from web APIs with httr.
* Binary file formats (like Excel or SAS) with haven and readxl.
The common link between all these packages is that they aim to take your data and turn it into a data frame in R, so you can tidy it and then analyse it.
There are many ways to read flat files into R. If you've been using R for a while, you might be familiar with `read.csv()`, `read.fwf()` and friends. We're not going to use these base functions. Instead we're going to use `read_csv()`, `read_fwf()`, and friends from the readr package, because:
* These functions are typically much faster (~10x) than the base equivalents.
Long running jobs also have a progress bar, so you can see what's
happening. (If you're looking for raw speed, try `data.table::fread()`.
It's slightly less flexible than readr, but can be twice as fast.)
* They have more flexible parsers: they can read in dates, times, currencies,
percentages, and more.
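* They don't do some annoying things like converting character vectors to
factors.
readr provides a number of functions for reading flat files into data frames:
* `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon
separated files (common in countries where `,` is used as the decimal mark),
`read_tsv()` reads tab delimited files, and `read_delim()` reads in files
with a user supplied delimiter.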
* `read_fwf()` reads fixed width files. You can specify fields either by their
widths with `fwf_widths()` or their positions with `fwf_positions()`.
`read_table()` reads a common variation of fixed width files where columns
are separated by white space.
* `read_log()` reads Apache style logs. (But also check out
[webreadr](https://github.com/Ironholds/webreadr) which is built on top
of `read_log()`, but provides many more helpful tools.)
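As a rough sketch of how these readers differ, the chunk below runs a few of them on tiny made-up inputs. The data, the column names, and the temporary file are all invented for illustration. A string that contains a newline is treated as literal data rather than as a path.

```{r}
library(readr)

# Semicolon separated values, with "," as the decimal mark
semi <- read_csv2("a;b\n1,5;2\n")

# Tab delimited values
tabbed <- read_tsv("a\tb\n1\t2\n")

# Any other single character delimiter
piped <- read_delim("a|b\n1|2\n", delim = "|")

# Fixed width: write two made-up records to a temporary file
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("John 23", "Mary 31"), fwf_file)

# Describe the fields by their widths...
by_width <- read_fwf(fwf_file, fwf_widths(c(5, 2), c("name", "age")))

# ...or by their start and end positions
by_position <- read_fwf(
  fwf_file,
  fwf_positions(c(1, 6), c(4, 7), c("name", "age"))
)
```

Both fixed width calls should describe the same two fields. Which helper is more convenient depends on whether your file's documentation lists widths or positions.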
readr also provides a number of functions for reading files off disk into simpler data structures:
* `read_file()` reads an entire file into a single string.
* `read_lines()` reads a file into a character vector with one element per line.
These might be useful for other programming tasks.
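A minimal sketch of the difference between the two, using a small temporary file whose contents are made up for this example:

```{r}
library(readr)

# A made-up file with two lines
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), path)

whole <- read_file(path)   # one string holding the entire file
lines <- read_lines(path)  # character vector, one element per line

length(whole)
length(lines)
```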
As well as reading data from disk, readr also provides tools for working with data frames and character vectors in R:
* `type_convert()` applies the same parsing heuristics to the character columns
in a data frame. You can override its choices using `col_types`.
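For instance, here is a small sketch of `type_convert()` on a made-up data frame whose columns have all ended up as character vectors:

```{r}
library(readr)

# A made-up data frame where every column is character
df <- data.frame(
  x = c("1", "2", "3"),
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# x is guessed to be numeric, y stays character
guessed <- type_convert(df)

# Override the guess for x
forced <- type_convert(df, col_types = cols(x = col_integer()))
```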
For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to apply your knowledge to all the other functions in readr.
### Basics
The first two arguments of `read_csv()` are:
* `file`: path (or URL) to the file you want to load. Readr can automatically
decompress files ending in `.zip`, `.gz`, `.bz2`, and `.xz`. This can also
be a literal csv file, which is useful for experimenting and creating
reproducible examples.
* `col_names`: column names. There are three options:
    * `TRUE` (the default), which reads column names from the first row
      of the file.
    * `FALSE`, which numbers columns sequentially from `X1` to `Xn`.
    * A character vector, used as column names. If these don't match up
      with the columns in the data, you'll get a warning message.
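The three `col_names` options can be sketched with literal csv input. The data and the names `a` and `b` are made up for illustration:

```{r}
library(readr)

# Literal csv: the first row supplies the column names
with_header <- read_csv("x,y\n1,2\n3,4\n")

# No header row: columns are numbered X1, X2, ...
no_header <- read_csv("1,2\n3,4\n", col_names = FALSE)

# Supply your own names
named <- read_csv("1,2\n3,4\n", col_names = c("a", "b"))

names(with_header)
names(no_header)
names(named)
```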
Readr guesses the type of each column: it reads the first 1000 rows and applies some (moderately conservative) heuristics. This is fast, and fairly robust. If readr detects the wrong type of data, you'll get warning messages. Readr prints out the first five, and you can access them all with `problems()`.
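Here is a small sketch of what that looks like. To keep the example short and predictable, the column type is forced to integer with `col_types` rather than guessed, and the data is made up so that one value cannot be parsed:

```{r}
library(readr)

# Made-up data: the second value is not an integer
df <- read_csv("x\n1\nabc\n3\n", col_types = cols(x = col_integer()))

# One row per parsing failure, describing where it happened and what
# readr expected to find
problems(df)
```

The value that could not be parsed becomes `NA` in the data frame, so the rest of the column is still usable while you investigate.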
Typically, you'll see a lot of warnings if readr has guessed the column type incorrectly. This most often occurs when the first 1000 rows are different to the rest of the data. Perhaps there is a lot of missing data there, or maybe your data is mostly numeric but a few rows have characters. Fortunately, it's easy to fix these problems using the `col_types` argument.
(Note that if you have a very large file, you might want to set `n_max` to 10,000 or 100,000. That will speed up iteration while you're finding common problems.)
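A tiny sketch of `n_max`, using made-up literal data in place of a large file:

```{r}
library(readr)

# Made-up data with five rows; only the first two are read
df <- read_csv("x\n1\n2\n3\n4\n5\n", n_max = 2)
nrow(df)
```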
* `col_skip()` (_, -) completely ignores a column.
* `col_date()` (D), `col_datetime()` (T) and `col_time()` (t) parse into dates,
date times, and times as described below.
You might have noticed that each column parser has a one letter abbreviation, which you can use instead of the full function call (assuming you're happy with the default arguments):
```{r, eval = FALSE}
read_csv("mypath.csv", col_types = cols(
  x = "i",
  treatment = "c"
))
```
(If you just have a few columns, you can supply a single string giving the type for each column: `i__dc`. See the documentation for more details. It's not as easy to understand as the `cols()` specification, so I'm not going to describe it further here.)
By default, any column not mentioned in `cols()` will be guessed. If you'd rather those columns are simply not read in, use `cols_only()`. In that case, you can use `col_guess()` (`?`) if you want to guess the type of a column.
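The difference between `cols()` and `cols_only()` can be sketched on a small literal csv (the column names and values are made up):

```{r}
library(readr)

csv <- "x,y,z\n1,a,2.5\n2,b,3.5\n"

# cols(): x is read as integer, y and z are guessed
all_cols <- read_csv(csv, col_types = cols(x = col_integer()))

# cols_only(): y is dropped; z is kept and its type is guessed
some_cols <- read_csv(csv, col_types = cols_only(
  x = col_integer(),
  z = col_guess()
))

names(all_cols)
names(some_cols)
```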
Each `col_XYZ()` function also has a corresponding `parse_XYZ()` that you can use on a character vector. This makes it easier to explore what each of the parsers does interactively.
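For example, a few of the `parse_XYZ()` functions applied to made-up character vectors:

```{r}
library(readr)

parse_integer(c("1", "2", "3"))
parse_double(c("1.5", "2.25"))
parse_logical(c("TRUE", "FALSE"))
```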
Parsing occurs after leading and trailing whitespace has been removed (unless you override this with `trim_ws = FALSE`) and after the values listed in `na` have been converted to missing values.
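A short sketch of that order of operations, using a made-up vector where `.` stands for a missing value:

```{r}
library(readr)

# Whitespace is trimmed first, then "." is treated as missing
x <- parse_integer(c(" 1 ", "2", " . "), na = ".")
x
```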
Readr provides three options depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:
* Date times: an [ISO8601](https://en.wikipedia.org/wiki/ISO_8601) date time.
* Date: a year, optional separator, month, optional separator, day.
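As a quick sketch of the defaults, the three parsers can be tried on made-up strings without supplying any format:

```{r}
library(readr)

d  <- parse_date("2010-10-01")
dt <- parse_datetime("2010-10-01T20:10:00")  # an ISO8601 date time
tm <- parse_time("20:10:01")

unclass(d)      # days since 1970-01-01
as.numeric(tm)  # seconds since midnight
```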
* Time zone: `%Z` (as name, e.g. `America/Chicago`), `%z` (as offset from UTC,
e.g. `+0800`). If you're American, note that "EST" is a Canadian time zone
that does not have daylight saving time. It is *not* Eastern Standard
Time!
* AM/PM indicator: `%p`.
* Non-digits: `%.` skips one non-digit character, `%*` skips any number of
non-digits.
The best way to figure out the correct format string is to create a few examples in a character vector, and test them with one of the parsing functions.
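A sketch of that workflow: the same made-up string is parsed under three different assumptions about its layout. The codes `%Y`, `%y`, `%m` and `%d` (four digit year, two digit year, month, day) are readr format codes that are not listed in the excerpt above.

```{r}
library(readr)

# The same string read as month/day/year, day/month/year, year/month/day
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")

# %. skips a single non-digit separator
parse_date("2015-01-02", "%Y%.%m%.%d")
```

Each call returns a different date, which is why it pays to test a format string against values whose correct interpretation you already know.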
The goal of readr's locales is to encapsulate the common options that vary between languages and different regions of the world. This includes:
* Names of months and days, used when parsing dates.
* The default time zones, used when parsing date times.
* The character encoding, used when reading non-ASCII strings.
* Default date and time formats, used when guessing column types.
* The decimal and grouping marks, used when reading numbers.
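Two of these options can be sketched with the parsing functions. The input strings are made up, and `%d %B %Y` (day, full month name, four digit year) is a format chosen for this example:

```{r}
library(readr)

# Month names: parse a date written in French
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))

# Decimal and grouping marks
parse_double("1,23", locale = locale(decimal_mark = ","))
parse_number("1.234,56", locale = locale(decimal_mark = ",", grouping_mark = "."))
```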
Readr is designed to be independent of your current locale settings. This is a bit more of a hassle in the short term, but makes it much easier to share your code with others: if your readr code works locally, it will also work for everyone else in the world. The same is not true for base R code, since it often inherits defaults from your system settings. Just because data ingest code works for you doesn't mean that it will work for someone else in another country.
The settings you are most likely to need to change are:
* The names of days and months:
```{r}
locale("fr")
locale("fr", asciify = TRUE)
```
* The character encoding used in the file. If you don't know the encoding
you can use `guess_encoding()`. It's not perfect, but if you have a decent
sample of text, it's likely to be able to figure it out.
Readr converts all strings into UTF-8 as this is safest to work with across
platforms. (It's also what every stringr operation does.)
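A minimal sketch of handling a non-UTF-8 string, assuming a made-up Latin-1 input. `guess_encoding()` depends on the stringi package, so the call is guarded in case it isn't installed, and its output is only a guess with a confidence score:

```{r}
library(readr)

# A made-up Latin-1 string: the byte \xf1 is "ñ" in ISO-8859-1
x <- "El Ni\xf1o"

# Ask readr for candidate encodings (requires stringi)
if (requireNamespace("stringi", quietly = TRUE)) {
  print(guess_encoding(charToRaw(x)))
}

# Tell readr which encoding to assume; the result comes back as UTF-8
fixed <- parse_character(x, locale = locale(encoding = "ISO-8859-1"))
Encoding(fixed)
```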