r4ds/import.Rmd

405 lines
18 KiB
Plaintext
Raw Normal View History

2015-12-12 02:34:20 +08:00
# Data import
2016-07-07 02:59:50 +08:00
## Introduction
2016-01-27 21:33:35 +08:00
2016-07-07 02:59:50 +08:00
Working with existing data is a great way to learn the tools, but you can't apply the tools to your own data unless you can get it into R. In this chapter, we'll focus on the readr package for reading plain-text rectangular files from disk. This only scratches the surface of the ways you can load data into R, but it's the common way to get data, and many of the principles will translate to the other forms of data import.
2016-01-27 21:33:35 +08:00
2016-07-07 02:59:50 +08:00
### Prerequisites
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
In this chapter, you'll learn how to load flat files in R with the readr package:
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
```{r setup}
library(readr)
```
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
## Basics
2015-09-23 02:35:39 +08:00
Most of readr's functions are concerned with turning flat files into data frames:
2016-01-27 21:33:35 +08:00
* `read_csv()` reads comma delimited files, `read_csv2()` reads semi-colon
2015-09-23 02:35:39 +08:00
separated files (common in countries where `,` is used as the decimal place),
`read_tsv()` reads tab delimited files, and `read_delim()` reads in files
2016-07-07 02:59:50 +08:00
with any delimiter.
2015-09-23 02:35:39 +08:00
* `read_fwf()` reads fixed width files. You can specify fields either by their
2016-01-27 21:33:35 +08:00
widths with `fwf_widths()` or their position with `fwf_positions()`.
2015-09-23 02:35:39 +08:00
`read_table()` reads a common variation of fixed width files where columns
are separated by white space.
* `read_log()` reads Apache style logs. (But also check out
2016-01-27 21:33:35 +08:00
[webreadr](https://github.com/Ironholds/webreadr) which is built on top
2015-09-23 02:35:39 +08:00
of `read_log()`, but provides many more helpful tools.)
2016-07-07 02:59:50 +08:00
These functions all have similar syntax: once you've mastered one, you can use the others with ease. For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to apply your knowledge to all the other functions in readr.
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
The first argument to `read_csv()` is the most important: it's the path to the file to read.
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
```{r}
heights <- read_csv("data/heights.csv")
```
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
Readr can automatically decompress files ending in `.zip`, `.gz`, `.bz2`, and `.xz`.
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
You can also supply an inline csv file. This is useful for experimenting and creating reproducible examples:
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
```{r}
read_csv("a,b,c
1,2,3
4,5,6")
```
2016-01-27 21:33:35 +08:00
2016-07-08 00:17:11 +08:00
Notice that `read_csv()` uses the first line of the data for column headings. This is a very common convention. There are two cases where you might want tweak this behaviour:
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
1. Sometimes there are a few lines of metadata at the top of the file. You can
use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop
all lines that start with a comment character.
```{r}
2016-07-08 00:17:11 +08:00
read_csv("The first line of metadata
The second line of metadata
2016-07-07 02:59:50 +08:00
x,y,z
2016-07-08 00:17:11 +08:00
1,2,3", skip = 2)
2016-07-07 02:59:50 +08:00
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
```
1. The data might not have column names. You can use `col_names = FALSE` to
tell `read_csv()` not to treat the first row as headings, and instead
label them sequentially from `X1` to `Xn`:
```{r}
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```
Alternatively you can pass `col_names` a character vector which will be
used as the column names:
```{r}
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
This is all you need to know to read ~50% of csv files that you'll encounter in practice. To read in the rest, you'll need to learn more about how readr parses each individual column, turning a character vector into the most appropriate type.
2016-07-07 02:59:50 +08:00
### Compared to base R
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
If you've used R before, you might wonder why we're not using `read.csv()`. There are a few good reasons to favour readr functions over the base equivalents:
2016-01-27 21:33:35 +08:00
2016-07-08 00:17:11 +08:00
* They are typically much faster (~10x) than their base equivalents.
2016-07-07 02:59:50 +08:00
Long running jobs also have a progress bar, so you can see what's
2016-07-08 00:17:11 +08:00
happening. If you're looking for raw speed, try `data.table::fread()`,
it doesn't fit so tidily into the tidyverse, but it can be quite a bit
faster than readr.
2016-01-27 21:33:35 +08:00
2016-07-08 00:17:11 +08:00
* They produce tibbles, and they don't convert character vectors to factors,
produce row names, or munge the column names.
2016-01-27 21:33:35 +08:00
2016-07-08 00:17:11 +08:00
* They are more reproducible. Base R functions inherit some behaviour from
your operation system, so code that works on your computer might not
work on another computer.
2016-01-27 21:33:35 +08:00
2016-07-08 00:17:11 +08:00
## Parsing a vector
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
Before we get to how readr reads files from disk, we're going to take a little detour to talk about the `parse_*()` functions. These functions all take a character vector and return something more specialised like logical, integer, or date:
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
```{r}
str(parse_logical(c("TRUE", "FALSE", "NA")))
str(parse_integer(c("1", "2", "3")))
str(parse_date(c("2010-01-01", "1979-10-14")))
```
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
These functions are useful in their own right, but are also an important building block for readr. Once you've learned how the individual parsers work in this section, we'll circle back and see how they fit together to parse a complete file in the next section.
Like all functions in the tidyverse, the `parse_*()` functions are uniform: the first argument is a character vector to parse, and the `na` argument specifies which strings should be treated as missing.
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
```{r}
2016-07-08 00:17:11 +08:00
parse_integer(c("1", "231", ".", "456"), na = ".")
2016-07-07 02:59:50 +08:00
```
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
If parsing fails, you'll get a warning, and can use the `problems()` function to get more details. `problems()` returns a tibble, so you can easily explore it using dplyr.
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
```{r}
2016-07-08 00:17:11 +08:00
x <- parse_integer(c("123", "345", "abc", "123.45"))
problems(x)
2015-09-23 21:58:16 +08:00
```
2016-07-08 00:17:11 +08:00
There are eight particularly important parsers:
1. `parse_logical()` and `parse_integer()` parse logicals and integers
respectively. There's basically nothing that can go wrong with them
so I won't describe them here further.
1. `parse_double()` is a strict numeric parser, and `parse_number()`
is a flexible numeric parser. These are more complicated than you might
expect because different parts of the world write numbers in different
ways.
1. `parse_character()` seems so simple that it shouldn't be necessary. But
one complication makes it important: character encodings.
1. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to
parse various date & time specifications. These are the most complicated
because there are so many different ways of writing dates.
The following sections describe the parsers in more detail.
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
### Numbers
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
There are three tricky bits to numbers:
2016-01-27 21:33:35 +08:00
2016-07-07 02:59:50 +08:00
1. People write numbers differently in different parts of the world.
2016-07-08 00:17:11 +08:00
1. Numbers are often surrounded by other characters that provide some
context, like "$1000" or "10%".
2016-01-27 21:33:35 +08:00
2016-07-08 00:17:11 +08:00
1. Numbers often contain "grouping" characters to make them easier to read,
like "1,000,000", and the characters are differ around the world.
To address the first problem, readr has the notion of a "locale", an object which specifies parsing options that differ around the world. For parsing numbers, the most important option is what character you use for the decimal place:
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
```{r}
parse_double("1.23")
parse_double("1,23", locale = locale(decimal_mark = ","))
```
2015-09-23 21:58:16 +08:00
2016-07-08 00:17:11 +08:00
The default locale in readr is US-centric, because R generally is US-centric (i.e. the documentation of base R is written in American English). An alternative approach would be to try and guess the defaults from your operating system. This is hard to do well, but more importantly makes your code fragile: it might work on your computer, but might fail when you email it to a colleague in another country.
2015-09-23 21:58:16 +08:00
2016-07-08 00:17:11 +08:00
`parse_number()` addresses problem two: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
2016-07-07 02:59:50 +08:00
```{r}
parse_number("$100")
parse_number("20%")
2016-07-08 00:17:11 +08:00
parse_number("It cost $123.45")
2015-09-23 21:58:16 +08:00
```
2016-07-08 00:17:11 +08:00
The final problem is addressed with the combination of `parse_number()` the locale: `parse_number()` will also ignore the "grouping mark" used to separate numbers:
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
```{r}
parse_number("$100,000,000")
parse_number("123.456,789", locale = locale(grouping_mark = "."))
```
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
### Character
2016-07-08 00:17:11 +08:00
It seems like `parse_character()` should be really simple - it could just return its input. Unfortunately life isn't so simple, as there are multiple ways to represent the same string. To understand what's going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying binary representation of a string using `charToRaw()`:
```{r}
charToRaw("Hadley")
```
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on. This encoding, from hexadecimal number to character is called ASCII. ASCII does a great job of representing English characters.
Unfortunately you can only represent a maximum of 255 values with a single byte of information, and there are many more characters when you look across languages. That means to represent a character in other languages you need to use multiple bytes of information. The way multiple bytes are used to encode a character is called the "encoding".
In the early days of computing there were many different ways of representing non-English characters which caused a lot of confusion. Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by human's today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing. This is a good default, but will fail for data produced by older systems that don't understand UTF-8. Unfortunately handling
However, you may be attempting to read data that is produced by a system that doesn't understand UTF-8. You can tell you need to do this because when you print the data in R it looks weird. Sometimes you might get complete gibberish, or sometimes just one or two characters might be messed up.
2015-09-23 21:58:16 +08:00
```{r}
2016-07-08 00:17:11 +08:00
x1 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
x2 <- "El Ni\xf1o was particularly bad this year"
x1
x2
2015-09-23 21:58:16 +08:00
```
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
To fix the problem you need to specify the encoding in `parse_character()`:
2016-07-07 02:59:50 +08:00
2016-07-08 00:17:11 +08:00
```{r}
parse_character(x1, locale = locale(encoding = "Shift-JIS"))
parse_character(x2, locale = locale(encoding = "Latin1"))
```
2016-07-07 02:59:50 +08:00
2016-07-08 00:17:11 +08:00
How do you find the correct encoding? If you're lucky, it'll be included somewhere in the data documentation. But that rarely happens so readr provides `guess_encoding()` to help you figure it out. It's not foolproof, and it works better when you have lots of text, but it's a reasonable place to start. Even then you may need to try a couple of different encodings before you get the right once.
2015-09-23 02:35:39 +08:00
2015-09-23 21:58:16 +08:00
```{r}
2016-07-08 00:17:11 +08:00
guess_encoding(charToRaw(x1))
guess_encoding(charToRaw(x2))
2015-09-23 21:58:16 +08:00
```
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
The first argument to `guess_encoding()` can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
2016-07-07 02:59:50 +08:00
### Dates, date times, and times
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
There are three options depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:
2015-09-23 02:35:39 +08:00
* Date times: an [ISO8601](https://en.wikipedia.org/wiki/ISO_8601) date time.
* Date: a year, optional separator, month, optional separator, day.
* Time: an hour, optional colon, hour, optional colon, minute, optional colon,
optional seconds, optional am/pm.
2016-07-08 00:17:11 +08:00
For example:
2015-09-23 02:35:39 +08:00
```{r}
parse_datetime("2010-10-01T2010")
parse_date("2010-10-01")
parse_time("20:10:01")
```
2015-09-23 21:58:16 +08:00
If these defaults don't work for your data you can supply your own date time formats, built up of the following pieces:
2015-09-23 02:35:39 +08:00
2016-07-08 00:17:11 +08:00
* Year: `%Y` (4 digits). `%y` (2 digits); 00-69 -> 2000-2069,
70-99 -> 1970-1999.
2015-09-23 02:35:39 +08:00
* Month: `%m` (2 digits), `%b` (abbreviated name), `%B` (full name).
* Day: `%d` (2 digits), `%e` (optional leading space).
* Hour: `%H`.
* Minutes: `%M`.
* Seconds: `%S` (integer seconds), `%OS` (partial seconds).
2016-01-27 21:33:35 +08:00
* Time zone: `%Z` (as name, e.g. `America/Chicago`), `%z` (as offset from UTC,
e.g. `+0800`). If you're American, note that "EST" is a Canadian time zone
that does not have daylight savings time. It is \emph{not} Eastern Standard
2015-09-23 02:35:39 +08:00
Time!
* AM/PM indicator: `%p`.
2016-01-27 21:33:35 +08:00
* Non-digits: `%.` skips one non-digit character, `%*` skips any number of
2015-09-23 02:35:39 +08:00
non-digits.
The best way to figure out the correct string is to create a few examples in a character vector, and test with one of the parsing functions. For example:
```{r}
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```
2016-07-08 00:17:11 +08:00
If you're using `%b` or `%p`, and you're in a non-English locale, you can set the values with the `lang` argument to `locale()`. readr comes bundled with a bunch: `date_names_langs()`, or create your own with `date_names()`.
2015-09-23 02:35:39 +08:00
2016-07-07 02:59:50 +08:00
```{r}
locale("fr")
locale("fr", asciify = TRUE)
```
2015-09-23 21:58:16 +08:00
2016-07-08 00:17:11 +08:00
## Parsing a file
Now that you've learned how to parse an individual vector, it's time to turn back and explore how readr parses a file. There are three new things that you'll learn about in this section:
2015-09-23 21:58:16 +08:00
2016-07-08 00:17:11 +08:00
1. How readr guesses what type of vector a column is.
1. What happens when things go wrong.
1. How to override the default specification
### Guesser
Readr uses a heuristic to figure out the types of your columns: it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. This is fast, and fairly robust. You can emulate this process with a single vector using `parse_guess()`:
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
```{r}
str(parse_guess("2001-10-10"))
```
2015-09-23 21:58:16 +08:00
2016-07-08 00:17:11 +08:00
* `parse_logical()` detects a column contaning only "F", "T", "FALSE", or
"TRUE"
2016-07-07 02:59:50 +08:00
### Problems object
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
### Heuristic
2016-07-08 00:17:11 +08:00
If readr detects the wrong type of data, you'll get warning messages. Readr prints out the first five, and you can access them all with `problems()`:
2016-07-07 02:59:50 +08:00
EXAMPLE
Typically, you'll see a lot of warnings if readr has guessed the column type incorrectly. This most often occurs when the first 1000 rows are different to the rest of the data. Perhaps there are a lot of missing data there, or maybe your data is mostly numeric but a few rows have characters. Fortunately, it's easy to fix these problems using the `col_type` argument.
(Note that if you have a very large file, you might want to set `n_max` to 10,000 or 100,000. That will speed up iterations while you're finding common problems)
Specifying the `col_type` looks like this:
```{r, eval = FALSE}
read_csv("mypath.csv", col_types = col(
x = col_integer(),
treatment = col_character()
))
```
You can use the following types of columns
* `col_integer()` (i) and `col_double()` (d) specify integer and doubles.
`col_logical()` (l) parses TRUE, T, FALSE and F into a logical vector.
`col_character()` (c) leaves strings as is.
2016-01-27 21:33:35 +08:00
2016-07-07 02:59:50 +08:00
* `col_number()` (n) is a more flexible parser for numbers embedded in other
strings. It will look for the first number in a string, ignoring non-numeric
prefixes and suffixes. It will also ignore the grouping mark specified by
the locale (see below for more details).
* `col_factor()` (f) allows you to load data directly into a factor if you know
what the levels are.
* `col_skip()` (_, -) completely ignores a column.
* `col_date()` (D), `col_datetime()` (T) and `col_time()` (t) parse into dates,
date times, and times as described below.
You might have noticed that each column parser has a one letter abbreviation, which you can use instead of the full function call (assuming you're happy with the default arguments):
```{r, eval = FALSE}
read_csv("mypath.csv", col_types = cols(
x = "i",
treatment = "c"
))
```
(If you just have a few columns, you can supply a single string that gives the type for each column: `i__dc`. See the documentation for more details. It's not as easy to understand as the `cols()` specification, so I'm not going to describe it further here.)
By default, any column not mentioned in `cols` will be guessed. If you'd rather those columns are simply not read in, use `cols_only()`. In that case, you can use `col_guess()` (?) if you want to guess the type of a column and include it to be read.
Each `col_XYZ()` function also has a corresponding `parse_XYZ()` that you can use on a character vector. This makes it easier to explore what each of the parsers does interactively.
### Spec object
## Other functions
2016-01-27 21:33:35 +08:00
2016-07-07 02:59:50 +08:00
### Reading
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
readr also provides a number of functions for reading files off disk directly into character vectors:
* `read_file()` reads an entire file into a character vector of length one.
* `read_lines()` reads a file into a character vector with one element per
line.
These are useful if you have a plain text file with an unusual format. Often you can use `read_lines()` to read into a character vector, and then use the regular expression skills you'll learn in [[strings]] to pull out the pieces that you need.
`read_file_raw()` and `read_lines_raw()` work similarly but return raw vectors, which are useful if you need to work with binary data.
### Converting
`type_convert()` applies the same parsing heuristics to the character columns in a data frame. It's useful if you've loaded data "by hand", and now want to convert character columns to the appropriate type:
```{r}
df <- tibble(x = c("1", "2", "3"), y = c("1.21", "2.32", "4.56"))
df
# Note the column types
type_convert(df)
```
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
Like the `read_*()` functions, you can override the default guesses using the `col_type` argument.
2015-09-23 21:58:16 +08:00
2016-07-07 02:59:50 +08:00
### Writing
2015-09-21 21:41:14 +08:00
2016-07-07 02:59:50 +08:00
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`. These are considerably faster than the base R equvalents, never write rownames, and automatically quote only when needed.
2015-09-21 21:41:14 +08:00
2016-07-07 02:59:50 +08:00
If you want to export a csv file to Excel, use `write_excel_csv()` - this writes a special character (a "byte order mark") at the start of the file which forces Excel to use UTF-8.
2015-09-21 21:41:14 +08:00
2016-07-07 02:59:50 +08:00
## Other types of data
2016-04-02 01:33:51 +08:00
2016-07-07 02:59:50 +08:00
We have worked on a number of packages to make importing data into R as easy as possible. These packages are certainly not perfect, but they are the best place to start because they behave as similar as possible to readr.
2016-04-02 01:33:51 +08:00
2016-07-07 02:59:50 +08:00
Two packages helper
2016-04-02 01:33:51 +08:00
2016-07-07 02:59:50 +08:00
* haven reads files from other SPSS, Stata, and SAS files.
2016-04-02 01:33:51 +08:00
2016-07-07 02:59:50 +08:00
* readxl reads excel files (both `.xls` and `.xlsx`).
2016-04-02 01:33:51 +08:00
2016-07-07 02:59:50 +08:00
There are two common forms of hierarchical data: XML and json. We recommend using xml2 and jsonlite respectively. These packages are performant, safe, and (relatively) easy to use. To work with these effectively in R, you'll need to x
2016-04-02 01:33:51 +08:00
2016-07-07 02:59:50 +08:00
If your data lives in a database, you'll need to use the DBI package. DBI provides a common interface that works with many different types of database. R's support is particularly good for open source databases (e.g. RPostgres, RMySQL, RSQLite, MonetDBLite).