Start writing about readr

2015-09-22 13:35:39 -05:00 · 2015-09-22 13:35:39 -05:00 · d5d52f05c6
parent 445c3dba82
commit d5d52f05c6
2 changed files with 167 additions and 1 deletions
--- a/.travis.yml
+++ b/.travis.yml
@@ -23,7 +23,7 @@ install:
  # Install R packages
  - ./travis-tool.sh r_binary_install knitr png
  - ./travis-tool.sh r_install        ggplot2 dplyr tidyr
-  - ./travis-tool.sh github_package   hadley/bookdown garrettgman/DSR
+  - ./travis-tool.sh github_package   hadley/bookdown garrettgman/DSR hadley/readr

 script: jekyll build

--- a/import.Rmd
+++ b/import.Rmd
@@ -4,6 +4,10 @@ title: Data import
 output: bookdown::html_chapter
 ---

+```{r, include = FALSE}
+library(readr)
+```
+
 # Data import

 ## Overview
@ -15,8 +19,170 @@ You can't apply any of the tools you've applied so far to your own work, unless
 * Data from web APIs with httr.
 * Binary file formats (like excel or sas), with haven and readxl.

+The common link between all these packages is that they all aim to turn your data into a data frame in R, so you can tidy it and then analyse it.
+
 ## Flat files

+There are many ways to read flat files into R. If you've been using R for a while, you might be familiar with `read.csv()`, `read.fwf()` and friends. We're not going to use these base functions. Instead we're going to use `read_csv()`, `read_fwf()`, and friends from the readr package, because:
+
+* These functions are typically much faster (~10x) than the base equivalents.
+  Long-running jobs also have a progress bar, so you can see what's
+  happening. (If you're looking for raw speed, try `data.table::fread()`.
+  It's slightly less flexible than readr, but can be twice as fast.)
+  
+* They have more flexible parsers: they can read in dates, times, currencies,
+  percentages, and more. 
+  
+* They don't do some annoying things, like converting character vectors to
+  factors or munging the column headers to make sure they're valid R
+  variable names.
+
+* They return objects with class `tbl_df`. As you saw in the dplyr chapter,
+  this provides a nicer printing method, so it's easier to work with large
+  datasets.
+
+* They're designed to be as reproducible as possible. This means that you
+  sometimes need to supply a few more arguments when using them the first
+  time, but they'll definitely work on other people's computers. The base R
+  functions take a number of settings from system defaults, which means that
+  code that works on your computer might not work on someone else's.
+
+Make sure you have the readr package (`install.packages("readr")`).
+
+Most of readr's functions are concerned with turning flat files into data frames:
+
+* `read_csv()` reads comma delimited files, `read_csv2()` reads semi-colon
+  separated files (common in countries where `,` is used as the decimal mark),
+  `read_tsv()` reads tab delimited files, and `read_delim()` reads in files
+  with a user supplied delimiter.
+
+* `read_fwf()` reads fixed width files. You can specify fields either by their
+  widths with `fwf_widths()` or their positions with `fwf_positions()`.
+  `read_table()` reads a common variation of fixed width files where columns
+  are separated by white space.
+
+* `read_log()` reads Apache style logs. (But also check out
+  [webreadr](https://github.com/Ironholds/webreadr) which is built on top 
+  of `read_log()`, but provides many more helpful tools.)
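+
+A quick sketch of the delimited readers, using made-up literal data rather than real files (newer versions of readr may ask you to wrap literal data in `I()`):
+
+```{r}
+# Semi-colon separated, with `,` as the decimal mark
+read_csv2("a;b\n1,5;2")
+
+# Tab delimited
+read_tsv("a\tb\n1\t2")
+
+# Any other single-character delimiter
+read_delim("a|b\n1|2", delim = "|")
+```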
+
+readr also provides a number of functions for reading files off disk into simpler data structures:
+
+* `read_file()` reads an entire file into a single string.
+
+* `read_lines()` reads a file into a character vector with one element per line.
+
+These might be useful for other programming tasks.
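+
+As a minimal sketch of the difference between the two (the temporary file and its contents are made up for illustration):
+
+```{r}
+# Write a small two-line file to a temporary location
+tmp <- tempfile(fileext = ".txt")
+writeLines(c("first line", "second line"), tmp)
+
+read_file(tmp)   # a single string containing the whole file
+read_lines(tmp)  # a character vector with one element per line
+```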
+
+As well as reading data from disk, readr also provides tools for working with data frames and character vectors in R:
+
+* `type_convert()` applies the same parsing heuristics to the character columns
+  in a data frame. You can override its choices using `col_types`.
+  
+* `parse_datetime()`, `parse_factor()`, `parse_integer()`, etc. Corresponding
+  to each `col_XYZ()` function is a `parse_XYZ()` function that takes a 
+  character vector and returns a parsed vector. We'll use these in examples
+  so you can see how a single piece works at a time.
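+
+A short sketch of both tools, assuming a small hand-made data frame (the column names `x` and `y` are invented for this example):
+
+```{r}
+# Parse a character vector directly
+parse_integer(c("1", "2", "3"))
+
+# Re-guess the types of the character columns in an existing data frame
+df <- data.frame(x = c("1", "2"), y = c("a", "b"), stringsAsFactors = FALSE)
+str(type_convert(df))
+```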
+
+For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to apply your knowledge to all the other functions in readr.
+
+### Basics
+
+The first two arguments of `read_csv()` are:
+
+* `file`: path (or URL) to the file you want to load. Readr can automatically 
+  decompress files ending in `.zip`, `.gz`, `.bz2`, and `.xz`. This can also
+  be a literal csv file, which is useful for experimenting and creating
+  reproducible examples.
+  
+* `col_names`: column names. There are three options:
+  
+    * `TRUE` (the default), which reads column names from the first row 
+      of the file
+      
+    * `FALSE`, which numbers columns sequentially from `X1` to `Xn`.
+    
+    * A character vector, used as column names. If these don't match up
+      with the columns in the data, you'll get a warning message.
+
+EXAMPLE
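+
+The three `col_names` options can be compared on a literal csv string. This is only a sketch: the data is made up, and newer versions of readr may ask you to wrap literal data in `I()`.
+
+```{r}
+# TRUE (the default): the first row supplies the column names
+read_csv("a,b\n1,2\n3,4")
+
+# FALSE: the first row is treated as data; columns are named X1 and X2
+read_csv("1,2\n3,4", col_names = FALSE)
+
+# A character vector: used as the column names
+read_csv("1,2\n3,4", col_names = c("x", "y"))
+```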
+
+### Column types
+
+Readr uses a heuristic to figure out the types of your columns: it reads the first 1000 rows and guesses the type of each column from those values. This is fast, and fairly robust. If readr detects the wrong type of data, you'll get warning messages:
+
+EXAMPLE
+
+You can fix these by overriding readr's guesses with the `col_types` argument.
+
+(Note that if you have a very large file, you might want to set `n_max` to 10,000 or 100,000. That will speed up iteration while you're finding common problems.)
+
+* `col_integer()` and `col_double()` specify integers and doubles. `col_number()`
+  is a more flexible parser for numbers embedded in other strings. It will
+  look for the first number in a string, ignoring non-numeric prefixes and
+  suffixes. It will also ignore the grouping mark specified by the locale
+  (see below for more details).
+  
+* `col_logical()` parses TRUE, T, FALSE and F into a logical vector.
+  
+* `col_character()` leaves strings as is. `col_factor()` allows you to load
+  data directly into a factor if you know what the levels are.
+  
+* `col_skip()` completely ignores a column.
+
+* `col_date()`, `col_datetime()` and `col_time()` parse into dates, date times,
+  and times as described below.
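+
+As a hedged sketch of how these fit together (the data is invented, and `cols()` assumes a reasonably recent version of readr), you can pass column specifications to `col_types`, and try an individual parser with the matching `parse_XYZ()` function:
+
+```{r}
+# Override the guessed types for a small literal csv
+read_csv(
+  "id,price,ok\n1,$1.50,TRUE\n2,$20.00,F",
+  col_types = cols(
+    id = col_integer(),
+    price = col_number(),
+    ok = col_logical()
+  )
+)
+
+# col_number() corresponds to parse_number()
+parse_number(c("$1,000", "15%"))
+```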
+
+Parsing occurs after leading and trailing whitespace has been removed (unless overridden with `trim_ws = FALSE`) and strings listed in `na` have been replaced with missing values.
+
+#### Datetimes
+
+Readr provides three options depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:
+
+* Date times: an [ISO8601](https://en.wikipedia.org/wiki/ISO_8601) date time.
+* Date: a year, optional separator, month, optional separator, day.
+* Time: an hour, optional colon, minute, optional colon, optional seconds,
+  optional am/pm.
+
+```{r}
+parse_datetime("2010-10-01T2010")
+parse_date("2010-10-01")
+parse_time("20:10:01")
+```
+
+If these don't work for your data (common!) you can supply your own date time formats, built up of the following pieces:
+
+* Year: `%Y` (4 digits). `%y` (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
+
+* Month: `%m` (2 digits), `%b` (abbreviated name), `%B` (full name).
+
+* Day: `%d` (2 digits), `%e` (optional leading space).
+
+* Hour: `%H`.
+
+* Minutes: `%M`.
+
+* Seconds: `%S` (integer seconds), `%OS` (partial seconds).
+
+* Time zone: `%Z` (as name, e.g. `America/Chicago`), `%z` (as offset from UTC,
+  e.g. `+0800`). If you're American, note that "EST" is a Canadian time zone
+  that does not have daylight savings time. It is _not_ Eastern Standard
+  Time!
+
+* AM/PM indicator: `%p`.
+
+* Non-digits: `%.` skips one non-digit character, `%*` skips any number of
+  non-digits.
+
+The best way to figure out the correct string is to create a few examples in a character vector, and test with one of the parsing functions. For example:
+
+```{r}
+parse_date("01/02/15", "%m/%d/%y")
+parse_date("01/02/15", "%d/%m/%y")
+parse_date("01/02/15", "%y/%m/%d")
+```
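+
+The same approach works for date times. A small sketch, using a made-up timestamp:
+
+```{r}
+# Pieces: day, month, 4-digit year, then hour, minutes and seconds
+parse_datetime("22/09/2015 13:35:39", "%d/%m/%Y %H:%M:%S")
+```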
+
+### International data
+
 ## Databases

 ## Web APIs