r4ds/import.Rmd

# Data import

```{r, include = FALSE}
library(dplyr)
library(readr)
```

## Overview

You can't apply any of the tools you've learned so far to your own work, unless you can get your own data into R. In this chapter, you'll learn how to:

* Import flat files (like csv) with readr.
* 
* Cache intermediate results in a fast file format like feather or RDS.

The common link between all these packages is they all aim to take your data and turn it into a data frame in R, so you can tidy it and then analyse it.

## Flat files

There are many ways to read flat files into R. If you've be using R for a while, you might be familiar with `read.csv()`, `read.fwf()` and friends. We're not going to use these base functions. Instead we're going to use `read_csv()`, `read_fwf()`, and friends from the readr package. Because:

* These functions are typically much faster (~10x) than the base equivalents.
  Long running jobs also have a progress bar, so you can see what's
  happening. (If you're looking for raw speed, try `data.table::fread()`,
  it's slightly less flexible than readr, but can be twice as fast.)

* They have more flexible parsers: they can read in dates, times, currencies,
  percentages, and more.

* They do not do some annoying things that base R functions do, like converting 
character vectors to factors, munging the column headers to make sure they're 
valid R variable names, and using row names.

* They return objects with class `tbl_df`. As you saw in the dplyr chapter,
  this provides a nicer printing method, so it's easier to work with large
  datasets.

* They're designed to be as reproducible as possible - this means that you
  sometimes need to supply a few more arguments when using them the first
  time, but they'll definitely work on other peoples computers. The base R
  functions take a number of settings from system defaults, which means that
  code that works on your computer might not work on someone else's.

Make sure you have the readr package (`install.packages("readr")`).

Most of readr's functions are concerned with turning flat files into data frames:

* `read_csv()` reads comma delimited files, `read_csv2()` reads semi-colon
  separated files (common in countries where `,` is used as the decimal place),
  `read_tsv()` reads tab delimited files, and `read_delim()` reads in files
  with a user supplied delimiter.

* `read_fwf()` reads fixed width files. You can specify fields either by their
  widths with `fwf_widths()` or their position with `fwf_positions()`.
  `read_table()` reads a common variation of fixed width files where columns
  are separated by white space.

* `read_log()` reads Apache style logs. (But also check out
  [webreadr](https://github.com/Ironholds/webreadr) which is built on top
  of `read_log()`, but provides many more helpful tools.)

readr also provides a number of functions for reading files off disk into simpler data structures:

* `read_file()` reads an entire file into a single string.

* `read_lines()` reads a file into a character vector with one element per line.

These might be useful for other programming tasks.

readr also provides tools for working with data frames and character vectors in R:

* `type_convert()` applies the same parsing heuristics to the character columns
  in a data frame. You can override its choices using `col_types`.

For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to apply your knowledge to all the other functions in readr.

### Basics

EXAMPLE

The first two arguments of `read_csv()` are:

* `file`: path (or URL) to the file you want to load. Readr can automatically
  decompress files ending in `.zip`, `.gz`, `.bz2`, and `.xz`. This can also
  be a literal csv file, which is useful for experimenting and creating
  reproducible examples.

* `col_names`: column names. There are three options:

    * `TRUE` (the default), which reads column names from the first row
      of the file

    * `FALSE` numbers columns sequentially from `X1` to `Xn`.

    * A character vector, used as column names. If these don't match up
      with the columns in the data, you'll get a warning message.

### Column types

Readr uses a heuristic to figure out the types of your columns: it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. This is fast, and fairly robust. If readr detects the wrong type of data, you'll get warning messages. Readr prints out the first five, and you can access them all with `problems()`:

EXAMPLE

Typically, you'll see a lot of warnings if readr has guessed the column type incorrectly. This most often occurs when the first 1000 rows are different to the rest of the data. Perhaps there are a lot of missing data there, or maybe your data is mostly numeric but a few rows have characters. Fortunately, it's easy to fix these problems using the `col_type` argument.

(Note that if you have a very large file, you might want to set `n_max` to 10,000 or 100,000. That will speed up iterations while you're finding common problems)

Specifying the `col_type` looks like this:

```{r, eval = FALSE}
read_csv("mypath.csv", col_types = col(
  x = col_integer(),
  treatment = col_character()
))
```

You can use the following types of columns

* `col_integer()` (i) and `col_double()` (d) specify integer and doubles.
  `col_logical()` (l) parses TRUE, T, FALSE and F into a logical vector.
  `col_character()` (c) leaves strings as is.

* `col_number()` (n) is a more flexible parser for numbers embedded in other
  strings. It will look for the first number in a string, ignoring non-numeric
  prefixes and suffixes. It will also ignore the grouping mark specified by
  the locale (see below for more details).

* `col_factor()` (f) allows you to load data directly into a factor if you know
  what the levels are.

* `col_skip()` (_, -) completely ignores a column.

* `col_date()` (D), `col_datetime()` (T) and `col_time()` (t) parse into dates,
  date times, and times as described below.

You might have noticed that each column parser has a one letter abbreviation, which you can use instead of the full function call (assuming you're happy with the default arguments):

```{r, eval = FALSE}
read_csv("mypath.csv", col_types = cols(
  x = "i",
  treatment = "c"
))
```

(If you just have a few columns, you can supply a single string that gives the type for each column: `i__dc`. See the documentation for more details. It's not as easy to understand as the `cols()` specification, so I'm not going to describe it further here.)

By default, any column not mentioned in `cols` will be guessed. If you'd rather those columns are simply not read in, use `cols_only()`. In that case, you can use `col_guess()` (?) if you want to guess the type of a column and include it to be read.

Each `col_XYZ()` function also has a corresponding `parse_XYZ()` that you can use on a character vector. This makes it easier to explore what each of the parsers does interactively.

```{r}
parse_integer(c("1", "2", "3"))
parse_logical(c("TRUE", "FALSE", "NA"))
parse_number(c("$1000", "20%", "3,000"))
parse_number(c("$1000", "20%", "3,000"))
```

Parsing occurs after leading and trailing whitespace has been removed (if not overridden with `trim_ws = FALSE`) and missing values listed in `na` have been removed:

```{r}
parse_logical(c("TRUE ", " ."), na = ".")
```

#### Datetimes

Readr provides three options depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:

* Date times: an [ISO8601](https://en.wikipedia.org/wiki/ISO_8601) date time.
* Date: a year, optional separator, month, optional separator, day.
* Time: an hour, optional colon, hour, optional colon, minute, optional colon,
  optional seconds, optional am/pm.

```{r}
parse_datetime("2010-10-01T2010")
parse_date("2010-10-01")
parse_time("20:10:01")
```

If these defaults don't work for your data you can supply your own date time formats, built up of the following pieces:

* Year: `%Y` (4 digits). `%y` (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

* Month: `%m` (2 digits), `%b` (abbreviated name), `%B` (full name).

* Day: `%d` (2 digits), `%e` (optional leading space).

* Hour: `%H`.

* Minutes: `%M`.

* Seconds: `%S` (integer seconds), `%OS` (partial seconds).

* Time zone: `%Z` (as name, e.g. `America/Chicago`), `%z` (as offset from UTC,
  e.g. `+0800`). If you're American, note that "EST" is a Canadian time zone
  that does not have daylight savings time. It is \emph{not} Eastern Standard
  Time!

* AM/PM indicator: `%p`.

* Non-digits: `%.` skips one non-digit character, `%*` skips any number of
  non-digits.

The best way to figure out the correct string is to create a few examples in a character vector, and test with one of the parsing functions. For example:

```{r}
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```

Then when you read in the data with `read_csv()` you can easily translate to the `col_date()` format.

### International data

The goal of readr's locales is to encapsulate the common options that vary between languages and different regions of the world. This includes:

* Names of months and days, used when parsing dates.
* The default time zones, used when parsing date times.
* The character encoding, used when reading non-ASCII strings.
* Default date and time formats, used when guessing column types.
* The decimal and grouping marks, used when reading numbers.

Readr is designed to be independent of your current locale settings. This makes a bit more hassle in the short term, but makes it much much easier to share your code with others: if your readr code works locally, it will also work for everyone else in the world. The same is not true for base R code, since it often inherits defaults from your system settings. Just because data ingest code works for you doesn't mean that it will work for someone else in another country.

The settings you are most like to need to change are:

*   The names of days and months:

    ```{r}
    locale("fr")
    locale("fr", asciify = TRUE)
    ```

*   The character encoding used in the file.  If you don't know the encoding
    you can use `guess_encoding()`. It's not perfect, but if you have a decent
    sample of text, it's likely to be able to figure it out.

    Readr converts all strings into UTF-8 as this is safest to work with across
    platforms. (It's also what every stringr operation does.)

### Exercises

* Parse these dates (incl. non-English examples).
* Parse these example files.
* Parse this fixed width file.

## Other file formats

* Excel: readxl
* SPSS: haven
* Stata: haven
* SAS: haven

Databases. All powered by the DBI package which provides a common interface.

* RPostgres
* RMySQL
* RSQLite
* Avoid JDBC un

Hierarchical:

* XML: xml2
* JSON: jsonlite 


## Binary file formats

Feather.
RDS.
Make sure first element is heading 2015-12-12 02:34:20 +08:00			`# Data import`

Start writing about readr 2015-09-23 02:35:39 +08:00			```{r, include = FALSE}
More on column types 2015-09-23 21:58:16 +08:00			`library(dplyr)`
Start writing about readr 2015-09-23 02:35:39 +08:00			`library(readr)`
			```

Rough notes for import & transform 2015-09-21 21:41:14 +08:00			`## Overview`

Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			`You can't apply any of the tools you've learned so far to your own work, unless you can get your own data into R. In this chapter, you'll learn how to:`
Rough notes for import & transform 2015-09-21 21:41:14 +08:00
Brain dump of some import ideas 2016-04-02 01:33:51 +08:00			`* Import flat files (like csv) with readr.`
			`*`
			`* Cache intermediate results in a fast file format like feather or RDS.`
Rough notes for import & transform 2015-09-21 21:41:14 +08:00
Start writing about readr 2015-09-23 02:35:39 +08:00			`The common link between all these packages is they all aim to take your data and turn it into a data frame in R, so you can tidy it and then analyse it.`

Rough notes for import & transform 2015-09-21 21:41:14 +08:00			`## Flat files`

Start writing about readr 2015-09-23 02:35:39 +08:00			There are many ways to read flat files into R. If you've be using R for a while, you might be familiar with `read.csv()`, `read.fwf()` and friends. We're not going to use these base functions. Instead we're going to use `read_csv()`, `read_fwf()`, and friends from the readr package. Because:

			`* These functions are typically much faster (~10x) than the base equivalents.`
Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			`Long running jobs also have a progress bar, so you can see what's`
Update import.Rmd typos 2016-01-27 21:33:35 +08:00			happening. (If you're looking for raw speed, try `data.table::fread()`,
Start writing about readr 2015-09-23 02:35:39 +08:00			`it's slightly less flexible than readr, but can be twice as fast.)`
Update import.Rmd typos 2016-01-27 21:33:35 +08:00
Start writing about readr 2015-09-23 02:35:39 +08:00			`* They have more flexible parsers: they can read in dates, times, currencies,`
Update import.Rmd typos 2016-01-27 21:33:35 +08:00			`percentages, and more.`

Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			`* They do not do some annoying things that base R functions do, like converting`
			`character vectors to factors, munging the column headers to make sure they're`
			`valid R variable names, and using row names.`
Start writing about readr 2015-09-23 02:35:39 +08:00
			* They return objects with class `tbl_df`. As you saw in the dplyr chapter,
			`this provides a nicer printing method, so it's easier to work with large`
			`datasets.`

			`* They're designed to be as reproducible as possible - this means that you`
			`sometimes need to supply a few more arguments when using them the first`
			`time, but they'll definitely work on other peoples computers. The base R`
			`functions take a number of settings from system defaults, which means that`
Update import.Rmd typos 2016-01-27 21:33:35 +08:00			`code that works on your computer might not work on someone else's.`
Start writing about readr 2015-09-23 02:35:39 +08:00
			Make sure you have the readr package (`install.packages("readr")`).

			`Most of readr's functions are concerned with turning flat files into data frames:`

Update import.Rmd typos 2016-01-27 21:33:35 +08:00			* `read_csv()` reads comma delimited files, `read_csv2()` reads semi-colon
Start writing about readr 2015-09-23 02:35:39 +08:00			separated files (common in countries where `,` is used as the decimal place),
			`read_tsv()` reads tab delimited files, and `read_delim()` reads in files
			`with a user supplied delimiter.`

			* `read_fwf()` reads fixed width files. You can specify fields either by their
Update import.Rmd typos 2016-01-27 21:33:35 +08:00			widths with `fwf_widths()` or their position with `fwf_positions()`.
Start writing about readr 2015-09-23 02:35:39 +08:00			`read_table()` reads a common variation of fixed width files where columns
			`are separated by white space.`

			* `read_log()` reads Apache style logs. (But also check out
Update import.Rmd typos 2016-01-27 21:33:35 +08:00			`[webreadr](https://github.com/Ironholds/webreadr) which is built on top`
Start writing about readr 2015-09-23 02:35:39 +08:00			of `read_log()`, but provides many more helpful tools.)

			`readr also provides a number of functions for reading files off disk into simpler data structures:`

			* `read_file()` reads an entire file into a single string.

			* `read_lines()` reads a file into a character vector with one element per line.

			`These might be useful for other programming tasks.`

Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			`readr also provides tools for working with data frames and character vectors in R:`
Start writing about readr 2015-09-23 02:35:39 +08:00
			* `type_convert()` applies the same parsing heuristics to the character columns
			in a data frame. You can override its choices using `col_types`.
Update import.Rmd typos 2016-01-27 21:33:35 +08:00
Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to apply your knowledge to all the other functions in readr.
Start writing about readr 2015-09-23 02:35:39 +08:00
			`### Basics`

Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			`EXAMPLE`

Start writing about readr 2015-09-23 02:35:39 +08:00			The first two arguments of `read_csv()` are:

Update import.Rmd typos 2016-01-27 21:33:35 +08:00			* `file`: path (or URL) to the file you want to load. Readr can automatically
Start writing about readr 2015-09-23 02:35:39 +08:00			decompress files ending in `.zip`, `.gz`, `.bz2`, and `.xz`. This can also
			`be a literal csv file, which is useful for experimenting and creating`
			`reproducible examples.`
Update import.Rmd typos 2016-01-27 21:33:35 +08:00
Start writing about readr 2015-09-23 02:35:39 +08:00			* `col_names`: column names. There are three options:
Update import.Rmd typos 2016-01-27 21:33:35 +08:00
			* `TRUE` (the default), which reads column names from the first row
Start writing about readr 2015-09-23 02:35:39 +08:00			`of the file`
Update import.Rmd typos 2016-01-27 21:33:35 +08:00
			* `FALSE` numbers columns sequentially from `X1` to `Xn`.

Start writing about readr 2015-09-23 02:35:39 +08:00			`* A character vector, used as column names. If these don't match up`
			`with the columns in the data, you'll get a warning message.`

			`### Column types`

More on column types 2015-09-23 21:58:16 +08:00			Readr uses a heuristic to figure out the types of your columns: it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. This is fast, and fairly robust. If readr detects the wrong type of data, you'll get warning messages. Readr prints out the first five, and you can access them all with `problems()`:
Start writing about readr 2015-09-23 02:35:39 +08:00
			`EXAMPLE`

More on column types 2015-09-23 21:58:16 +08:00			Typically, you'll see a lot of warnings if readr has guessed the column type incorrectly. This most often occurs when the first 1000 rows are different to the rest of the data. Perhaps there are a lot of missing data there, or maybe your data is mostly numeric but a few rows have characters. Fortunately, it's easy to fix these problems using the `col_type` argument.
Start writing about readr 2015-09-23 02:35:39 +08:00
Update import.Rmd typos 2016-01-27 21:33:35 +08:00			(Note that if you have a very large file, you might want to set `n_max` to 10,000 or 100,000. That will speed up iterations while you're finding common problems)
Start writing about readr 2015-09-23 02:35:39 +08:00
More on column types 2015-09-23 21:58:16 +08:00			Specifying the `col_type` looks like this:

			```{r, eval = FALSE}
			`read_csv("mypath.csv", col_types = col(`
			`x = col_integer(),`
			`treatment = col_character()`
			`))`
			```

			`You can use the following types of columns`

Update import.Rmd typos 2016-01-27 21:33:35 +08:00			* `col_integer()` (i) and `col_double()` (d) specify integer and doubles.
More on column types 2015-09-23 21:58:16 +08:00			`col_logical()` (l) parses TRUE, T, FALSE and F into a logical vector.
Update import.Rmd typos 2016-01-27 21:33:35 +08:00			`col_character()` (c) leaves strings as is.
More on column types 2015-09-23 21:58:16 +08:00
Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			* `col_number()` (n) is a more flexible parser for numbers embedded in other
Update import.Rmd typos 2016-01-27 21:33:35 +08:00			`strings. It will look for the first number in a string, ignoring non-numeric`
			`prefixes and suffixes. It will also ignore the grouping mark specified by`
More on column types 2015-09-23 21:58:16 +08:00			`the locale (see below for more details).`
Update import.Rmd typos 2016-01-27 21:33:35 +08:00
			* `col_factor()` (f) allows you to load data directly into a factor if you know
More on column types 2015-09-23 21:58:16 +08:00			`what the levels are.`
Update import.Rmd typos 2016-01-27 21:33:35 +08:00
More on column types 2015-09-23 21:58:16 +08:00			* `col_skip()` (_, -) completely ignores a column.

Update import.Rmd typos 2016-01-27 21:33:35 +08:00			* `col_date()` (D), `col_datetime()` (T) and `col_time()` (t) parse into dates,
More on column types 2015-09-23 21:58:16 +08:00			`date times, and times as described below.`

Update import.Rmd typos 2016-01-27 21:33:35 +08:00			`You might have noticed that each column parser has a one letter abbreviation, which you can use instead of the full function call (assuming you're happy with the default arguments):`
More on column types 2015-09-23 21:58:16 +08:00
			```{r, eval = FALSE}
			`read_csv("mypath.csv", col_types = cols(`
			`x = "i",`
			`treatment = "c"`
			`))`
			```

Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			(If you just have a few columns, you can supply a single string that gives the type for each column: `i__dc`. See the documentation for more details. It's not as easy to understand as the `cols()` specification, so I'm not going to describe it further here.)
More on column types 2015-09-23 21:58:16 +08:00
Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			By default, any column not mentioned in `cols` will be guessed. If you'd rather those columns are simply not read in, use `cols_only()`. In that case, you can use `col_guess()` (?) if you want to guess the type of a column and include it to be read.
More on column types 2015-09-23 21:58:16 +08:00
			Each `col_XYZ()` function also has a corresponding `parse_XYZ()` that you can use on a character vector. This makes it easier to explore what each of the parsers does interactively.

			```{r}
			`parse_integer(c("1", "2", "3"))`
			`parse_logical(c("TRUE", "FALSE", "NA"))`
			`parse_number(c("$1000", "20%", "3,000"))`
			`parse_number(c("$1000", "20%", "3,000"))`
			```
Start writing about readr 2015-09-23 02:35:39 +08:00
More on column types 2015-09-23 21:58:16 +08:00			Parsing occurs after leading and trailing whitespace has been removed (if not overridden with `trim_ws = FALSE`) and missing values listed in `na` have been removed:
Start writing about readr 2015-09-23 02:35:39 +08:00
More on column types 2015-09-23 21:58:16 +08:00			```{r}
			`parse_logical(c("TRUE ", " ."), na = ".")`
			```
Start writing about readr 2015-09-23 02:35:39 +08:00
			`#### Datetimes`

Small edits to import.Rmd (typos and mistakes) 2016-04-07 02:24:44 +08:00			`Readr provides three options depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:`
Start writing about readr 2015-09-23 02:35:39 +08:00
			`* Date times: an [ISO8601](https://en.wikipedia.org/wiki/ISO_8601) date time.`
			`* Date: a year, optional separator, month, optional separator, day.`
			`* Time: an hour, optional colon, hour, optional colon, minute, optional colon,`
			`optional seconds, optional am/pm.`

			```{r}
			`parse_datetime("2010-10-01T2010")`
			`parse_date("2010-10-01")`
			`parse_time("20:10:01")`
			```

More on column types 2015-09-23 21:58:16 +08:00			`If these defaults don't work for your data you can supply your own date time formats, built up of the following pieces:`
Start writing about readr 2015-09-23 02:35:39 +08:00
			* Year: `%Y` (4 digits). `%y` (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

			* Month: `%m` (2 digits), `%b` (abbreviated name), `%B` (full name).

			* Day: `%d` (2 digits), `%e` (optional leading space).

			* Hour: `%H`.

			* Minutes: `%M`.

			* Seconds: `%S` (integer seconds), `%OS` (partial seconds).

Update import.Rmd typos 2016-01-27 21:33:35 +08:00			* Time zone: `%Z` (as name, e.g. `America/Chicago`), `%z` (as offset from UTC,
			e.g. `+0800`). If you're American, note that "EST" is a Canadian time zone
			`that does not have daylight savings time. It is \emph{not} Eastern Standard`
Start writing about readr 2015-09-23 02:35:39 +08:00			`Time!`

			* AM/PM indicator: `%p`.

Update import.Rmd typos 2016-01-27 21:33:35 +08:00			* Non-digits: `%.` skips one non-digit character, `%*` skips any number of
Start writing about readr 2015-09-23 02:35:39 +08:00			`non-digits.`

			`The best way to figure out the correct string is to create a few examples in a character vector, and test with one of the parsing functions. For example:`

			```{r}
			`parse_date("01/02/15", "%m/%d/%y")`
			`parse_date("01/02/15", "%d/%m/%y")`
			`parse_date("01/02/15", "%y/%m/%d")`
			```

More on column types 2015-09-23 21:58:16 +08:00			Then when you read in the data with `read_csv()` you can easily translate to the `col_date()` format.

Start writing about readr 2015-09-23 02:35:39 +08:00			`### International data`

More on column types 2015-09-23 21:58:16 +08:00			`The goal of readr's locales is to encapsulate the common options that vary between languages and different regions of the world. This includes:`

			`* Names of months and days, used when parsing dates.`
			`* The default time zones, used when parsing date times.`
			`* The character encoding, used when reading non-ASCII strings.`
			`* Default date and time formats, used when guessing column types.`
			`* The decimal and grouping marks, used when reading numbers.`

			`Readr is designed to be independent of your current locale settings. This makes a bit more hassle in the short term, but makes it much much easier to share your code with others: if your readr code works locally, it will also work for everyone else in the world. The same is not true for base R code, since it often inherits defaults from your system settings. Just because data ingest code works for you doesn't mean that it will work for someone else in another country.`

			`The settings you are most like to need to change are:`

			`* The names of days and months:`

			```{r}
			`locale("fr")`
			`locale("fr", asciify = TRUE)`
			```
Update import.Rmd typos 2016-01-27 21:33:35 +08:00
More on column types 2015-09-23 21:58:16 +08:00			`* The character encoding used in the file. If you don't know the encoding`
			you can use `guess_encoding()`. It's not perfect, but if you have a decent
			`sample of text, it's likely to be able to figure it out.`
Update import.Rmd typos 2016-01-27 21:33:35 +08:00
More on column types 2015-09-23 21:58:16 +08:00			`Readr converts all strings into UTF-8 as this is safest to work with across`
			`platforms. (It's also what every stringr operation does.)`

			`### Exercises`

			`* Parse these dates (incl. non-English examples).`
			`* Parse these example files.`
			`* Parse this fixed width file.`

Brain dump of some import ideas 2016-04-02 01:33:51 +08:00			`## Other file formats`
Rough notes for import & transform 2015-09-21 21:41:14 +08:00
Brain dump of some import ideas 2016-04-02 01:33:51 +08:00			`* Excel: readxl`
			`* SPSS: haven`
			`* Stata: haven`
			`* SAS: haven`
Rough notes for import & transform 2015-09-21 21:41:14 +08:00
Brain dump of some import ideas 2016-04-02 01:33:51 +08:00			`Databases. All powered by the DBI package which provides a common interface.`
Rough notes for import & transform 2015-09-21 21:41:14 +08:00
Brain dump of some import ideas 2016-04-02 01:33:51 +08:00			`* RPostgres`
			`* RMySQL`
			`* RSQLite`
			`* Avoid JDBC un`

			`Hierarchical:`

			`* XML: xml2`
			`* JSON: jsonlite`



			`## Binary file formats`

			`Feather.`
			`RDS.`