Data import proofing

hadley 2016-08-12 08:09:18 -05:00
parent 2cb57b34ff
commit 6da4da4a54
1 changed files with 47 additions and 37 deletions

@ -2,13 +2,11 @@
## Introduction
Working with existing data is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you'll learn how to use the readr package for reading plain-text rectangular files into R.
This chapter will only scratch surface of data import, but many of the principles will translate to the other forms of data import. The chapter concludes with a few pointers to packages that you might find useful for loading other types of data.
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you'll learn how to read plain-text rectangular files into R. Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data. We'll finish with a few pointers to packages that are useful for other types of data.
### Prerequisites
In this chapter, you'll learn how to load flat files in R with the readr package:
In this chapter, you'll learn how to load flat files in R with the __readr__ package:
```{r setup}
library(readr)
@ -30,11 +28,11 @@ Most of readr's functions are concerned with turning flat files into data frames
* `read_log()` reads Apache style log files. (But also check out
[webreadr](https://github.com/Ironholds/webreadr) which is built on top
of `read_log()`, but provides many more helpful tools.)
of `read_log()` and provides many more helpful tools.)
These functions all have similar syntax: once you've mastered one, you can use the others with ease. For the rest of this chapter we'll focus on `read_csv()`. Once you understand `read_csv()`, it will be straightforward to apply your knowledge to all the other functions in readr.
These functions all have similar syntax: once you've mastered one, you can use the others with ease. For the rest of this chapter we'll focus on `read_csv()`. Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
The first argument to `read_csv()` is the most important: it's the path to the file to read:
The first argument to `read_csv()` is the most important: it's the path to the file to read.
```{r, message = TRUE}
heights <- read_csv("data/heights.csv")
@ -42,7 +40,7 @@ heights <- read_csv("data/heights.csv")
When you run `read_csv()` it prints out a column specification that gives the name and type of each column. That's an important part of readr, which we'll come back to in [parsing a file].
You can also supply an inline csv file. This is useful for experimenting and creating reproducible examples:
You can also supply an inline csv file. This is useful for experimenting with readr and for creating reproducible examples to share with others:
```{r}
read_csv("a,b,c
@ -54,7 +52,7 @@ In both cases `read_csv()` uses the first line of the data for the column names,
1. Sometimes there are a few lines of metadata at the top of the file. You can
use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop
all lines that start with a comment character.
all lines that start with (e.g.) `#`.
```{r}
read_csv("The first line of metadata
@ -75,6 +73,9 @@ In both cases `read_csv()` uses the first line of the data for the column names,
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more
about it and other types of string escape in [string basics].)
Alternatively you can pass `col_names` a character vector which will be
used as the column names:
@ -88,7 +89,7 @@ Another option that commonly needs tweaking is `na`: this specifies the value (o
read_csv("a,b,c\n1,2,.", na = ".")
```
This is all you need to know to read ~75% of csv files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`. To read in more challenging files, you'll need to learn more about how readr parses each column, turning strings into the most appropriate type. That's up next.
This is all you need to know to read ~75% of csv files that you'll encounter in practice. You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`. To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
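As a rough sketch of how the arguments covered so far carry over between functions, the following reads the same small table once as csv and once as tab separated text. The inline data is made up purely for illustration; only `skip` and `na` from the discussion above are used.

```{r}
library(readr)

# Hypothetical inline csv: skip = 1 drops the metadata line,
# and "." is treated as a missing value
csv <- read_csv("some metadata\na,b,c\n1,2,.", skip = 1, na = ".")

# The same `na` argument works unchanged for tab separated data
tsv <- read_tsv("a\tb\tc\n1\t2\t.", na = ".")
```

Both calls should produce a one-row data frame with columns `a`, `b`, and `c`, where `c` is missing.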
### Compared to base R
@ -115,8 +116,7 @@ If you've used R before, you might wonder why we're not using `read.csv()`. Ther
1. Apart from `file`, `skip`, and `comment`, what other arguments do
`read_csv()` and `read_tsv()` have in common?
1. What is the most important argument to `read_fwf()` that we haven't already
discussed?
1. What are the most important arguments to `read_fwf()`?
1. Sometimes strings in a csv file contain commas. To prevent them from
causing problems they need to be surrounded by a quoting character, like
@ -129,7 +129,7 @@ If you've used R before, you might wonder why we're not using `read.csv()`. Ther
"x,y\n1,'a,b'"
```
1. Identify what is wrong with each of the following inline csvs.
1. Identify what is wrong with each of the following inline csv files.
What happens when you run the code?
```{r, eval = FALSE}
@ -170,7 +170,7 @@ And the failures will be missing in the output:
x
```
If there are many parsing failures, you'll need to use `problems()` to get the complete set. This returns a tibble which you can then manipulate with dplyr.
If there are many parsing failures, you'll need to use `problems()` to get the complete set. This returns a tibble, which you can then manipulate with dplyr.
```{r}
problems(x)
@ -194,7 +194,7 @@ Using parsers is mostly a matter of understanding what's available and how they
parse various date & time specifications. These are the most complicated
because there are so many different ways of writing dates.
The following sections describe the parsers in more detail.
The following sections describe these parsers in more detail.
### Numbers
@ -208,16 +208,16 @@ It seems like it should be straightforward to parse a number, but three problems
context, like "$1000" or "10%".
1. Numbers often contain "grouping" characters to make them easier to read,
like "1,000,000", and these grouping characters around the world.
like "1,000,000", and these grouping characters vary around the world.
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of `.` by creating a new locale:
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place. When parsing numbers, the most important option is the character you use for the decimal mark. You can override the default value of `.` by creating a new locale and setting the `decimal_mark` argument:
```{r}
parse_double("1.23")
parse_double("1,23", locale = locale(decimal_mark = ","))
```
readr's default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English). An alternative approach would be to try and guess the defaults from your operating system. This is hard to do well, but, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
readr's default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English). An alternative approach would be to try and guess the defaults from your operating system. This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
`parse_number()` addresses the second problem: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
@ -230,16 +230,19 @@ parse_number("It cost $123.45")
The final problem is addressed by the combination of `parse_number()` and the locale as `parse_number()` will ignore the "grouping mark":
```{r}
# Used in America
parse_number("$123,456,789")
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
```
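The two locale options can also be combined. As a minimal sketch, assuming a made-up price written in the style common in much of continental Europe (`.` for grouping, `,` for the decimal mark):

```{r}
library(readr)

# A locale with both marks set explicitly (the name `eu` is just for illustration)
eu <- locale(decimal_mark = ",", grouping_mark = ".")

# parse_number() drops the currency text and the grouping marks
price <- parse_number("EUR 1.234,56", locale = eu)

# parse_double() is stricter: the input must be a bare number
exact <- parse_double("1234,56", locale = eu)
```

Setting both arguments explicitly avoids relying on how `locale()` picks a default for one mark when you only supply the other.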
### Character
It seems like `parse_character()` should be really simple - it could just return its input. Unfortunately life isn't so simple, as there are multiple ways to represent the same string. To understand what's going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using `charToRaw()`:
It seems like `parse_character()` should be really simple --- it could just return its input. Unfortunately life isn't so simple, as there are multiple ways to represent the same string. To understand what's going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using `charToRaw()`:
```{r}
charToRaw("Hadley")
@ -339,7 +342,8 @@ Time
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`). Beware abbreviations:
if you're American, note that "EST" is a Canadian time zone that does not
have daylight savings time. It is \emph{not} Eastern Standard Time!
have daylight savings time. It is \emph{not} Eastern Standard Time! We'll
come back to this in [time zones].
: `%z` (as offset from UTC, e.g. `+0800`).
Non-digits
@ -364,6 +368,12 @@ parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
1. What are the most important arguments to `locale()`?
1. What happens if you try and set `decimal_mark` and `grouping_mark`
to the same character? What happens to the default value of
`grouping_mark` when you set `decimal_mark` to ","? What happens
to the default value of `decimal_mark` when you set the `grouping_mark`
to "."?
1. I didn't discuss the `date_format` and `time_format` options to
`locale()`. What do they do? Construct an example that shows when
they might be useful.
@ -496,14 +506,14 @@ tail(challenge)
Every `parse_xyz()` function has a corresponding `col_xyz()` function. You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
I highly recommend always supplying the `col_types` argument, building from printout provided by readr. This ensures that you have a consistent, reproducible, data import script. If you rely on the default guesses and your data changes, readr will continue to read it in. If you want to be really strict, use `stop_for_problems()`: that function throws an error and stops your script if there are any parsing problems.
I highly recommend always supplying `col_types`, building up from the print-out provided by readr. This ensures that you have a consistent and reproducible data import script. If you rely on the default guesses and your data changes, readr will continue to read it in. If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
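A strict import along these lines might look like the sketch below. The two-column inline file and its column names are hypothetical; in practice you would copy the specification from the print-out readr gives you and adjust it.

```{r}
library(readr)

# Hypothetical two-column file, supplied inline for illustration
csv <- "x,y\n1,2015-01-16\n2,2018-05-18"

df <- read_csv(
  csv,
  col_types = cols(
    x = col_integer(),
    y = col_date()
  )
)

# Throws an error if any value failed to parse; otherwise continues quietly
stop_for_problems(df)
```

With the column types written out, a change in the incoming data shows up as a parsing problem rather than as a silently different column type.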
### Other strategies
There are a few other general strategies to help you parse files:
* In this case we just got unlucky, and if we'd looked at just
a few more rows, we could have correctly parsed in one shot:
* In the previous example, we just got unlucky: if we look at just
one more row than the default, we can correctly parse in one shot:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
@ -529,13 +539,14 @@ There are a few other general strategies to help you parse files:
y = c("1.21", "2.32", "4.56")
)
df
# Note the column types
type_convert(df)
```
* If you're reading a very large file, you might want to set `n_max` to
a smallish number like 10,000 or 100,000. That will speed up iteration
while you eliminate common problems.
a smallish number like 10,000 or 100,000. That will accelerate your
iterations while you eliminate common problems.
* If you're having major parsing problems, sometimes it's easier
to just read into a character vector of lines with `read_lines()`,
@ -547,14 +558,13 @@ There are a few other general strategies to help you parse files:
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`. Both functions increase the chances of the output file being read back in correctly by:
* Always encoding strings in UTF-8. If you want to export a csv file to
Excel, use `write_excel_csv()` - this writes a special character
(a "byte order mark") at the start of the file which tells Excel that
you're using the UTF-8 encoding.
* Always encoding strings in UTF-8.
* Saving dates and date-times in ISO8601 format so they are easily
parsed elsewhere.
If you want to export a csv file to Excel, use `write_excel_csv()` --- this writes a special character (a "byte order mark") at the start of the file which tells Excel that you're using the UTF-8 encoding.
The most important arguments are `x` (the data frame to save), and `path` (the location to save it). You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
```{r, eval = FALSE}
@ -599,7 +609,7 @@ This makes csvs a little unreliable for caching interim results---you need to re
#> # ... with 1,994 more rows
```
Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you'll learn about in [many models]), which feather currently does not.
Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you'll learn about in [many models]); feather currently does not.
```{r, include = FALSE}
file.remove("challenge-2.csv")
@ -610,14 +620,14 @@ file.remove("challenge.rds")
To get other types of data into R, we recommend starting with the tidyverse packages listed below. They're certainly not perfect, but they are a good place to start. For rectangular data:
* haven reads SPSS, Stata, and SAS files.
* __haven__ reads SPSS, Stata, and SAS files.
* readxl reads excel files (both `.xls` and `.xlsx`).
* __readxl__ reads Excel files (both `.xls` and `.xlsx`).
* DBI, along with a database specific backend (e.g. RMySQL, RSQLite,
RPostgreSQL etc) allows you to run SQL queries against a database
and return a data frame.
* __DBI__, along with a database specific backend (e.g. __RMySQL__,
__RSQLite__, __RPostgreSQL__ etc) allows you to run SQL queries against a
database and return a data frame.
For hierarchical data: use jsonlite, by Jeroen Ooms for json, and xml2 for XML.
For hierarchical data: use __jsonlite__ (by Jeroen Ooms) for json, and __xml2__ for XML. You will need to convert them to data frames using the tools in [handling hierarchy].
For more exotic file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [rio](https://github.com/leeper/rio) package.
For other file types, try the [R data import/export manual](https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the [__rio__](https://github.com/leeper/rio) package.