Merge branch 'master' of github.com:hadley/r4ds

This commit is contained in:
hadley 2016-04-07 09:21:44 -05:00
commit 67516034f7
2 changed files with 15 additions and 15 deletions


@@ -7,7 +7,7 @@ library(readr)
## Overview
You can't apply any of the tools you've applied so far to your own work, unless you can get your own data into R. In this chapter, you'll learn how to:
You can't apply any of the tools you've learned so far to your own work, unless you can get your own data into R. In this chapter, you'll learn how to:
* Import flat files (like csv) with readr.
*
@@ -20,16 +20,16 @@ The common link between all these packages is they all aim to take your data and
There are many ways to read flat files into R. If you've been using R for a while, you might be familiar with `read.csv()`, `read.fwf()` and friends. We're not going to use these base functions. Instead, we're going to use `read_csv()`, `read_fwf()`, and friends from the readr package, because:
* These functions are typically much faster (~10x) than the base equivalents.
Long run running jobs also have a progress bar, so you can see what's
Long running jobs also have a progress bar, so you can see what's
happening. (If you're looking for raw speed, try `data.table::fread()`;
it's slightly less flexible than readr, but it can be twice as fast.)
* They have more flexible parsers: they can read in dates, times, currencies,
percentages, and more.
* They fail to do some annoying things like converting character vectors to
factors, munging the column headers to make sure they're valid R
variable names, and using row names.
* They do not do some annoying things that base R functions do, like converting
character vectors to factors, munging the column headers to make sure they're
valid R variable names, and using row names.
* They return objects with class `tbl_df`. As you saw in the dplyr chapter,
this provides a nicer printing method, so it's easier to work with large
@@ -67,15 +67,17 @@ readr also provides a number of functions for reading files off disk into simple
These might be useful for other programming tasks.
As well as reading data from disk, readr also provides tools for working with data frames and character vectors in R:
readr also provides tools for working with data frames and character vectors in R:
* `type_convert()` applies the same parsing heuristics to the character columns
in a data frame. You can override its choices using `col_types`.
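For instance, a rough sketch of what that looks like (the data frame here is invented for illustration):

```{r}
# A data frame whose columns all arrived as character strings
df <- data.frame(x = c("1", "2", "3"),
                 y = c("1.5", "2.7", "3.1"),
                 stringsAsFactors = FALSE)
# type_convert() re-parses the character columns, turning x and y
# into numeric columns
type_convert(df)
```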
For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to your knowledge to all the other functions in readr.
For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to apply your knowledge to all the other functions in readr.
### Basics
EXAMPLE
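A rough sketch of the kind of example that might go here (the inline data is invented; readr treats a string containing a newline as literal csv data):

```{r}
# The first line supplies the column names; the remaining lines
# become the rows of the data frame
read_csv("a,b,c
1,2,3
4,5,6")
```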
The first two arguments of `read_csv()` are:
* `file`: path (or URL) to the file you want to load. Readr can automatically
@@ -93,8 +95,6 @@ The first two arguments of `read_csv()` are:
* A character vector, used as column names. If these don't match up
with the columns in the data, you'll get a warning message.
EXAMPLE
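A rough sketch of those `col_names` options (again with invented inline data):

```{r}
# col_names = FALSE: there is no header row, so readr names the
# columns X1, X2, X3
read_csv("1,2,3\n4,5,6", col_names = FALSE)

# A character vector supplies the column names directly
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```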
### Column types
Readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to guess each column's type. This is fast, and fairly robust. If readr guesses the wrong type, you'll get warning messages. Readr prints out the first five, and you can access them all with `problems()`:
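A rough sketch of what that might look like (the malformed csv here is invented so that a parsing problem occurs):

```{r}
# The second data row has three fields but the header declares two,
# so readr warns and records a problem
df <- read_csv("a,b\n1,2\n3,4,5")
problems(df)
```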
@@ -120,7 +120,7 @@ You can use the following types of columns
* `col_logical()` (l) parses TRUE, T, FALSE and F into a logical vector.
* `col_character()` (c) leaves strings as is.
* `col_number()` (n) is a more flexible parsed for numbers embedded in other
* `col_number()` (n) is a more flexible parser for numbers embedded in other
strings. It will look for the first number in a string, ignoring non-numeric
prefixes and suffixes. It will also ignore the grouping mark specified by
the locale (see below for more details).
@@ -142,9 +142,9 @@ read_csv("mypath.csv", col_types = cols(
))
```
(If you just have a few columns you supply a single string giving the type for each column: `i__dc`. See the documentation for more details. It's not as easy to understand as the `cols()` specification, so I'm not going to describe it further here.)
(If you just have a few columns, you can supply a single string that gives the type for each column: `i__dc`. See the documentation for more details. It's not as easy to understand as the `cols()` specification, so I'm not going to describe it further here.)
By default, any column not mentioned in `cols` will be guessed. If you'd rather those columns are simply not read in, use `cols_only()`. In that case, you can use `col_guess()` (?) if you want to guess the type of a column.
By default, any column not mentioned in `cols` will be guessed. If you'd rather those columns were simply not read in, use `cols_only()`. In that case, you can use `col_guess()` (?) if you want a column to be read in with its type guessed.
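A rough sketch of how that might look (the file name and column names are placeholders, following the `mypath.csv` example above):

```{r, eval = FALSE}
# Only x and y are read in: x is forced to integer, y's type is
# guessed, and every other column in the file is dropped
read_csv("mypath.csv", col_types = cols_only(
  x = col_integer(),
  y = col_guess()
))
```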
Each `col_XYZ()` function also has a corresponding `parse_XYZ()` that you can use on a character vector. This makes it easier to explore what each of the parsers does interactively.
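For example (a quick sketch with invented strings):

```{r}
# Parse character vectors directly, without reading a file
parse_integer(c("1", "2", "3"))
parse_number(c("$1,234", "20%"))
```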
@@ -163,7 +163,7 @@ parse_logical(c("TRUE ", " ."), na = ".")
#### Datetimes
Readr provides three options depending on where you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:
Readr provides three options depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:
* Date times: an [ISO8601](https://en.wikipedia.org/wiki/ISO_8601) date time.
* Date: a year, optional separator, month, optional separator, day.
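A rough sketch of those defaults, using the corresponding `parse_*()` helpers on invented strings:

```{r}
# An ISO8601 date time
parse_datetime("2010-10-01T2010")
# Year, optional separator, month, optional separator, day
parse_date("2010-10-01")
# A time of day
parse_time("20:10:01")
```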


@@ -2,7 +2,7 @@
With data, the relationships between values matter as much as the values themselves. Tidy data encodes those relationships.
Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frame but encode some patterns that make modern usage of R better. Unfortunately R is an old language, and things that made sense 10 or 20 years a go are no longer as valid. It's difficult to change base R without breaking existing code, so most innovation occurs in packages, providing new functions that you should use instead of the old ones.
Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames, but they encode some patterns that make modern usage of R better. Unfortunately R is an old language, and things that made sense 10 or 20 years ago are no longer as valid. It's difficult to change base R without breaking existing code, so most innovation occurs in packages, providing new functions that you should use instead of the old ones.
```{r}
library(tibble)
@@ -16,7 +16,7 @@ The majority of the functions that you'll use in this book already produce tibbl
as_data_frame(iris)
```
As well as data frames, this function also knows how to convert lists (provided the elements are equal length vectors), matrices, and tables.
`as_data_frame()` knows how to convert data frames, lists (provided the elements are equal length vectors), matrices, and tables.
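For instance, a minimal sketch (the list here is invented):

```{r}
# A named list of equal-length vectors converts directly to a tibble
as_data_frame(list(x = 1:3, y = c("a", "b", "c")))
```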
You can also create a new tibble from individual vectors with `data_frame()`:
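A minimal sketch of what that looks like (the columns are invented):

```{r}
# data_frame() evaluates its arguments lazily and in order, so later
# columns can refer to earlier ones
data_frame(x = 1:5, y = x ^ 2)
```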