Update tibbles & move up in hierarchy

hadley 2016-03-25 09:39:49 -05:00
parent 1c99099a0c
commit 031d7c9182
2 changed files with 85 additions and 143 deletions


@ -252,146 +252,3 @@ The settings you are most like to need to change are:
## Binary files
Needs to discuss how data types in different languages are converted to R. Similarly for missing values.
## Tibble diffs
`data_frame()` is a nice way to create data frames. It encapsulates best practices for data frames:
* It never changes an input's type (i.e., no more `stringsAsFactors = FALSE`!).
```{r}
data.frame(x = letters) %>% sapply(class)
data_frame(x = letters) %>% sapply(class)
```
This makes it easier to use with list-columns:
```{r}
data_frame(x = 1:3, y = list(1:5, 1:10, 1:20))
```
List-columns are most commonly created by `do()`, but they can be useful to
create by hand.
* It never adjusts the names of variables:
```{r}
data.frame(`crazy name` = 1) %>% names()
data_frame(`crazy name` = 1) %>% names()
```
* It evaluates its arguments lazily and sequentially:
```{r}
data_frame(x = 1:5, y = x ^ 2)
```
* It adds the `tbl_df()` class to the output so that if you accidentally print a large
data frame you only get the first few rows.
```{r}
data_frame(x = 1:5) %>% class()
```
* It changes the behaviour of `[` to always return the same type of object:
subsetting using `[` always returns a `tbl_df()` object; subsetting using
`[[` always returns a column.
You should be aware of one case where subsetting a `tbl_df()` object
will produce a different result than a `data.frame()` object:
```{r}
df <- data.frame(a = 1:2, b = 1:2)
str(df[, "a"])
tbldf <- tbl_df(df)
str(tbldf[, "a"])
```
* It never uses `row.names()`. The whole point of tidy data is to
store variables in a consistent way, so it never stores a variable as
a special attribute.
* It only recycles vectors of length 1. This is because recycling vectors of greater lengths
is a frequent source of bugs.
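The recycling rule can be illustrated with a short sketch. It assumes the tibble package is installed and supplies `data_frame()` as used above; the column names are arbitrary.

```r
library(tibble)

# Base R silently recycles the length-2 vector to fill four rows.
base_df <- data.frame(x = 1:4, y = 1:2)
base_df$y

# A length-1 vector is still recycled by data_frame().
ok <- data_frame(x = 1:4, y = 0)
nrow(ok)

# A length-2 vector against four rows is rejected with an error.
res <- try(data_frame(x = 1:4, y = 1:2), silent = TRUE)
inherits(res, "try-error")
```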
### Coercion
To complement `data_frame()`, dplyr provides `as_data_frame()` to coerce lists into data frames. It does two things:
* It checks that the input list is valid for a data frame, i.e. that each element
is named, is a 1d atomic vector or list, and all elements have the same
length.
* It sets the class and attributes of the list to make it behave like a data frame.
This modification does not require a deep copy of the input list, so it's
very fast.
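The two steps can be sketched in base R. `list_to_df()` below is a hypothetical helper written for illustration only, not the package's implementation; it builds a plain `data.frame` rather than a `tbl_df`.

```r
# Hypothetical helper: validate a list, then relabel it as a data frame.
list_to_df <- function(x) {
  stopifnot(is.list(x))

  # Step 1: check that the list is a valid data frame.
  nms <- names(x)
  if (is.null(nms) || any(is.na(nms) | nms == "")) {
    stop("every element must be named")
  }
  one_d <- vapply(
    x,
    function(el) (is.atomic(el) || is.list(el)) && is.null(dim(el)),
    logical(1)
  )
  if (!all(one_d)) stop("every element must be a 1d atomic vector or list")
  n <- unique(lengths(x))
  if (length(n) > 1) stop("all elements must have the same length")

  # Step 2: only attributes change; the column vectors are not rebuilt.
  attr(x, "row.names") <- .set_row_names(n)
  class(x) <- "data.frame"
  x
}

df <- list_to_df(list(a = 1:3, b = letters[1:3]))
str(df)
```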
This is much simpler than `as.data.frame()`. It's hard to explain precisely what `as.data.frame()` does, but it's similar to `do.call(cbind, lapply(x, data.frame))`, i.e. it coerces each component to a data frame and then `cbind()`s them all together. Consequently `as_data_frame()` is much faster than `as.data.frame()`:
```{r}
l2 <- replicate(26, sample(100), simplify = FALSE)
names(l2) <- letters
microbenchmark::microbenchmark(
as_data_frame(l2),
as.data.frame(l2)
)
```
The speed of `as.data.frame()` is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.
### tbl_dfs vs data.frames
There are three key differences between tbl_dfs and data.frames:
* When you print a tbl_df, it only shows the first ten rows and all the
columns that fit on one screen. It also prints an abbreviated description
of the column type:
```{r}
data_frame(x = 1:1000)
```
You can control the default appearance with options:
* `options(dplyr.print_max = n, dplyr.print_min = m)`: if there are more than `n`
rows, print only the first `m` rows. Use `options(dplyr.print_max = Inf)` to always
show all rows.
* `options(dplyr.width = Inf)` will always print all columns, regardless
of the width of the screen.
* When you subset a tbl\_df with `[`, it always returns another tbl\_df.
Contrast this with a data frame: sometimes `[` returns a data frame and
sometimes it just returns a single column:
```{r}
df1 <- data.frame(x = 1:3, y = 3:1)
class(df1[, 1:2])
class(df1[, 1])
df2 <- data_frame(x = 1:3, y = 3:1)
class(df2[, 1:2])
class(df2[, 1])
```
To extract a single column, use `[[` or `$`:
```{r}
class(df2[[1]])
class(df2$x)
```
* When you extract a variable with `$`, tbl\_dfs never do partial
matching. They'll throw an error if the column doesn't exist:
```{r, error = TRUE}
df <- data.frame(abc = 1)
df$a
df2 <- data_frame(abc = 1)
df2$a
```


@ -1,3 +1,88 @@
# Work with your data
With data, the relationships between values matter as much as the values themselves. Tidy data encodes those relationships.
Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames, but they encode some patterns that make modern usage of R better. Unfortunately R is an old language, and some things that made sense 10 or 20 years ago are no longer as valid. It's difficult to change base R without breaking existing code, so most innovation occurs in packages, providing new functions that you should use instead of the old ones.
```{r}
library(tibble)
```
## Creating tibbles
The majority of the functions that you'll use in this book already produce tibbles. But if you're working with functions from other packages, you might need to coerce a regular data frame to a tibble. You can do that with `as_data_frame()`:
```{r}
as_data_frame(iris)
```
As well as data frames, this function also knows how to convert lists (provided the elements are equal length vectors), matrices, and tables.
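As a small sketch of the list case (the element names `id` and `name` are made up for illustration), a list of equal-length vectors converts directly, while unequal lengths are rejected:

```r
library(tibble)

# A list of equal-length vectors becomes a tibble with one column per element.
l <- list(id = 1:3, name = c("a", "b", "c"))
tb <- as_data_frame(l)
class(tb)

# Elements of different lengths (3 and 2) do not form a valid tibble.
bad <- try(as_data_frame(list(id = 1:3, name = c("a", "b"))), silent = TRUE)
inherits(bad, "try-error")
```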
You can also create a new tibble from individual vectors with `data_frame()`:
```{r}
data_frame(x = 1:5, y = 1, z = x ^ 2 + y)
```
`data_frame()` does much less than `data.frame()`: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates `row.names()`. You can read more about these features in the vignette, `vignette("tibble")`.
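A quick comparison makes the first two points concrete. This is only a sketch with an arbitrary column name; `stringsAsFactors = TRUE` is passed explicitly so that `data.frame()` behaves the same way regardless of the R version in use (it was the default before R 4.0).

```r
library(tibble)

# data.frame() rewrites the name and, with stringsAsFactors = TRUE,
# converts the strings to a factor.
base_df <- data.frame(`my var` = c("a", "b"), stringsAsFactors = TRUE)
names(base_df)
class(base_df[[1]])

# data_frame() keeps the name exactly as given and leaves the type alone.
tb <- data_frame(`my var` = c("a", "b"))
names(tb)
class(tb[[1]])
```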
You can define a tibble row-by-row with `frame_data()`:
```{r}
frame_data(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
)
```
## Tibbles vs data frames
There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from `str()`:
```{r}
library(nycflights13)
flights
```
You can control the default appearance with options:
* `options(tibble.print_max = n, tibble.print_min = m)`: if there are more than `n`
rows, print only the first `m` rows. Use `options(tibble.print_max = Inf)` to always
show all rows.
* `options(tibble.width = Inf)` will always print all columns, regardless
of the width of the screen.
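The following sketch shows one way to try these options without changing your session permanently. The values 20 and 5 are arbitrary choices for illustration, and `small` and `large` are made-up example tibbles.

```r
library(tibble)

# options() returns the previous values, so they can be restored later.
old <- options(tibble.print_max = 20, tibble.print_min = 5)

small <- as_data_frame(data.frame(x = 1:15))
large <- as_data_frame(data.frame(x = 1:100))

print(small)  # 15 rows is within print_max, so every row is shown
print(large)  # 100 rows exceeds print_max, so only print_min rows are shown

options(old)  # restore the previous settings
```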
Tibbles are strict about subsetting. If you try to access a variable that does not exist, you'll get an error:
```{r, error = TRUE}
flights$yea
```
Tibbles also clearly delineate `[` and `[[`: `[` always returns another tibble, `[[` always returns a vector. No more `drop = FALSE`!
```{r}
class(iris[ , 1])
class(iris[ , 1, drop = FALSE])
class(as_data_frame(iris)[ , 1])
```
Contrast this with a data frame: sometimes `[` returns a data frame and
sometimes it just returns a single column:
```{r}
df1 <- data.frame(x = 1:3, y = 3:1)
class(df1[, 1:2])
class(df1[, 1])
```
## Interacting with legacy code
Some older functions don't work with tibbles because they expect `df[, 1]` to return a vector, not a data frame. If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a data frame:
```
class(as.data.frame(as_data_frame(iris)))
```
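To see why this matters, here is a minimal sketch. `first_col_mean()` is a hypothetical stand-in for an older function that assumes `df[, 1]` drops to a plain vector; it is not taken from any real package.

```r
library(tibble)

# Hypothetical legacy helper: relies on df[, 1] being a numeric vector.
first_col_mean <- function(df) mean(df[, 1])

tb <- as_data_frame(iris)

# With a tibble, tb[, 1] is still a one-column tibble, so mean() cannot
# compute a number; it warns and returns NA.
suppressWarnings(first_col_mean(tb))

# Converting back to a data frame restores the behaviour the helper expects.
first_col_mean(as.data.frame(tb))
```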