diff --git a/import.Rmd b/import.Rmd
index edf26cc..b8d8398 100644
--- a/import.Rmd
+++ b/import.Rmd
@@ -252,146 +252,3 @@ The settings you are most like to need to change are:
 ## Binary files
 
 Needs to discuss how data types in different languages are converted to R. Similarly for missing values.
-
-
-## Tibble diffs
-
-`data_frame()` is a nice way to create data frames. It encapsulates best practices for data frames:
-
-  * It never changes an input's type (i.e., no more `stringsAsFactors = FALSE`!).
-
-    ```{r}
-    data.frame(x = letters) %>% sapply(class)
-    data_frame(x = letters) %>% sapply(class)
-    ```
-
-    This makes it easier to use with list-columns:
-
-    ```{r}
-    data_frame(x = 1:3, y = list(1:5, 1:10, 1:20))
-    ```
-
-    List-columns are most commonly created by `do()`, but they can be useful to
-    create by hand.
-
-  * It never adjusts the names of variables:
-
-    ```{r}
-    data.frame(`crazy name` = 1) %>% names()
-    data_frame(`crazy name` = 1) %>% names()
-    ```
-
-  * It evaluates its arguments lazily and sequentially:
-
-    ```{r}
-    data_frame(x = 1:5, y = x ^ 2)
-    ```
-
-  * It adds the `tbl_df()` class to the output so that if you accidentally print a large
-    data frame you only get the first few rows.
-
-    ```{r}
-    data_frame(x = 1:5) %>% class()
-    ```
-
-  * It changes the behaviour of `[` to always return the same type of object:
-    subsetting using `[` always returns a `tbl_df()` object; subsetting using
-    `[[` always returns a column.
-
-    You should be aware of one case where subsetting a `tbl_df()` object
-    will produce a different result than a `data.frame()` object:
-
-    ```{r}
-    df <- data.frame(a = 1:2, b = 1:2)
-    str(df[, "a"])
-
-    tbldf <- tbl_df(df)
-    str(tbldf[, "a"])
-    ```
-
-  * It never uses `row.names()`. The whole point of tidy data is to
-    store variables in a consistent way. So it never stores a variable as
-    special attribute.
-
-  * It only recycles vectors of length 1. This is because recycling vectors of greater lengths
-    is a frequent source of bugs.
-
-### Coercion
-
-To complement `data_frame()`, dplyr provides `as_data_frame()` to coerce lists into data frames. It does two things:
-
-* It checks that the input list is valid for a data frame, i.e. that each element
-  is named, is a 1d atomic vector or list, and all elements have the same
-  length.
-
-* It sets the class and attributes of the list to make it behave like a data frame.
-  This modification does not require a deep copy of the input list, so it's
-  very fast.
-
-This is much simpler than `as.data.frame()`. It's hard to explain precisely what `as.data.frame()` does, but it's similar to `do.call(cbind, lapply(x, data.frame))` - i.e. it coerces each component to a data frame and then `cbinds()` them all together. Consequently `as_data_frame()` is much faster than `as.data.frame()`:
-
-```{r}
-l2 <- replicate(26, sample(100), simplify = FALSE)
-names(l2) <- letters
-microbenchmark::microbenchmark(
-  as_data_frame(l2),
-  as.data.frame(l2)
-)
-```
-
-The speed of `as.data.frame()` is not usually a bottleneck when used interactively, but can be a problem when combining thousands of messy inputs into one tidy data frame.
-
-### tbl_dfs vs data.frames
-
-There are three key differences between tbl_dfs and data.frames:
-
-* When you print a tbl_df, it only shows the first ten rows and all the
-  columns that fit on one screen. It also prints an abbreviated description
-  of the column type:
-
-  ```{r}
-  data_frame(x = 1:1000)
-  ```
-
-  You can control the default appearance with options:
-
-  * `options(dplyr.print_max = n, dplyr.print_min = m)`: if more than `m`
-    rows print `m` rows. Use `options(dplyr.print_max = Inf)` to always
-    show all rows.
-
-  * `options(dplyr.width = Inf)` will always print all columns, regardless
-    of the width of the screen.
-
-
-* When you subset a tbl\_df with `[`, it always returns another tbl\_df.
-  Contrast this with a data frame: sometimes `[` returns a data frame and
-  sometimes it just returns a single column:
-
-  ```{r}
-  df1 <- data.frame(x = 1:3, y = 3:1)
-  class(df1[, 1:2])
-  class(df1[, 1])
-
-  df2 <- data_frame(x = 1:3, y = 3:1)
-  class(df2[, 1:2])
-  class(df2[, 1])
-  ```
-
-  To extract a single column it's use `[[` or `$`:
-
-  ```{r}
-  class(df2[[1]])
-  class(df2$x)
-  ```
-
-* When you extract a variable with `$`, tbl\_dfs never do partial
-  matching. They'll throw an error if the column doesn't exist:
-
-  ```{r, error = TRUE}
-  df <- data.frame(abc = 1)
-  df$a
-
-  df2 <- data_frame(abc = 1)
-  df2$a
-  ```
-
diff --git a/work.Rmd b/work.Rmd
index c8ccd00..9e5796f 100644
--- a/work.Rmd
+++ b/work.Rmd
@@ -1,3 +1,88 @@
 # Work with your data
 
 With data, the relationships between values matter as much as the values themselves. Tidy data encodes those relationships.
+
+Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames, but they encode some patterns that make modern usage of R better. Unfortunately, R is an old language, and some things that made sense 10 or 20 years ago are no longer as valid. It's difficult to change base R without breaking existing code, so most innovation occurs in packages, which provide new functions that you should use instead of the old ones.
+
+```{r}
+library(tibble)
+```
+
+## Creating tibbles
+
+The majority of the functions that you'll use in this book already produce tibbles. But if you're working with functions from other packages, you might need to coerce a regular data frame to a tibble. You can do that with `as_data_frame()`:
+
+```{r}
+as_data_frame(iris)
+```
+
+As well as data frames, this function also knows how to convert lists (provided the elements are equal-length vectors), matrices, and tables.
+
+You can also create a new tibble from individual vectors with `data_frame()`:
+
+```{r}
+data_frame(x = 1:5, y = 1, z = x ^ 2 + y)
+```
+
+`data_frame()` does much less than `data.frame()`: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates `row.names()`. You can read more about these features in the vignette, `vignette("tibble")`.
+
+You can define a tibble row-by-row with `frame_data()`:
+
+```{r}
+frame_data(
+  ~x, ~y, ~z,
+  "a", 2, 3.6,
+  "b", 1, 8.5
+)
+```
+
+## Tibbles vs data frames
+
+There are two main differences in the usage of a data frame vs a tibble: printing and subsetting.
+
+Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from `str()`:
+
+```{r}
+library(nycflights13)
+flights
+```
+
+You can control the default appearance with options:
+
+* `options(tibble.print_max = n, tibble.print_min = m)`: if there are more than
+  `n` rows, print only the first `m` rows. Use `options(tibble.print_max = Inf)`
+  to always show all rows.
+
+* `options(tibble.width = Inf)` will always print all columns, regardless
+  of the width of the screen.
+
+Tibbles are strict about subsetting. If you try to access a variable that does not exist, you'll get an error:
+
+```{r, error = TRUE}
+flights$yea
+```
+
+Tibbles also clearly delineate `[` and `[[`: `[` always returns another tibble, `[[` always returns a vector. No more `drop = FALSE`!
+
+```{r}
+class(iris[ , 1])
+class(iris[ , 1, drop = FALSE])
+class(as_data_frame(iris)[ , 1])
+```
+
+Contrast this with a data frame: sometimes `[` returns a data frame and
+sometimes it just returns a single column:
+
+```{r}
+df1 <- data.frame(x = 1:3, y = 3:1)
+class(df1[, 1:2])
+class(df1[, 1])
+```
+
+## Interacting with legacy code
+
+Some older functions don't work with tibbles because they expect `df[, 1]` to return a vector, not a data frame. If you encounter one of these functions, use `as.data.frame()` to turn a tibble back into a data frame:
+
+```{r}
+class(as.data.frame(as_data_frame(iris)))
+```