Polishing tibbles

Hadley Wickham 2022-05-02 08:35:25 -05:00
parent b1f5d9f57c
commit 3267221ebb
1 changed files with 78 additions and 47 deletions

@ -1,9 +1,13 @@
# Tibbles
```{r, results = "asis", echo = FALSE}
status("complete")
```
## Introduction
Throughout this book we work with "tibbles" instead of R's traditional `data.frame`.
Tibbles *are* data frames, but they tweak some older behaviours to make life a little easier.
Tibbles *are* data frames, but they tweak some older behaviors to make your life a little easier.
R is an old language, and some things that were useful 10 or 20 years ago now get in your way.
It's difficult to change base R without breaking existing code, so most innovation occurs in packages.
Here we will describe the **tibble** package, which provides opinionated data frames that make working in the tidyverse a little easier.
@ -21,30 +25,48 @@ library(tidyverse)
## Creating tibbles
Almost all of the functions that you'll use in this book produce tibbles, as tibbles are one of the unifying features of the tidyverse.
Most other R packages use regular `data.frame`s, so you might want to coerce a `data.frame` to a tibble.
You can do that with `as_tibble()`:
If you need to make a tibble "by hand", you can use `tibble()` or `tribble()`.
`tibble()` works by assembling individual vectors:
```{r}
as_tibble(mtcars)
x <- c(1, 2, 5)
y <- c("a", "b", "h")
tibble(x, y)
```
You can create a new tibble from individual vectors with `tibble()`.
`tibble()` will automatically recycle inputs of length 1, and allows you to refer to variables that you just created, as shown in this example:
You can also optionally name the inputs, provide data inline with `c()`, and perform computation:
```{r}
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
x1 = x,
x2 = c(10, 15, 25),
y = sqrt(x1^2 + x2^2)
)
```
If you're already familiar with `data.frame()`, note that `tibble()` does less: it never changes the names of variables and it never creates row names.
Every column in a data frame or tibble must be the same length, so you'll get an error if the lengths are different:
Another way to create a tibble is with `tribble()`, short for **tr**ansposed tibble.
`tribble()` is customized for data entry in code: column headings start with `~`) and entries are separated by commas.
This makes it possible to lay out small amounts of data in easy to read form:
```{r, error = TRUE}
tibble(
x = c(1, 5),
y = c("a", "b", "c")
)
```
As the error suggests, individual values will be recycled to the same length as everything else:
```{r}
tibble(
x = 1:5,
y = "a",
z = TRUE
)
```
Another way to create a tibble is with `tribble()`, which is short for **tr**ansposed tibble.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy-to-read form:
```{r}
tribble(
@ -54,10 +76,18 @@ tribble(
)
```
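Because the diff elides the body of the `tribble()` chunk, here is a minimal sketch of the layout described above. The `pets` data is hypothetical, invented purely for illustration; it is not the example from the chapter.

```r
library(tibble)

# Hypothetical data: each `~name` is a column heading,
# and each following row of comma-separated values is one observation.
pets <- tribble(
  ~name,    ~species, ~age,
  "Rex",    "dog",    3,
  "Tom",    "cat",    5,
  "Goldie", "fish",   1
)

pets
```

Lining up the commas is optional, but it keeps the code readable as a small table.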
### Non-syntactic names
Finally, if you have a regular `data.frame` you can turn it into a tibble with `as_tibble()`:
It's possible for a tibble to have column names that are not valid R variable names, aka **non-syntactic** names.
For example, they might not start with a letter, or they might contain unusual characters like a space.
```{r}
as_tibble(mtcars)
```
The inverse of `as_tibble()` is `as.data.frame()`; it converts a tibble back into a regular `data.frame`.
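As a rough sketch of the round trip, the following converts `mtcars` in both directions and checks the class at each step. The object names `tb_cars` and `df_cars` are made up for this illustration.

```r
library(tibble)

tb_cars <- as_tibble(mtcars)       # data.frame -> tibble
df_cars <- as.data.frame(tb_cars)  # tibble -> data.frame

class(tb_cars)  # includes "tbl_df"
class(df_cars)  # just "data.frame"

# as_tibble() drops row names by default; the `rownames` argument
# keeps them as an ordinary column instead.
tb_named <- as_tibble(mtcars, rownames = "model")
names(tb_named)[1]
```

Note that the round trip is not perfectly lossless: unless you ask for them with `rownames`, the car names stored as row names in `mtcars` do not come back.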
## Non-syntactic names
It's possible for a tibble to have column names that are not valid R variable names, names that are **non-syntactic**.
For example, the variables might not start with a letter or they might contain unusual characters like a space.
To refer to these variables, you need to surround them with backticks, `` ` ``:
```{r}
@ -74,12 +104,13 @@ You'll also need the backticks when working with these variables in other packag
## Tibbles vs. data.frame
There are two main differences in the usage of a tibble vs. a classic `data.frame`: printing and subsetting.
If these differences cause problems when working with older packages, you can turn a tibble back to a regular data frame with `as.data.frame()`.
### Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
This makes it much easier to work with large data.
In addition to its name, each column reports its type, a nice feature borrowed from `str()`:
In addition to its name, each column reports its type, a nice feature inspired by `str()`:
```{r}
tibble(
@ -91,7 +122,7 @@ tibble(
)
```
Where possible, they also use color to draw your eye to important differences.
Where possible, tibbles also use color to draw your eye to important differences.
One of the most important distinctions is between the string `"NA"` and the missing value, `NA`:
```{r}
@ -106,7 +137,9 @@ First, you can explicitly `print()` the data frame and control the number of row
`width = Inf` will display all columns:
```{r}
nycflights13::flights |>
library(nycflights13)
flights |>
print(n = 10, width = Inf)
```
@ -123,15 +156,13 @@ A final option is to use RStudio's built-in data viewer to get a scrollable view
This is also often useful at the end of a long chain of manipulations.
```{r, eval = FALSE}
nycflights13::flights |>
View()
flights |> View()
```
### Subsetting
### Extracting variables
So far all the tools you've learned have worked with complete data frames.
If you want to pull out a single variable, you can use `dplyr::pull()`.
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector (you'll learn more about those in Chapter \@ref(vectors)).
If you want to pull out a single variable, you can use `dplyr::pull()`:
```{r}
tb <- tibble(
@ -140,11 +171,17 @@ tb <- tibble(
y1 = 6:10
)
tb |> pull(x1)
tb |> pull(x1) # by name
tb |> pull(1) # by position
```
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector, which you'll learn about in Chapter \@ref(vectors).
```{r}
tb |> pull(x1, name = id)
```
Alternatively, you can use base R tools like `$` and `[[`.
You can also use the base R tools `$` and `[[`.
`[[` can extract by name or position; `$` only extracts by name but is a little less typing.
```{r}
@ -157,35 +194,29 @@ tb[[1]]
```
Compared to a `data.frame`, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
In the following chunk `df` is a `data.frame` and `tb` is a `tibble`.
```{r}
# Tibbles complain a lot:
tb$x
tb$z
# Data frames use partial matching and don't complain if a column doesn't exist
df <- as.data.frame(tb)
# Partial match to existing variable name
tb$x # Warning + no match
df$x # Warning + partial match
# Column doesn't exist
tb$z # Warning
df$z # No warning
df$x
df$z
```
## Interacting with older code
For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
Some older functions don't work with tibbles.
If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a `data.frame`:
### Subsetting
```{r}
class(as.data.frame(tb))
```
The main reason that some older functions don't work with tibbles is the `[` function.
We don't use `[` much in this book because for data frames, `dplyr::filter()` and `dplyr::select()` typically allow you to solve the same problems with clearer code.
With base R `data.frame`s, `[` sometimes returns a `data.frame`, and sometimes returns a vector.
Lastly, there are some important differences when using `[`.
With `data.frame`s, `[` sometimes returns a `data.frame`, and sometimes returns a vector, which is a common source of bugs.
With tibbles, `[` always returns another tibble.
This can sometimes cause problems when working with older code.
If you hit one of those functions, just use `as.data.frame()` to turn your tibble back to a `data.frame`.
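To make the `[` difference concrete, here is a small sketch that selects a single column from a data frame and from a tibble. The objects `tb_small` and `df_small` are hypothetical and exist only for this illustration.

```r
library(tibble)

tb_small <- tibble(x1 = 1:3, y1 = 4:6)
df_small <- as.data.frame(tb_small)

# Selecting one column with `[`:
class(df_small[, "x1"])  # a bare vector: the data frame structure is dropped
class(tb_small[, "x1"])  # still a tibble

# With a data.frame, drop = FALSE keeps the result a data frame
class(df_small[, "x1", drop = FALSE])
```

Older code that relies on `[` returning a vector is the usual place where a tibble causes surprises, which is why converting with `as.data.frame()` first tends to resolve the problem.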
## Exercises
### Exercises
1. How can you tell if an object is a tibble?
(Hint: try printing `mtcars`, which is a regular `data.frame`).