# Tibbles {#sec-tibbles}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
## Introduction
Throughout this book we work with "tibbles" instead of R's traditional `data.frame`.
Tibbles *are* data frames, but they tweak some older behaviors to make your life a little easier.
R is an old language, and some things that were useful 10 or 20 years ago now get in your way.
It's difficult to change base R without breaking existing code, so most innovation occurs in packages.
Here we will describe the **tibble** package, which provides opinionated data frames that make working in the tidyverse a little easier.
In most places, we use the terms tibble and data frame interchangeably; when we want to draw particular attention to R's built-in data frames, we'll call them `data.frame`s.
### Prerequisites
In this chapter we'll explore the **tibble** package, part of the core tidyverse.
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
## Creating tibbles
If you need to make a tibble "by hand", you can use `tibble()` or `tribble()`.
`tibble()` works by assembling individual vectors:
```{r}
x <- c(1, 2, 5)
y <- c("a", "b", "h")
tibble(x, y)
```
You can also optionally name the inputs, provide data inline with `c()`, and perform computation:
```{r}
tibble(
  x1 = x,
  x2 = c(10, 15, 25),
  y = sqrt(x1^2 + x2^2)
)
```
Every column in a data frame or tibble must be the same length, so you'll get an error if the lengths are different:
```{r}
#| error: true
tibble(
  x = c(1, 5),
  y = c("a", "b", "c")
)
```
As the error suggests, individual values will be recycled to the same length as everything else:
```{r}
tibble(
  x = 1:5,
  y = "a",
  z = TRUE
)
```
Another way to create a tibble is with `tribble()`, which is short for **tr**ansposed t**ibble**.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy-to-read form:
```{r}
tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)
```
Finally, if you have a regular `data.frame` you can turn it into a tibble with `as_tibble()`:
```{r}
as_tibble(mtcars)
```
The inverse of `as_tibble()` is `as.data.frame()`; it converts a tibble back into a regular `data.frame`.
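As a minimal sketch of that round trip (the object names `cars_tbl`, `cars_df`, and `cars_named`, and the column name `"model"`, are ours, chosen only for illustration), you can check the class at each step.
Note that `as_tibble()` drops row names by default; its `rownames` argument lets you keep them as an ordinary column instead:

```{r}
library(tibble)

# Convert a base data.frame to a tibble and back again.
cars_tbl <- as_tibble(mtcars)
cars_df <- as.data.frame(cars_tbl)

class(cars_tbl)
class(cars_df)

# Keep the row names of mtcars as a regular column (here called "model").
cars_named <- as_tibble(mtcars, rownames = "model")
names(cars_named)[1]
```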
## Non-syntactic names
It's possible for a tibble to have column names that are not valid R variable names, names that are **non-syntactic**.
For example, the variables might not start with a letter or they might contain unusual characters like a space.
To refer to these variables, you need to surround them with backticks, `` ` ``:
```{r}
tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
tb
```
You'll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
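For instance, here is a small sketch with dplyr (the tibble `odd_names` and the new column `both values` are made up for this illustration).
The backticks are needed both when reading the existing columns and when creating a new non-syntactic one:

```{r}
library(tibble)
library(dplyr)

odd_names <- tibble(
  `:)` = "smile",
  `2000` = "number"
)

# Backticks work the same way inside dplyr verbs.
combined <- odd_names |>
  mutate(`both values` = paste(`:)`, `2000`)) |>
  select(`both values`)

combined
```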
## Tibbles vs. data.frame
There are two main differences in the usage of a tibble vs. a classic `data.frame`: printing and subsetting.
If these differences cause problems when working with older packages, you can turn a tibble back to a regular data frame with `as.data.frame()`.
### Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
This makes it much easier to work with large data.
In addition to its name, each column reports its type, a nice feature inspired by `str()`:
```{r}
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
```
Where possible, tibbles also use color to draw your eye to important differences.
One of the most important distinctions is between the string `"NA"` and the missing value, `NA`:
```{r}
tibble(x = c("NA", NA))
```
Tibbles are designed to avoid overwhelming your console when you print large data frames.
But sometimes you need more output than the default display.
There are a few options that can help.
First, you can explicitly `print()` the data frame and control the number of rows (`n`) and the `width` of the display.
`width = Inf` will display all columns:
```{r}
library(nycflights13)
flights |>
  print(n = 10, width = Inf)
```
You can also control the default print behavior by setting options:
- `options(tibble.print_max = n, tibble.print_min = m)`: if there are more than `n` rows, print only the first `m` rows.
  Use `options(tibble.print_min = Inf)` to always show all rows.
- Use `options(tibble.width = Inf)` to always print all columns, regardless of the width of the screen.

You can see a complete list of options by looking at the package help with `package?tibble`.
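Because `options()` changes the setting for the rest of your session, it can be handy to save the previous values and restore them afterwards.
The following is only a sketch: the values `20` and `5` and the name `old` are arbitrary choices for illustration.

```{r}
library(tibble)

# options() returns the previous values, so they can be restored later.
old <- options(tibble.print_max = 20, tibble.print_min = 5)

# This tibble has more rows than print_max, so the print method
# should fall back to showing print_min rows.
tibble(x = 1:100)

# Restore the earlier settings.
options(old)
```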
A final option is to use RStudio's built-in data viewer to get a scrollable view of the complete dataset.
This is also often useful at the end of a long chain of manipulations.
```{r}
#| eval: false
flights |> View()
```
### Extracting variables
So far all the tools you've learned have worked with complete data frames.
If you want to pull out a single variable, you can use `dplyr::pull()`:
```{r}
tb <- tibble(
  id = LETTERS[1:5],
  x1 = 1:5,
  y1 = 6:10
)

tb |> pull(x1) # by name
tb |> pull(1) # by position
```
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector, which you'll learn about in @sec-vectors.
```{r}
tb |> pull(x1, name = id)
```
You can also use the base R tools `$` and `[[`.
`[[` can extract by name or position; `$` only extracts by name but is a little less typing.
```{r}
# Extract by name
tb$x1
tb[["x1"]]

# Extract by position
tb[[1]]
```
Compared to a `data.frame`, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
```{r}
# Tibbles complain a lot:
tb$x
tb$z

# Data frames use partial matching and don't complain if a column doesn't exist
df <- as.data.frame(tb)
df$x
df$z
```
For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
### Subsetting
Lastly, there are some important differences when using `[`.
With `data.frame`s, `[` sometimes returns a `data.frame`, and sometimes returns a vector, which is a common source of bugs.
With tibbles, `[` always returns another tibble.
This can sometimes cause problems when working with older code that expects `[` to return a vector.
If you hit a function that doesn't work with a tibble, just use `as.data.frame()` to turn your tibble back to a `data.frame`.
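A short sketch makes the difference concrete (the objects `df_small` and `tb_small` are invented for this illustration).
Selecting a single column with `[` drops a `data.frame` down to a vector unless you add `drop = FALSE`, whereas the tibble stays a tibble:

```{r}
library(tibble)

df_small <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb_small <- as_tibble(df_small)

# Selecting a single column with `[`:
class(df_small[, "x"]) # a plain vector: base R drops the data frame
class(tb_small[, "x"]) # still a tibble

# Base R keeps the data frame only if you ask for it explicitly.
class(df_small[, "x", drop = FALSE])
```

If you actually want the vector from a tibble, `[[`, `$`, or `pull()` say so explicitly.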
### Exercises
1. How can you tell if an object is a tibble?
    (Hint: try printing `mtcars`, which is a regular `data.frame`).
2. Compare and contrast the following operations on a `data.frame` and equivalent tibble.
    What is different?
    Why might the default `data.frame` behaviors cause you frustration?

    ```{r}
    #| eval: false
    df <- data.frame(abc = 1, xyz = "a")
    df$x
    df[, "xyz"]
    df[, c("abc", "xyz")]
    ```
3. If you have the name of a variable stored in an object, e.g. `var <- "mpg"`, how can you extract the referenced variable from a tibble?
4. Practice referring to non-syntactic names in the following data frame by:

    a. Extracting the variable called `1`.
    b. Plotting a scatterplot of `1` vs `2`.
    c. Creating a new column called `3` which is `2` divided by `1`.
    d. Renaming the columns to `one`, `two` and `three`.

    ```{r}
    annoying <- tibble(
      `1` = 1:10,
      `2` = `1` * 2 + rnorm(length(`1`))
    )
    ```
5. What does `tibble::enframe()` do?
    When might you use it?
6. What option controls how many additional column names are printed at the footer of a tibble?
## Summary
If this chapter leaves you wanting to learn more about tibbles, you might enjoy `vignette("tibble")`.