# Tibbles

## Introduction

Throughout this book we work with "tibbles" instead of R's traditional `data.frame`. Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the __tibble__ package, which provides opinionated data frames that make working in the tidyverse a little easier.

If this chapter leaves you wanting to learn more about tibbles, you might enjoy `vignette("tibble")`.

### Prerequisites

In this chapter we'll explore the __tibble__ package. Most chapters don't load the tibble package explicitly, because we're just using tibbles, not creating them. Here we're going to create them by hand (not from an existing data source), so we'll need to load it explicitly.

```{r setup}
library(tibble)
```

## Creating tibbles {#tibbles}

Almost all of the functions that you'll use in this book produce tibbles, as tibbles are one of the unifying features of the tidyverse. Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with `as_tibble()`:

```{r}
as_tibble(iris)
```

You can create a new tibble from individual vectors with `tibble()`. `tibble()` will automatically recycle inputs of length 1, and allows you to refer to variables that you just created, as shown below.

```{r}
tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
```
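
Recycling applies only to inputs of length 1. The sketch below (the variable names are just for illustration) shows that `tibble()` refuses to combine columns of length 4 and 2, whereas `data.frame()` quietly repeats the shorter one. The error is caught with `tryCatch()` so the chunk still runs.

```{r}
library(tibble)

# Lengths 4 and 2 are incompatible: tibble() only recycles length-1 inputs
result <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) e
)
conditionMessage(result)

# data.frame() recycles the shorter column because 2 divides 4 evenly
recycled <- data.frame(x = 1:4, y = 1:2)
nrow(recycled)
```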

If you're already familiar with `data.frame()`, note that `tibble()` does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row names.
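
The name handling is the easiest of these differences to see for yourself. (The string-to-factor conversion depends on your version of R: since R 4.0.0, `data.frame()` also leaves strings alone by default.) A minimal comparison, using a made-up column name:

```{r}
library(tibble)

# data.frame() repairs names by default (check.names = TRUE)
base_names <- names(data.frame(`my var` = 1:3))
# tibble() keeps the name exactly as written
tbl_names <- names(tibble(`my var` = 1:3))

base_names
tbl_names
```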

It's possible for a tibble to have column names that are not valid R variable names, aka __non-syntactic__ names. For example, they might not start with a letter, or they might contain unusual characters like a space. To refer to these variables, you need to surround them with backticks, `` ` ``:

```{r}
tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb
```
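
The backticks are needed wherever the name is used as code, not just where the column is created. As a small sketch (the tibble `odd` and its column names are invented for this example), a later column inside `tibble()` can refer to a non-syntactic one:

```{r}
library(tibble)

odd <- tibble(
  `2000` = 1:3,
  # Without backticks, 2000 * 2 would just be the number 4000
  `double 2000` = `2000` * 2
)
odd
```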

You'll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.

Another way to create a tibble is with `tribble()`, short for **tr**ansposed tibble. `tribble()` is customised for data entry in code: column headings are defined by formulas (i.e. they start with `~`), and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form.

```{r}
tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
```

I often add a comment (the line starting with `#`), to make it really clear where the header is.

## Tibbles vs. data frames

There are two main differences in the usage of a data frame vs a tibble: printing and subsetting.

### Printing

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from `str()`:

```{r}
tibble(
  a = lubridate::now() + runif(1e3) * 60,
  b = lubridate::today() + runif(1e3),
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
```

Tibbles are designed so that you don't accidentally overwhelm your console when you print large data frames. But sometimes you need more output than the default display. There are a few options that can help.

First, you can explicitly `print()` the data frame and control the number of rows (`n`) and the `width` of the display. `width = Inf` will display all columns:

```{r, eval = FALSE}
nycflights13::flights %>% 
  print(n = 10, width = Inf)
```

You can also control the default print behaviour by setting options:

* `options(tibble.print_max = n, tibble.print_min = m)`: if more than `n`
  rows, print only `m` rows. Use `options(tibble.print_max = Inf)` to always
  show all rows.

* Use `options(tibble.width = Inf)` to always print all columns, regardless
  of the width of the screen.

You can see a complete list of options by looking at the package help with `package?tibble`.
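
Options stay set for the rest of the session, so it can be handy to save the old values and restore them when you're done. A minimal sketch, assuming a throwaway 100-row tibble called `big`:

```{r}
library(tibble)

big <- tibble(x = 1:100)

# options() returns the previous values, so they can be restored afterwards
old <- options(tibble.print_max = 5, tibble.print_min = 3)
big          # more than 5 rows, so only the first 3 should be shown
options(old) # put the earlier settings back
```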

A final option is to use RStudio's built-in data viewer to get a scrollable view of the complete dataset. This is also often useful at the end of a long chain of manipulations.

```{r, eval = FALSE}
nycflights13::flights %>% 
  View()
```

### Subsetting

So far all the tools you've learned have worked with complete data frames. If you want to pull out a single variable, you need some new tools, `$` and `[[`. `[[` can extract by name or position; `$` only extracts by name but is a little less typing.

```{r}
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# Extract by name
df$x
df[["x"]]

# Extract by position
df[[1]]
```

To use these in a pipe, you'll need to use the special placeholder `.`:

```{r, include = FALSE}
library(magrittr)
```

```{r}
df %>% .$x
df %>% .[["x"]]
```

## Interacting with older code

Some older functions don't work with tibbles. If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a data frame:

```{r}
class(as.data.frame(tb))
```

The main reason that some older functions don't work with tibbles is the `[` function. We don't use `[` much in this book because `dplyr::filter()` and `dplyr::select()` allow you to solve the same problems with clearer code (but you will learn a little about it in [vector subsetting](#vector-subsetting)). With base R data frames, `[` sometimes returns a data frame, and sometimes returns a vector. With tibbles, `[` always returns another tibble.
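
To make the difference concrete, here is a small comparison on made-up data (the object names are only for illustration):

```{r}
library(tibble)

base_df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(base_df)

# One column selected: the data frame is simplified to a bare vector
one_col_base <- base_df[, "x"]
# Two columns selected: the result is still a data frame
two_col_base <- base_df[, c("x", "y")]
# The tibble stays a tibble even when a single column is selected
one_col_tbl <- tbl[, "x"]

class(one_col_base)
class(two_col_base)
class(one_col_tbl)
```

An older function that expects `df[, "x"]` to be a vector will get a one-column tibble instead, which is typically where things go wrong.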

## Exercises

1.  How can you tell if an object is a tibble? (Hint: try printing `mtcars`,
    which is a regular data frame.)

1.  Practice referring to non-syntactic names by:

    1.  Plotting a scatterplot of `1` vs `2`.

    1.  Creating a new column called `3` which is `2` divided by `1`.
        
    1.  Renaming the columns to `one`, `two` and `three`. 
    
    1.  Extracting the variable called `1`.
    
    ```{r}
    annoying <- tibble(
      `1` = 1:10,
      `2` = `1` * 2 + rnorm(length(`1`))
    )
    ```

1.  What does `tibble::enframe()` do? When might you use it?

1.  What option controls how many additional column names are printed
    at the footer of a tibble?