
# (PART) Wrangle {-}

# Introduction

Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames, but they tweak some older behaviours to make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. It's difficult to change base R without breaking existing code, so most innovation occurs in packages. Here we will describe the tibble package, which provides opinionated data frames that make working in the tidyverse a little easier. You can learn more about tibbles in the accompanying vignette: `vignette("tibble")`.

```{r setup}
library(tibble)
```

## Creating tibbles {#tibbles}

The majority of the functions that you'll use in this book already produce tibbles. If you're working with functions from other packages, you might need to coerce a regular data frame to a tibble. You can do that with `as_tibble()`:

```{r}
as_tibble(iris)
```

`as_tibble()` knows how to convert data frames, lists (provided the elements are equal length vectors), matrices, and tables. 
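As a small illustration of the non-data-frame inputs, the sketch below converts a named list and a matrix. The object names (`from_list`, `m`, `from_matrix`) are made up for this example, and the matrix is given explicit column names on the assumption that this avoids any name-repair messages in recent versions of tibble:

```{r}
library(tibble)

# A named list whose elements all have the same length
from_list <- as_tibble(list(x = 1:3, y = c("a", "b", "c")))

# A matrix; column names are supplied explicitly so no name repair is needed
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
from_matrix <- as_tibble(m)

from_list
from_matrix
```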

You can create a new tibble from individual vectors with `tibble()`:

```{r}
tibble(x = 1:5, y = 1, z = x ^ 2 + y)
```

`tibble()` automatically recycles inputs of length 1, and you can refer to variables that you just created. Compared to `data.frame()`, `tibble()` does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates `row.names()`.
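A quick way to see these differences is to give both constructors the same awkward input. This is only a sketch: the column name `my var` and the objects `tb_chr` and `recycled` are invented for the example.

```{r}
library(tibble)

# A non-syntactic column name: data.frame() rewrites it, tibble() keeps it
names(data.frame(`my var` = 1:2))
names(tibble(`my var` = 1:2))

# Character input stays character
tb_chr <- tibble(x = c("a", "b"))
class(tb_chr$x)

# Only length-1 inputs are recycled; other length mismatches are an error
recycled <- try(tibble(x = 1:4, y = 1:2), silent = TRUE)
class(recycled)
```

Wrapping the last call in `try()` keeps the chunk running so that you can inspect the error rather than stopping the document.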

Another way to create a tibble is with `frame_data()`, which is customised for data entry in R code. Column headings are defined by formulas (`~`), and entries are separated by commas:

```{r}
frame_data(
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)
```

## Tibbles vs. data frames

There are two main differences in the usage of a data frame vs. a tibble: printing and subsetting.

### Printing

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from `str()`:

```{r}
tibble(
  a = lubridate::now() + runif(1e3) * 60,
  b = lubridate::today() + runif(1e3),
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
```

You can control the default appearance with options:

* `options(tibble.print_max = n, tibble.print_min = m)`: if there are more
  than `n` rows, print only the first `m` rows. Use
  `options(tibble.print_max = Inf)` to always show all rows.

* `options(tibble.width = Inf)` will always print all columns, regardless
  of the width of the screen.
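The sketch below sets the two row options temporarily and then restores them. It relies on `tibble.print_max` acting as the row threshold and `tibble.print_min` as the number of rows shown once that threshold is exceeded; the values 50 and 5 are arbitrary. It also shows the `n` and `width` arguments of `print()`, which override the options for a single call:

```{r}
library(tibble)

# Save the previous values so they can be restored afterwards
old <- options(tibble.print_max = 50, tibble.print_min = 5)

as_tibble(mtcars)  # 32 rows, below the threshold of 50
as_tibble(iris)    # 150 rows, above the threshold

# Per-call overrides, leaving the global options alone
print(as_tibble(iris), n = 3, width = Inf)

options(old)
```

Restoring the old values at the end keeps the change from leaking into later chunks.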

You can see a complete list of options by looking at the package help: `package?tibble`.

### Subsetting

Tibbles are stricter about subsetting. If you try to access a variable that does not exist, you'll get a warning. Unlike data frames, tibbles do not use partial matching on column names:

```{r}
df <- data.frame(
  abc = 1:10, 
  def = runif(10), 
  xyz = sample(letters, 10)
)
tb <- as_tibble(df)

df$a
tb$a
```

Tibbles clearly delineate `[` and `[[`: `[` always returns another tibble, `[[` always returns a vector.

```{r}
# With data frames, [ sometimes returns a data frame, and sometimes returns 
# a vector
df[, 1]

# With tibbles, [ always returns another tibble
tb[, 1]

# To extract a single column as a vector, you should always use [[
tb[[1]]
```
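If you want to confirm what each form returns without reading the printed output, you can check the classes directly. This is a minimal sketch using two small objects, `small_df` and `small_tb`, that are invented here so the example stands alone:

```{r}
library(tibble)

small_df <- data.frame(abc = 1:3, def = c("x", "y", "z"))
small_tb <- as_tibble(small_df)

# Data frame: a single column selected with [ is dropped to a vector
# unless you ask for drop = FALSE
is.data.frame(small_df[, 1])
is.data.frame(small_df[, 1, drop = FALSE])

# Tibble: [ keeps the tibble, [[ and $ give the underlying vector
is_tibble(small_tb[, 1])
is.vector(small_tb[[1]])
identical(small_tb[["abc"]], small_tb$abc)
```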

## Interacting with legacy code

Some older functions don't work with tibbles because they expect `df[, 1]` to return a vector, not a data frame. If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a data frame:

```{r}
class(as.data.frame(tb))
```
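
To make the failure mode concrete, here is a hypothetical legacy helper, `first_col_mean()`, written on the assumption that `d[, 1]` is a plain vector. It is not from any real package, and `legacy_tb` is a small tibble invented for the example:

```{r}
library(tibble)

# Hypothetical older-style function: assumes d[, 1] is a vector
first_col_mean <- function(d) mean(d[, 1])

legacy_tb <- tibble(abc = 1:3, def = c("x", "y", "z"))

# With a tibble, d[, 1] is still a tibble, so mean() warns and returns NA
suppressWarnings(first_col_mean(legacy_tb))

# Converting back to a data frame restores the behaviour the function expects
first_col_mean(as.data.frame(legacy_tb))
```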