r4ds/wrangle.Rmd

# (PART) Wrangle {-}

# Introduction

With data, the relationships between values matter as much as the values themselves. Tidy data encodes those relationships.

Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames but they encode some patterns that make modern usage of R better. Unfortunately R is an old language, and things that made sense 10 or 20 years a go are no longer as valid. It's difficult to change base R without breaking existing code, so most innovation occurs in packages, providing new functions that you should use instead of the old ones.

```{r}
library(tibble)
```

## Creating tibbles {#tibbles}

The majority of the functions that you'll use in this book already produce tibbles. But if you're working with functions from other packages, you might need to coerce a regular data frame a tibble. You can do that with `as_data_frame()`:

```{r}
as_data_frame(iris)
```

`as_data_frame()` knows how to convert data frames, lists (provided the elements are equal length vectors), matrices, and tables. 

You can also create a new tibble from individual vectors with `data_frame()`:

```{r}
data_frame(x = 1:5, y = 1, z = x ^ 2 + y)
```

`data_frame()` does much less than `data.frame()`: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates `row.names()`. You can read more about these features in the vignette, `vignette("tibble")`.

You can define a tibble row-by-row with `frame_data()`:

```{r}
frame_data(
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)
```

## Tibbles vs data frames

There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from `str()`:

```{r}
library(nycflights13)
flights
```

You can control the default appearance with options:

* `options(tibble.print_max = n, tibble.print_min = m)`: if more than `m`
  rows print `m` rows. Use `options(dplyr.print_max = Inf)` to always
  show all rows.

* `options(tibble.width = Inf)` will always print all columns, regardless
   of the width of the screen.

Tibbles are strict about subsetting. If you try to access a variable that does not exist, you'll get an error:

```{r, error = TRUE}
flights$yea
```

Tibbles also clearly delineate `[` and `[[`: `[` always returns another tibble, `[[` always returns a vector. No more `drop = FALSE`!

```{r}
class(iris[ , 1])
class(iris[ , 1, drop = FALSE])
class(as_data_frame(iris)[ , 1])
```

Contrast this with a data frame: sometimes `[` returns a data frame and
sometimes it just returns a single column:

```{r}
df1 <- data.frame(x = 1:3, y = 3:1)
class(df1[, 1:2])
class(df1[, 1])
```

## Interacting with legacy code

Some older functions don't work with tibbles because they expect `df[, 1]` to return a vector, not a data frame. If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a data frame:

```
class(as.data.frame(tbl_df(iris)))
```
Restructure chapters once more 2016-04-27 15:04:29 +08:00			`# (PART) Wrangle {-}`
Use new part headings 2016-04-21 21:01:34 +08:00
			`# Introduction`
Ensure every chapter has a heading 2016-02-12 06:31:34 +08:00
Reorders outline. Model section becomes do science section. 2016-01-21 05:18:50 +08:00			`With data, the relationships between values matter as much as the values themselves. Tidy data encodes those relationships.`
Update tibbles & move up in hierarchy 2016-03-25 22:39:49 +08:00
Small edits to work.Rmd 2016-04-07 02:20:52 +08:00			`Throughout this book we work with "tibbles" instead of the traditional data frame. Tibbles _are_ data frames but they encode some patterns that make modern usage of R better. Unfortunately R is an old language, and things that made sense 10 or 20 years a go are no longer as valid. It's difficult to change base R without breaking existing code, so most innovation occurs in packages, providing new functions that you should use instead of the old ones.`
Update tibbles & move up in hierarchy 2016-03-25 22:39:49 +08:00
			```{r}
			`library(tibble)`
			```

Working on data structures 2016-03-28 21:23:46 +08:00			`## Creating tibbles {#tibbles}`
Update tibbles & move up in hierarchy 2016-03-25 22:39:49 +08:00
			The majority of the functions that you'll use in this book already produce tibbles. But if you're working with functions from other packages, you might need to coerce a regular data frame a tibble. You can do that with `as_data_frame()`:

			```{r}
			`as_data_frame(iris)`
			```

Small edits to work.Rmd 2016-04-07 02:20:52 +08:00			`as_data_frame()` knows how to convert data frames, lists (provided the elements are equal length vectors), matrices, and tables.
Update tibbles & move up in hierarchy 2016-03-25 22:39:49 +08:00
			You can also create a new tibble from individual vectors with `data_frame()`:

			```{r}
			`data_frame(x = 1:5, y = 1, z = x ^ 2 + y)`
			```

			`data_frame()` does much less than `data.frame()`: it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates `row.names()`. You can read more about these features in the vignette, `vignette("tibble")`.

			You can define a tibble row-by-row with `frame_data()`:

			```{r}
			`frame_data(`
			`~x, ~y, ~z,`
			`"a", 2, 3.6,`
			`"b", 1, 8.5`
			`)`
			```

			`## Tibbles vs data frames`

			`There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.`

			Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from `str()`:

			```{r}
			`library(nycflights13)`
			`flights`
			```

			`You can control the default appearance with options:`

			* `options(tibble.print_max = n, tibble.print_min = m)`: if more than `m`
			rows print `m` rows. Use `options(dplyr.print_max = Inf)` to always
			`show all rows.`

			* `options(tibble.width = Inf)` will always print all columns, regardless
			`of the width of the screen.`

			`Tibbles are strict about subsetting. If you try to access a variable that does not exist, you'll get an error:`

			```{r, error = TRUE}
			`flights$yea`
			```

			Tibbles also clearly delineate `[` and `[[`: `[` always returns another tibble, `[[` always returns a vector. No more `drop = FALSE`!

			```{r}
			`class(iris[ , 1])`
			`class(iris[ , 1, drop = FALSE])`
			`class(as_data_frame(iris)[ , 1])`
			```

			Contrast this with a data frame: sometimes `[` returns a data frame and
			`sometimes it just returns a single column:`

			```{r}
			`df1 <- data.frame(x = 1:3, y = 3:1)`
			`class(df1[, 1:2])`
			`class(df1[, 1])`
			```

			`## Interacting with legacy code`

			Some older functions don't work with tibbles because they expect `df[, 1]` to return a vector, not a data frame. If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a data frame:

			```
			`class(as.data.frame(tbl_df(iris)))`
			```