Tibble updates for transform chapter

hadley 2016-07-08 10:31:52 -05:00
parent d42f2184dc
commit 2ebaee835d
1 changed files with 20 additions and 72 deletions

@@ -1,11 +1,5 @@
# Data transformation {#transform}
```{r setup-transform, include = FALSE}
library(dplyr)
library(nycflights13)
library(ggplot2)
```
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need for visualisation. Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You'll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package.
When working with data you must:
@@ -31,78 +25,32 @@ The dplyr package makes these steps fast and easy:
In this chapter you'll learn the key verbs of dplyr in the context of a new dataset on flights departing New York City in 2013.
### Prerequisites
In this chapter we're going to focus on how to use dplyr. We'll illustrate the key ideas using some data in nycflights13.
```{r}
library(dplyr)
library(nycflights13)
library(ggplot2)
```
## nycflights13
To explore the basic data manipulation verbs of dplyr, we'll use the `flights` data frame from the nycflights13 package. This data frame contains all `r format(nrow(nycflights13::flights), big.mark = ",")` flights that departed from New York City in 2013. The data comes from the US [Bureau of Transportation Statistics](http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0), and is documented in `?nycflights13`.
```{r}
library(dplyr)
library(nycflights13)
flights
```
The first important thing to notice about this dataset is that it prints a little differently to most data frames: it only shows the first few rows and all the columns that fit on one screen. If you want to see the whole dataset, use `View()` which will open the dataset in the RStudio viewer.
You might notice that this data frame prints a little differently from other data frames you might have used: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)`, which will open the dataset in the RStudio viewer.) It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle].
You might also have noticed the row of three letter abbreviations under the column names. These describe the type of each variable:
It also prints an abbreviated description of the column type:
* int: integer
* dbl: double (real)
* chr: character
* lgl: logical
It prints differently because it has a different "class" to usual data frames:
```{r}
class(flights)
```
This is called a `tbl_df` (pronounced "tibble diff") or a `data_frame` (pronounced "data underscore frame"; cf. `data dot frame`). Generally, however, we won't worry about this relatively minor difference and will refer to everything as data frames.
You'll learn more about how `data_frame` works in data structures. If you want to convert your own data frames to this special case, use `as_data_frame()`. I recommend it for large data frames as it makes interactive exploration much less painful.
To create your own new tbl\_df from individual vectors, use `data_frame()`:
```{r}
data_frame(x = 1:3, y = c("a", "b", "c"))
```
--------------------------------------------------------------------------------
There are two other important differences between tbl_dfs and data.frames:
* When you subset a tbl\_df with `[`, it always returns another tbl\_df.
Contrast this with a data frame: sometimes `[` returns a data frame and
sometimes it just returns a single column (i.e. a vector):
```{r}
df1 <- data.frame(x = 1:3, y = 3:1)
class(df1[, 1:2])
class(df1[, 1])
df2 <- data_frame(x = 1:3, y = 3:1)
class(df2[, 1:2])
class(df2[, 1])
```
To extract a single column from a tbl\_df use `[[` or `$`:
```{r}
class(df2[[1]])
class(df2$x)
```
* When you extract a variable with `$`, tbl\_dfs never do partial
matching. They'll throw an error if the column doesn't exist:
```{r, error = TRUE}
df <- data.frame(abc = 1)
df$a
df2 <- data_frame(abc = 1)
df2$a
```
--------------------------------------------------------------------------------
* lgl: logical (`TRUE` or `FALSE`).
* int: integers.
* dbl: doubles (real numbers).
* chr: character strings.
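The abbreviations correspond to R's underlying vector types. As a minimal sketch using a small made-up data frame (the column names and values are hypothetical, not taken from `flights`), base R's `typeof()` reports the full type name for each column:

```{r}
# A toy data frame with one column of each type listed above.
# All names and values are invented for illustration.
df <- data.frame(
  flag  = c(TRUE, FALSE),   # lgl
  count = c(1L, 2L),        # int
  delay = c(1.5, -3),       # dbl
  code  = c("UA", "AA"),    # chr
  stringsAsFactors = FALSE  # keep `code` as character, not factor
)

# Report the storage type of every column as a named character vector.
vapply(df, typeof, character(1))
```

The same call should work on `flights`, although columns such as date-times report their storage type rather than their class.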
## Dplyr verbs
@@ -185,11 +133,11 @@ sqrt(2) ^ 2 == 2
1/49 * 49 == 1
```
It's better to check that you're close:
It's better instead to use `near()` to check that you're close:
```{r}
abs(sqrt(2) ^ 2 - 2) < 1e-6
abs(1/49 * 49 - 1) < 1e-6
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```
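To see what a tolerance-based comparison involves, the sketch below defines a hypothetical helper, `is_close()`, in base R. It is not dplyr's implementation of `near()`, and the default tolerance is an assumption chosen for illustration.

```{r}
# Hypothetical helper, for illustration only (not dplyr's near() source).
# Two numbers count as equal when they differ by less than `tol`.
is_close <- function(x, y, tol = 1e-8) {
  abs(x - y) < tol
}

sqrt(2) ^ 2 == 2          # exact comparison is tripped up by rounding error
is_close(sqrt(2) ^ 2, 2)  # tolerance-based comparison absorbs that error
is_close(1 / 49 * 49, 1)
is_close(1, 1.1)          # values that really differ stay unequal
```

Wrapping the comparison in a function keeps the tolerance in one place, which is the main convenience over writing `abs(x - y) < 1e-6` by hand each time.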
### Logical operators