Proofing transformation

This commit is contained in:
hadley 2016-07-22 09:15:55 -05:00
parent e798dac411
commit f69669b4c3
5 changed files with 84 additions and 132 deletions

View File

@ -557,14 +557,14 @@ As we move on from these introductory chapters, we'll transition to a more conci
```{r, eval = FALSE}
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)
```
But the first couple of arguments to a function are typically so important that you should know them by heart. The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`. In the remainder of the book, we won't supply those names. That saves typing and, by reducing the amount of boilerplate, makes it easier to see what's different between plots (that's a really important programming concern that we'll come back to in [functions]).
```{r, eval = FALSE}
ggplot(faithful, aes(eruptions)) +
geom_freqpoly(binwidth = 0.25)
```
Sometimes we'll turn the end of a pipeline of data transformation into a plot. Watch for the transition from `%>%` to `+`. I wish this transition wasn't necessary, but unfortunately ggplot2 was created before the pipe was discovered.
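For example, a sketch of that hand-off (assuming dplyr is loaded; its `filter()` verb is introduced in the next chapter):
```{r, eval = FALSE}
faithful %>%
  filter(eruptions > 3) %>%      # transformation steps chained with %>%
  ggplot(aes(eruptions)) +       # ...then plot layers added with +
  geom_freqpoly(binwidth = 0.25)
```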

View File

@ -323,7 +323,7 @@ Now that we have the scheduled arrival and departure times for each flight in fl
```{r}
datetimes %>%
ggplot(aes(scheduled_departure)) +
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
```
Let's instead group flights by day of the week, to see which week days are the busiest, and by hour to see which times of the day are busiest. To do this we will need to extract the day of the week and hour that each flight was scheduled to depart.
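For example, a sketch using lubridate's accessor functions (assuming lubridate is loaded and `scheduled_departure` is a date-time):
```{r, eval = FALSE}
datetimes %>%
  mutate(
    weekday = wday(scheduled_departure, label = TRUE),
    hour = hour(scheduled_departure)
  )
```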

View File

@ -2,30 +2,7 @@
## Introduction
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You'll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.
### Prerequisites
@ -37,32 +14,34 @@ library(nycflights13)
library(ggplot2)
```
Take careful note of the message that's printed when you load dplyr: it tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you'll need to use their full names: `stats::filter()`, `base::intersect()`, etc.
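For example, a minimal sketch of calling one of the masked functions directly (note that `stats::filter()` does linear filtering of time series, and is unrelated to dplyr's `filter()`):
```{r, eval = FALSE}
stats::filter(presidents, rep(1/3, 3))
```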
### nycflights13
To explore the basic data manipulation verbs of dplyr, we'll use `nycflights13::flights`. This data frame contains all `r format(nrow(nycflights13::flights), big.mark = ",")` flights that departed from New York City in 2013. The data comes from the US [Bureau of Transportation Statistics](http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0), and is documented in `?flights`.
```{r}
flights
```
You might notice that this data frame prints a little differently to other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen. (To see the whole dataset, you can run `View(flights)`, which will open the dataset in the RStudio viewer.) It prints differently because it's a __tibble__. Tibbles are data frames, but slightly tweaked to work better in the tidyverse. For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).
You might also have noticed the row of three letter abbreviations under the column names. These describe the type of each variable:
* `lgl` stands for logical, vectors that contain only `TRUE` or `FALSE`.
* `int` stands for integers.
* `dbl` stands for doubles, or real numbers.
* `chr` stands for character vectors, or strings.
### Dplyr basics
In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:
* Pick observations by their values (`filter()`).
* Reorder the rows (`arrange()`).
* Pick variables by their names (`select()`).
* Create new variables with functions of existing variables (`mutate()`).
* Collapse many values down to a single summary (`summarise()`).
These can all be used in conjunction with `group_by()`, which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
@ -79,38 +58,19 @@ Together these properties make it easy to chain together multiple simple steps t
## Filter rows with `filter()`
`filter()` allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:
```{r}
filter(flights, month == 1, day == 1)
```
When you run that line of code, dplyr executes the filtering operation and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you'll need to use the assignment operator, `<-`:
```{r}
jan1 <- filter(flights, month == 1, day == 1)
```
R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses:
```{r}
(dec25 <- filter(flights, month == 12, day == 25))
@ -126,7 +86,7 @@ When you're starting out with R, the easiest mistake to make is to use `=` inste
filter(flights, month = 1)
```
There's another common problem you might encounter when using `==`: floating point numbers. These results might surprise you!
```{r}
sqrt(2) ^ 2 == 2
@ -144,23 +104,19 @@ near(1 / 49 * 49, 1)
### Logical operators
Multiple arguments to `filter()` are combined with "and": every expression must be true in order for a row to be included in the output. For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not". Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right hand circle, and the shaded region show which parts each operator selects."}
knitr::include_graphics("diagrams/transform-logical.png")
```
The following code finds all flights that departed in November or December:
```{r, eval = FALSE}
filter(flights, month == 11 | month == 12)
```
The order of operations doesn't work like English. You can't write `filter(flights, month == 11 | 12)`, which you might literally translate into "finds all flights that departed in November or December". Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`. In a numeric context (like here), `TRUE` becomes one, so this finds all flights in January, not November or December. This is quite confusing!
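One way to avoid this confusion is `%in%`, which selects every row where the variable's value belongs to the set on the right-hand side:
```{r, eval = FALSE}
filter(flights, month %in% c(11, 12))
```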
Sometimes you can simplify complicated subsetting by remembering De Morgan's law: `!(x & y)` is the same as `!x | !y`, and `!(x | y)` is the same as `!x & !y`. For example, if you wanted to find flights that weren't delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
@ -169,7 +125,7 @@ filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
```
As well as `&` and `|`, R also has `&&` and `||`. Don't use them here! You'll learn when you should use them in [conditional execution].
Sometimes you want to find all rows after the first `TRUE`, or all rows until the first `FALSE`. The window functions `cumany()` and `cumall()` allow you to find these values:
@ -183,6 +139,8 @@ filter(df, cumany(x)) # all rows after first TRUE
filter(df, cumall(y)) # all rows until first FALSE
```
(`tibble()` creates a dataset "by hand". You'll learn more about it in [tibbles].)
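For example, a sketch of a `df` this might apply to (the values are purely illustrative):
```{r, eval = FALSE}
df <- tibble(
  x = c(FALSE, TRUE, FALSE),
  y = c(TRUE, FALSE, TRUE)
)
filter(df, cumany(x)) # rows 2-3: everything from the first TRUE onwards
filter(df, cumall(y)) # row 1: everything up to the first FALSE
```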
Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead. That makes it much easier to check your work. You'll learn how to create new variables shortly.
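For example, a sketch that pulls a multipart condition out into its own (hypothetical) `big_delay` variable:
```{r, eval = FALSE}
flights %>%
  mutate(big_delay = arr_delay > 120 | dep_delay > 120) %>%
  filter(!big_delay)
```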
### Missing values
@ -216,7 +174,11 @@ x == y
# We don't know!
```
If you want to determine if a value is missing, use `is.na()`:
```{r}
is.na(x)
```
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly:
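A minimal sketch with a tiny hand-made data frame:
```{r, eval = FALSE}
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)            # drops both the FALSE row and the NA row
filter(df, is.na(x) | x > 1) # keeps the NA row
```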
@ -271,18 +233,6 @@ arrange(df, x)
arrange(df, desc(x))
```
### Exercises
1. How could you use `arrange()` to sort all missing values to the start?
@ -318,7 +268,7 @@ There are a number of helper functions you can use within `select()`:
* `contains("ijk")`: matches names that contain "ijk".
* `matches("(.)\\1")`: selects variables that match a regular expression.
This one matches any variables that contain repeated characters. You'll
learn more about regular expressions in [strings].
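For example, a sketch of a few of these helpers applied to `flights`:
```{r, eval = FALSE}
select(flights, starts_with("dep"))  # dep_time, dep_delay
select(flights, ends_with("delay"))  # dep_delay, arr_delay
select(flights, contains("time"))    # every variable with "time" in its name
```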
@ -344,12 +294,6 @@ Another option is to use `select()` in conjunction with the `everything()` helpe
select(flights, time_hour, air_time, everything())
```
### Exercises
1. Brainstorm as many ways as possible to select `dep_time`, `dep_delay`,
@ -411,21 +355,15 @@ transmute(flights,
)
```
### Useful functions
There are many functions for creating new variables that you can use with `mutate()`. The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output. There's no way to list every possible function that you might use, but here's a selection of functions that are frequently useful:
* Arithmetic operators: `+`, `-`, `*`, `/`, `^`. These are all vectorised,
  using the so-called "recycling rules". If one parameter is shorter than
  the other, it will be automatically extended to be the same length. This
  is most useful when one of the arguments is a single number: `air_time / 60`,
  `hours * 60 + minute`, etc.
Arithmetic operators are also useful in conjunction with the aggregate
functions you'll learn about later. For example, `x / sum(x)` calculates
@ -487,6 +425,9 @@ There are many functions for creating new variables that you can use with `mutat
start with `min_rank()`. It does the most usual type of ranking
(e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the smallest
ranks; use `desc(x)` to give the largest values the smallest ranks.
If `min_rank()` doesn't do what you need, look at the variants
`row_number()`, `dense_rank()`, `cume_dist()`, `percent_rank()`,
`ntile()`.
```{r}
y <- c(1, 2, 2, NA, 3, 4)
@ -499,10 +440,6 @@ There are many functions for creating new variables that you can use with `mutat
) %>% knitr::kable()
```
### Exercises
```{r, eval = FALSE, echo = FALSE}
@ -518,20 +455,27 @@ ggplot(flights, aes(dep_sched %% 60)) + geom_histogram(binwidth = 1)
ggplot(flights, aes(air_time - airtime2)) + geom_histogram()
```
1. Currently `dep_time` and `sched_dep_time` are convenient to look at, but
hard to compute with because they're not really continuous numbers.
Convert them to a more convenient representation of number of minutes
since midnight.
1. Compare `air_time` with `arr_time - dep_time`. What do you expect to see?
What do you see? What do you need to do to fix it?
1. Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you
expect those three numbers to be related?
1. Find the 10 most delayed flights using a ranking function. How do you want
to handle ties? Carefully read the documentation for `min_rank()`.
1. What does `1:3 + 1:10` return? Why?
1. What trigonometric functions does R provide?
## Grouped summaries with `summarise()`
The last key verb is `summarise()`. It collapses a data frame to a single row:
```{r}
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
@ -539,7 +483,7 @@ summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
(We'll come back to what that `na.rm = TRUE` means very shortly.)
`summarise()` is not terribly useful unless we pair it with `group_by()`. This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group". For example, if we applied exactly the same code to a data frame grouped by date, we'd get the average delay per date:
```{r}
by_day <- group_by(flights, year, month, day)
@ -550,19 +494,20 @@ Together `group_by()` and `summarise()` provide one of the tools that you'll use
### Combining multiple operations with the pipe
Imagine that we want to explore the relationship between the distance and average delay for each location. Using what you know about dplyr, you might write code like this:
```{r, fig.width = 6}
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
@ -577,7 +522,7 @@ There are three steps to prepare this data:
1. Filter to remove noisy points and Honolulu airport, which is almost
twice as far away as the next closest airport.
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it. Naming things is hard, so this slows down our analysis.
There's another way to tackle the same problem with the pipe, `%>%`:
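A sketch of the piped equivalent of the code above:
```{r, eval = FALSE}
delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
```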
@ -596,7 +541,7 @@ This focuses on the transformations, not what's being transformed, which makes t
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in [pipes].
Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.
### Missing values
@ -638,7 +583,7 @@ delays <- not_cancelled %>%
)
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
```
Wow, there are some planes that have an _average_ delay of 5 hours (300 minutes)!
@ -657,7 +602,7 @@ ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point()
```
Not surprisingly, there is much greater variation in the average delay when there are few flights. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you'll see that the variation decreases as the sample size increases.
When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows. It's a bit painful that you have to switch from `%>%` to `+`, but once you get the hang of it, it's quite convenient.
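A sketch of that pattern (the cut-off of 25 flights is arbitrary):
```{r, eval = FALSE}
delays %>%
  filter(n > 25) %>%
  ggplot(mapping = aes(x = n, y = delay)) +
  geom_point(alpha = 1/10)
```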
@ -674,13 +619,15 @@ RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends th
--------------------------------------------------------------------------------
There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the __Lahman__ package to compute the batting average (number of hits / number of attempts) of every major league baseball player. When I plot the skill of the batter against the number of times batted, you see two patterns:
1. As above, the variation in our aggregate decreases as we get more
data points.
2. There's a positive correlation between skill (batting average, `ba`) and
number of opportunities to hit the ball (at bat, `ab`). This is because
teams control who gets to play, and obviously they'll pick their best
players.
```{r}
# Convert to a tibble so it prints nicely
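# The rest of this chunk is not shown above; a sketch, assuming the Lahman
# package's Batting table (columns playerID, H = hits, AB = at-bats):
batting <- as_tibble(Lahman::Batting)
batters <- batting %>%
  group_by(playerID) %>%
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    ab = sum(AB, na.rm = TRUE)
  )
```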
@ -714,7 +661,7 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
* Measures of location: we've used `mean(x)`, but `median(x)` is also
useful. The mean is the sum divided by the length; the median is a value
where 50% of `x` is above it, and 50% is below it.
It's sometimes useful to combine aggregation with logical subsetting.
We haven't talked about this sort of subsetting yet, but you'll learn more
about it later in the book.
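For instance, a sketch combining `mean()` with logical subsetting (using the `not_cancelled` data from earlier):
```{r, eval = FALSE}
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
  )
```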
@ -742,7 +689,10 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
arrange(desc(distance_sd))
```
* Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`. Quantiles
are a generalisation of the median. For example, `quantile(x, 0.25)`
will find a value of `x` that is greater than 25% of the values,
and less than the remaining 75%.
```{r}
# When do the first and last flights leave each day?
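# The rest of this chunk is not shown above; a sketch, assuming the
# not_cancelled data frame from earlier:
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )
```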
@ -873,6 +823,8 @@ daily %>%
1. For each plane, count the number of flights before the first delay
of greater than 1 hour.
1. What does the `sort` argument to `count()` do? When might you use it?
## Grouped mutates (and filters)
Grouping is most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`:
@ -880,7 +832,7 @@ Grouping is most useful in conjunction with `summarise()`, but you can also do c
* Find the worst members of each group:
```{r}
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(arr_delay) < 10)
```
@ -898,7 +850,8 @@ Grouping is most useful in conjunction with `summarise()`, but you can also do c
```{r}
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
```
A grouped filter is a grouped mutate followed by an ungrouped filter. I generally avoid them except for quick and dirty manipulations: otherwise it's hard to check that you've done the manipulation correctly.
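For example, these two pipelines should select the same rows (a sketch; `n()` returns the size of the current group, and the threshold is arbitrary):
```{r, eval = FALSE}
flights %>%
  group_by(dest) %>%
  filter(n() > 365)

flights %>%
  group_by(dest) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(n > 365)
```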
@ -918,8 +871,7 @@ Functions that work most naturally in grouped mutates and filters are known as
1. Delays are typically temporally correlated: even once the problem that
caused the initial delay has been resolved, later flights are delayed
to allow earlier flights to leave. Using `lag()` explore how the delay
of a flight is related to the delay of the flight that left just
before.
of a flight is related to the delay of the immediately preceeding flight.
1. Look at each destination. Can you find flights that are suspiciously
fast? (i.e. flights that represent a potential data entry error). Compute

View File

@ -469,8 +469,8 @@ ggplot(data = diamonds) +
On the x axis, the chart displays `cut`, a variable from `diamonds`. On the y axis, it displays count, but count is not a variable in `diamonds`! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:
* __bar charts__, __histograms__, and __frequency polygons__ bin your data
and then plot bin counts, the number of points that fall in each bin.
* __smoothers__ fit a model to your data and then plot predictions from the
model.

View File

@ -1,6 +1,6 @@
# (PART) Wrangle {-}
# Introduction {#wrangle-intro}
In this part of the book, you'll learn about data wrangling, the art of getting your data into R in a useful form. Data wrangling encompasses three main pieces: