Minor polishing to get back into the swing of things
parent
e80ed2d577
commit
821b51d536
|
@ -62,16 +62,6 @@ There are three other common types that aren't used here but you'll encounter la
|
||||||
### dplyr basics
|
### dplyr basics
|
||||||
|
|
||||||
In this chapter you are going to learn the primary dplyr verbs which will allow you to solve the vast majority of your data manipulation challenges.
|
In this chapter you are going to learn the primary dplyr verbs which will allow you to solve the vast majority of your data manipulation challenges.
|
||||||
They are organised into four camps:
|
|
||||||
|
|
||||||
- Functions that operate on **rows**: `filter()` subsets rows based on the values of the columns, `slice()` and friends subsets rows based on their position, and `arrange()` changes the order of the rows.
|
|
||||||
|
|
||||||
- Functions that operate on **columns**: `mutate()` creates new columns, `select()` columns, `rename()` changes their names, and `relocate()` which changes their positions.
|
|
||||||
|
|
||||||
- Functions that operate on **groups**: `group_by()` divides data up into groups for analysis, and `summarise()` reduces each group to a single row.
|
|
||||||
|
|
||||||
Later, in Chapter \@ref(relational-data), you'll learn about other verbs that work with **tables**, like the join functions and the set operations.
|
|
||||||
|
|
||||||
All dplyr verbs work the same way:
|
All dplyr verbs work the same way:
|
||||||
|
|
||||||
1. The first argument is a data frame.
|
1. The first argument is a data frame.
|
||||||
|
@ -81,11 +71,22 @@ All dplyr verbs work the same way:
|
||||||
3. The result is a new data frame.
|
3. The result is a new data frame.
|
||||||
|
|
||||||
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
|
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
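
As a rough illustration of that chaining, here is a minimal sketch. It assumes a small made-up tibble (`toy_flights`, not part of the chapter's data) and uses the pipe, `%>%`, to pass each result on to the next verb:

```r
library(dplyr)

# Hypothetical toy data standing in for the flights dataset
toy_flights <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(5, 20, -3, 45, 0)
)

# Each verb takes a data frame as its first argument and returns a new
# data frame, so simple steps can be chained one after another
delay_summary <- toy_flights %>%
  filter(dep_delay > 0) %>%                  # keep only delayed flights
  group_by(carrier) %>%                      # one group per carrier
  summarise(mean_delay = mean(dep_delay)) %>% # one row per group
  arrange(desc(mean_delay))                  # largest mean delay first

delay_summary
```

None of the steps modify `toy_flights`; each one returns a new data frame that feeds the next.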
|
||||||
|
The verbs are organised into four groups:
|
||||||
|
|
||||||
|
- Functions that operate on **rows**: `filter()` subsets rows based on the values of the columns and `arrange()` changes the order of the rows.
|
||||||
|
|
||||||
|
- Functions that operate on **columns**: `mutate()` creates new columns, `select()` changes which columns are present, `rename()` changes their names, and `relocate()` changes their positions.
|
||||||
|
|
||||||
|
- Functions that operate on **groups**: `group_by()` divides data up into groups for analysis, and `summarise()` reduces each group to a single row.
|
||||||
|
|
||||||
|
- Functions that operate on **tables**, like the join functions and the set operations.
|
||||||
|
We'll come back to these in Chapter \@ref(relational-data).
|
||||||
|
|
||||||
Let's dive in and see how these verbs work.
|
Let's dive in and see how these verbs work.
|
||||||
|
|
||||||
## Rows
|
## Rows
|
||||||
|
|
||||||
These functions affect the rows (the observations), leaving the columns (the variables) unchanged.
|
`filter()` and `arrange()` affect the rows (the observations), leaving the columns (the variables) unchanged.
|
||||||
`filter()` changes which rows are included without changing the order; `arrange()` changes the order without changing the membership.
|
`filter()` changes which rows are included without changing the order; `arrange()` changes the order without changing the membership.
|
||||||
|
|
||||||
### `filter()`
|
### `filter()`
|
||||||
|
@ -111,6 +112,7 @@ jan1 <- filter(flights, month == 1, day == 1)
|
||||||
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
|
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
|
||||||
R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
|
R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
|
||||||
It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`.
|
It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`.
|
||||||
|
We'll come back to these operations again in Chapter \@ref(logicals-numbers).
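
To make the operators concrete, here is a small sketch on a made-up tibble (`df`, with hypothetical columns `x` and `y`); it is only an illustration, not one of the chapter's own examples:

```r
library(dplyr)

# Hypothetical toy data
df <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = c("a", "b", "c", "d", "e")
)

big_x   <- filter(df, x >= 4)                   # rows where x is 4 or 5
not_two <- filter(df, x != 2)                   # every row except x == 2
abc     <- filter(df, y %in% c("a", "b", "c"))  # rows where y is "a", "b", or "c"
```

Note that the values passed to `%in%` are quoted here because `y` is a character column.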
|
||||||
|
|
||||||
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
|
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
|
||||||
`filter()` will let you know when this happens:
|
`filter()` will let you know when this happens:
|
||||||
|
@ -158,7 +160,7 @@ arrange(flights, desc(dep_delay))
|
||||||
|
|
||||||
## Columns
|
## Columns
|
||||||
|
|
||||||
These functions affect the columns (the variables) without changing the rows (the observations).
|
`mutate()`, `select()`, `rename()`, and `relocate()` affect the columns (the variables) without changing the rows (the observations).
|
||||||
`mutate()` creates new variables that are functions of the existing variables; `select()`, `rename()`, and `relocate()` change which variables are present, their names, and their positions.
|
`mutate()` creates new variables that are functions of the existing variables; `select()`, `rename()`, and `relocate()` change which variables are present, their names, and their positions.
|
||||||
|
|
||||||
### `mutate()`
|
### `mutate()`
|
||||||
|
@ -187,8 +189,8 @@ mutate(flights,
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
The leading `.` is a sign that `.before` is an argument to the function, not a new variable being created.
|
The leading `.` is a sign that `.before` is an argument to the function, not the name of a new variable.
|
||||||
You can also use `.after` to add after a variable, and use a variable name instead of a position:
|
You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use a variable name instead of a position:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
mutate(flights,
|
mutate(flights,
|
||||||
|
@ -212,7 +214,7 @@ mutate(flights,
|
||||||
### `select()` {#select}
|
### `select()` {#select}
|
||||||
|
|
||||||
It's not uncommon to get datasets with hundreds or even thousands of variables.
|
It's not uncommon to get datasets with hundreds or even thousands of variables.
|
||||||
In this case, the first challenge is often narrowing in on the variables you're actually interested in.
|
In this case, the first challenge is often focussing on just the variables you're interested in.
|
||||||
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
|
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
|
||||||
|
|
||||||
`select()` is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:
|
`select()` is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:
|
||||||
|
@ -239,7 +241,7 @@ There are a number of helper functions you can use within `select()`:
|
||||||
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
|
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
|
||||||
|
|
||||||
See `?select` for more details.
|
See `?select` for more details.
|
||||||
Once you know regular expressions (the topic of Chapter \@ref(regular-expressions)) you'll also be able to use `matches()` to select variables that match a regexp.
|
Once you know regular expressions (the topic of Chapter \@ref(regular-expressions)) you'll also be able to use `matches()` to select variables that match a pattern.
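
As a quick sketch of these two helpers, assume a made-up wide tibble (`wide`, with hypothetical columns `x1`, `x2`, `x3`, and `x10`):

```r
library(dplyr)

# Hypothetical toy data with numbered columns
wide <- tibble(
  id  = 1:2,
  x1  = c(0.1, 0.2),
  x2  = c(1.1, 1.2),
  x3  = c(2.1, 2.2),
  x10 = c(9.1, 9.2)
)

# num_range() builds the names x1, x2, x3 from a prefix and a range
by_range <- select(wide, num_range("x", 1:3))

# matches() keeps columns whose names match a pattern;
# here: "x" followed by exactly one digit, so x10 is excluded
by_pattern <- select(wide, matches("^x[0-9]$"))
```

In this toy case both calls pick the same three columns; `matches()` becomes more useful when the names don't follow a simple prefix-plus-number scheme.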
|
||||||
|
|
||||||
You can rename variables as you `select()` them by using `=`.
|
You can rename variables as you `select()` them by using `=`.
|
||||||
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
|
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
|
||||||
|
@ -267,7 +269,7 @@ By default it moves variables to the front:
|
||||||
relocate(flights, time_hour, air_time)
|
relocate(flights, time_hour, air_time)
|
||||||
```
|
```
|
||||||
|
|
||||||
But you can use the `.before` and `.after` arguments to choose where to place them:
|
But like with `mutate()`, you can use the `.before` and `.after` arguments to choose where to place them:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
relocate(flights, year:dep_time, .after = time_hour)
|
relocate(flights, year:dep_time, .after = time_hour)
|
||||||
|
@ -406,13 +408,13 @@ daily %>% summarise(n = n())
|
||||||
|
|
||||||
If you're happy with this behaviour, you can explicitly define it in order to suppress the message:
|
If you're happy with this behaviour, you can explicitly define it in order to suppress the message:
|
||||||
|
|
||||||
```{r results = FALSE}
|
```{r, results = FALSE}
|
||||||
daily %>% summarise(n = n(), .groups = "drop_last")
|
daily %>% summarise(n = n(), .groups = "drop_last")
|
||||||
```
|
```
|
||||||
|
|
||||||
Alternatively, you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`:
|
Alternatively, you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`:
|
||||||
|
|
||||||
```{r results = FALSE}
|
```{r, results = FALSE}
|
||||||
daily %>% summarise(n = n(), .groups = "drop")
|
daily %>% summarise(n = n(), .groups = "drop")
|
||||||
daily %>% summarise(n = n(), .groups = "keep")
|
daily %>% summarise(n = n(), .groups = "keep")
|
||||||
```
|
```
|
||||||
|
@ -433,26 +435,17 @@ daily %>%
|
||||||
|
|
||||||
For the purposes of summarising, ungrouped data is treated as if all your data was in a single group, so you get one row back.
|
For the purposes of summarising, ungrouped data is treated as if all your data was in a single group, so you get one row back.
|
||||||
|
|
||||||
### Selecting rows
|
|
||||||
|
|
||||||
`arrange()` and `filter()` are mostly unaffected by grouping.
|
|
||||||
But the slice functions are super useful:
|
|
||||||
|
|
||||||
- `slice_head()` and `slice_tail()` select the first or last rows in each group.
|
|
||||||
|
|
||||||
- `slice_max()` and `slice_min()` select the rows in each group with highest or lowest values.
|
|
||||||
|
|
||||||
- `slice_sample()` random selects rows from each group.
|
|
||||||
|
|
||||||
Each of these verbs takes either a `n` or `prop` argument depending on whether you want to select a fixed number of rows, or a number of rows proportional to the group size.
|
|
||||||
|
|
||||||
### Other verbs
|
### Other verbs
|
||||||
|
|
||||||
|
`group_by()` is usually paired with `summarise()`, but it's good to know how it affects other verbs:
|
||||||
|
|
||||||
- `select()`, `rename()`, `relocate()`: grouping has no effect.
|
- `select()`, `rename()`, `relocate()`: grouping has no effect.
|
||||||
|
|
||||||
- `filter()`, `mutate()`: computation happens per group.
|
- `mutate()`: computation happens per group.
|
||||||
This doesn't affect the functions you currently know but is very useful once you learn about window functions, Section \@ref(window-functions).
|
This doesn't affect the functions you currently know but is very useful once you learn about window functions, Section \@ref(window-functions).
|
||||||
|
|
||||||
|
- `arrange()` and `filter()` are mostly unaffected by grouping, unless you are doing computation (e.g. `filter(flights, dep_delay == min(dep_delay))`), in which case the `mutate()` caveat applies.
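
The `filter()` caveat in the last bullet can be sketched as follows, assuming a small made-up tibble (`toy`, not the real flights data). The same filtering expression gives different results depending on whether the data is grouped:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA"),
  dep_delay = c(5, -2, 10, 3)
)

# Ungrouped: min() is computed over all rows, so one row comes back
ungrouped_min <- filter(toy, dep_delay == min(dep_delay))

# Grouped: min() is computed within each carrier, so one row per group
grouped_min <- toy %>%
  group_by(carrier) %>%
  filter(dep_delay == min(dep_delay))
```

The grouped result is itself still grouped by `carrier`; use `ungroup()` if later steps should treat it as a single table.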
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. Which carrier has the worst delays?
|
1. Which carrier has the worst delays?
|
||||||
|
|