O'Reilly feedback

Hadley Wickham 2022-09-29 10:58:31 -05:00
parent be5905a09c
commit 86324b358d
4 changed files with 10 additions and 10 deletions

@ -339,7 +339,7 @@ gss_cat |>
count(partyid)
```
-`fct_recode()` will leave levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
+`fct_recode()` will leave the levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
To combine groups, you can assign multiple old levels to the same new level:
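
As a minimal sketch of that pattern (not the book's own example, which this hunk cuts off), the snippet below assumes the forcats package is installed and uses a made-up factor named `party`. Each argument has the form `new_level = "old_level"`, so repeating the same new level merges several old levels into one.

```r
library(forcats)

# Hypothetical factor, invented for illustration (not taken from gss_cat).
party <- factor(c("Strong dem", "No answer", "Don't know", "Strong dem", "Other party"))

# Each argument is new_level = "old_level"; reusing "Other" collapses
# three old levels into a single new level.
party_combined <- fct_recode(party,
  "Other" = "No answer",
  "Other" = "Don't know",
  "Other" = "Other party"
)

print(levels(party_combined))
print(table(party_combined))
```

Levels that are not mentioned (here `"Strong dem"`) are left unchanged.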

@ -344,7 +344,7 @@ flights |>
In most cases, however, `any()` and `all()` are a little too crude, and it would be nice to be able to get a little more detail about how many values are `TRUE` or `FALSE`.
That leads us to the numeric summaries.
-### Numeric summaries
+### Numeric summaries of logical vectors
When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` will give the number of `TRUE`s and `mean(x)` the proportion of `TRUE`s.
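
A small base R sketch of this coercion follows. The `delay` values are made up for illustration and are not drawn from the `flights` data.

```r
# Hypothetical delays in minutes (illustrative values only).
delay <- c(-5, 12, 0, 47, 3, -1)

# A comparison produces a logical vector.
late <- delay > 10

# In a numeric context TRUE counts as 1 and FALSE as 0.
n_late <- sum(late)      # number of TRUEs
prop_late <- mean(late)  # proportion of TRUEs

print(n_late)
print(prop_late)
```

Two of the six values exceed 10, so `n_late` is 2 and `prop_late` is one third.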
@ -382,7 +382,7 @@ flights |>
### Logical subsetting
There's one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest.
-This makes use of the base `[` (pronounced subset) operator, which you'll learn more about this in @sec-vector-subsetting.
+This makes use of the base `[` (pronounced subset) operator, which you'll learn more about in @sec-vector-subsetting.
Imagine we wanted to look at the average delay just for flights that were actually delayed.
One way to do so would be to first filter the flights:

@ -35,7 +35,7 @@ To begin, let's explore a few handy tools for creating or eliminating missing ex
### Last observation carried forward
A common use for missing values is as a data entry convenience.
-Sometimes data that has been entered by hand, missing values indicate that the value in the previous row has been repeated:
+When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):
```{r}
treatment <- tribble(
@ -60,7 +60,7 @@ You can use the `.direction` argument to fill in missing values that have been g
### Fixed values
-Some times missing values represent some fixed and known value, mostly commonly 0.
+Some times missing values represent some fixed and known value, most commonly 0.
You can use `dplyr::coalesce()` to replace them:
```{r}

@ -28,7 +28,7 @@ library(tidyverse)
library(nycflights13)
```
-### Counts
+## Counts
It's surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with `count()`.
This function is great for quick exploration and checks during analysis:
@ -59,7 +59,7 @@ flights |>
)
```
-`n()` is a special summary function that doesn't take any arguments and instead access information about the "current" group.
+`n()` is a special summary function that doesn't take any arguments and instead accesses information about the "current" group.
This means that it only works inside dplyr verbs:
```{r}
@ -554,7 +554,7 @@ You can lead or lag by more than one position by using the second argument, `n`.
8. Find all destinations that are flown by at least two carriers.
Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.
-## Summaries
+## Numeric summaries
Just using the counts, means, and sums that we've introduced already can get you a long way, but R provides many other useful summary functions.
Here are a selection that you might find useful.
@ -621,12 +621,12 @@ flights |>
### Spread
-Sometimes you're not so interested in where the bulk of the data lies, but how it is spread out.
+Sometimes you're not so interested in where the bulk of the data lies, but in how it is spread out.
Two commonly used summaries are the standard deviation, `sd(x)`, and the inter-quartile range, `IQR()`.
We won't explain `sd()` here since you're probably already familiar with it, but `IQR()` might be new --- it's `quantile(x, 0.75) - quantile(x, 0.25)` and gives you the range that contains the middle 50% of the data.
We can use this to reveal a small oddity in the `flights` data.
-You might expect that the spread of the distance between origin and destination to be zero, since airports are always in the same place.
+You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place.
But the code below makes it looks like one airport, [EGE](https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport), might have moved.
```{r}