Implement typo fixes & suggestions, closes #1031

Mine Çetinkaya-Rundel 2022-06-01 00:52:33 -04:00
parent 3e5aaa10b5
commit 7618f2e3cb
1 changed files with 10 additions and 10 deletions

@@ -118,7 +118,7 @@ There are a couple of variants of `n()` that you might find useful:
Transformation functions work well with `mutate()` because their output is the same length as the input.
The vast majority of transformation functions are already built into base R.
It's impractical to list them all so this section will give show the most useful.
It's impractical to list them all so this section will show the most useful ones.
As an example, while R provides all the trigonometric functions that you might dream of, I don't list them here because they're rarely needed for data science.
### Arithmetic and recycling rules
@@ -139,7 +139,7 @@ x / c(5, 5, 5, 5)
```
Generally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector.
It usually (but not always) warning if the longer vector isn't a multiple of the shorter:
It usually (but not always) gives you a warning if the longer vector isn't a multiple of the shorter:
```{r}
x * c(1, 2)
@@ -444,7 +444,7 @@ df <- tibble(x = runif(10))
df |>
mutate(
row0 = row_number() - 1,
three_groups = row0 %/% (n() / 3),
three_groups = row0 %% 3,
three_in_each_group = row0 %/% 3,
)
```
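As a minimal sketch of the difference between the two operators in this hunk, the base R snippet below applies `%/%` (integer division) and `%%` (remainder) to a zero-based row index. The vector `row0` here is a stand-in with nine values, an assumption for illustration rather than the `df` from the chapter.

```{r}
# Hypothetical zero-based row index for nine rows
row0 <- 0:8

# Integer division puts consecutive rows into blocks of three
in_blocks_of_three <- row0 %/% 3
in_blocks_of_three
# 0 0 0 1 1 1 2 2 2

# The remainder cycles through three labels, giving three interleaved groups
three_labels <- row0 %% 3
three_labels
# 0 1 2 0 1 2 0 1 2
```

So `%% 3` always yields exactly three distinct group labels, while `%/% 3` yields as many groups as are needed to hold three rows each.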
@@ -531,13 +531,13 @@ Depending on the shape of the distribution of the variable you're interested in,
For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
@fig-mean-vs-median compares the mean vs the median when looking at the hourly vs median departure delay.
The median delay is always smaller than the mean delay because because flight sometimes leave multiple hours late, but never leave multiple hours early.
The median delay is always smaller than the mean delay because because flights sometimes leave multiple hours late, but never leave multiple hours early.
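To see why a long right tail pulls the mean above the median, here is a small sketch with made-up delays (in minutes); the numbers are invented for illustration and are not taken from `flights`.

```{r}
# Hypothetical departure delays: mostly near zero, one flight four hours late
delays <- c(-5, -2, 0, 3, 240)

mean_delay <- mean(delays)     # dragged upwards by the single very late flight
median_delay <- median(delays) # unaffected by how late that flight was

mean_delay
# 47.2
median_delay
# 0
```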
```{r}
#| label: fig-mean-vs-median
#| fig-cap: >
#| A scatterplot showing the differences of summarising hourly depature
#| delay with median instead of median.
#| delay with median instead of mean.
#| fig-alt: >
#| All points fall below a 45° line, meaning that the median delay is
#| always less than the mean delay. Most points are clustered in a
@@ -559,7 +559,7 @@ flights |>
You might also wonder about the **mode**, or the most common value.
This is a summary that only works well for very simple cases (which is why you might have learned about it in high school), but it doesn't work well for many real datasets.
If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is every so slightly different.
If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is ever so slightly different.
For these reasons, the mode tends not to be used by statisticians and there's no mode function included in base R[^numbers-1].
[^numbers-1]: The `mode()` function does something quite different!
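The following sketch illustrates both points using base R only: `mode()` reports how a vector is stored, and a hand-rolled tally with `table()` (one possible approach, not a function from the chapter) can return more than one most common value.

```{r}
# Made-up discrete data with two equally common values
x <- c(1, 2, 2, 3, 3, 5)

# mode() describes the storage type, not the most frequent value
storage <- mode(x)
storage
# "numeric"

# Count each distinct value and keep every value tied for the top count
counts <- table(x)
most_common <- as.numeric(names(counts)[counts == max(counts)])
most_common
# 2 3
```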
@@ -584,7 +584,7 @@ flights |>
### Spread
Sometimes you're not so interested in where the bulk of the data lies, but how spread out it.
Sometimes you're not so interested in where the bulk of the data lies, but how it is spread out.
Two commonly used summaries are the standard deviation, `sd(x)`, and the inter-quartile range, `IQR()`.
I won't explain `sd()` here since you're probably already familiar with it, but `IQR()` might be new --- it's `quantile(x, 0.75) - quantile(x, 0.25)` and gives you the range that contains the middle 50% of the data.
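A quick way to check that identity is to compute both sides on a small vector. The data below are made up for illustration, and the comparison assumes the default quantile type.

```{r}
# Made-up values for illustration
x <- c(1, 2, 4, 7, 11, 16, 22)

# Difference between the 75th and 25th percentiles, computed by hand
iqr_manual <- unname(quantile(x, 0.75) - quantile(x, 0.25))

# The built-in function should agree with the manual calculation
iqr_builtin <- IQR(x)
all.equal(iqr_builtin, iqr_manual)
```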
@@ -670,7 +670,7 @@ Finally, don't forget what you learned in @sec-sample-size: whenever creating nu
### Positions
There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at specific position.
You can do this with the base R `[` function, but we're not cover it until @sec-vector-subsetting, because it's a very powerful and general function.
You can do this with the base R `[` function, but we're not going to cover it until @sec-vector-subsetting, because it's a very powerful and general function.
For now we'll introduce three specialized functions that you can use to extract values at a specified position: `first(x)`, `last(x)`, and `nth(x, n)`.
For example, we can find the first and last departure for each day:
@@ -689,8 +689,8 @@ flights |>
If you're familiar with `[`, you might wonder if you ever need these functions.
I think there are main reasons: the `default` argument and the `order_by` argument.
`default` allows you to set a default value that's use if the requested position doesn't exist, e.g. you're trying to get the 3rd element from a two element group.
`order_by` lets you locally override the existing ordering of the rows, so you can
`default` allows you to set a default value that's used if the requested position doesn't exist, e.g. you're trying to get the 3rd element from a two element group.
`order_by` lets you locally override the existing ordering of the rows, so you can get the element at the position in the ordering by `order_by()`.
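Here is a minimal sketch of both arguments on plain vectors, assuming dplyr is installed. The vectors `x`, `ids`, and `scores` are made up for illustration.

```{r}
library(dplyr)

# A two element vector: position 3 doesn't exist, so `default` is returned
x <- c(10, 20)
third <- nth(x, 3, default = -1)
third
# -1

# `order_by` picks the first element after sorting by another vector,
# without reordering the data itself
ids <- c("a", "b", "c")
scores <- c(5, 9, 2)
lowest_scoring <- first(ids, order_by = scores)
lowest_scoring
# "c"
```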
Extracting values at positions is complementary to filtering on ranks.
Filtering gives you all variables, with each observation in a separate row: