* Update numbers.qmd

mcsnowface 2022-12-01 00:11:30 -07:00 committed by GitHub
parent 2069938079
commit bf474ecffc
1 changed file with 16 additions and 17 deletions


@@ -33,7 +33,7 @@ library(nycflights13)
## Making numbers
In most cases, you'll get numbers already recorded in one of R's numeric types: integer or double.
- In some cases, however, you'll encounter them as strings, possibly because you've created them by pivoting from column headers or something has gone wrong in your data import process.
+ In some cases, however, you'll encounter them as strings, possibly because you've created them by pivoting from column headers or because something has gone wrong in your data import process.
readr provides two useful functions for parsing strings into numbers: `parse_double()` and `parse_number()`.
Use `parse_double()` when you have numbers that have been written as strings:
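For example (a quick sketch; these are standard readr functions, with made-up input strings):
```{r}
library(readr)

# parse_double() handles numbers stored as plain strings
parse_double(c("1.2", "5.6"))

# parse_number() ignores surrounding text like currency symbols and commas
parse_number(c("$1,234", "45%"))
```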
@@ -62,7 +62,7 @@ flights |> count(dest)
(Despite the advice in @sec-workflow-style, we usually put `count()` on a single line because it's usually used at the console for a quick check that a calculation is working as expected.)
- If you want to see the most common values add `sort = TRUE`:
+ If you want to see the most common values, add `sort = TRUE`:
```{r}
flights |> count(dest, sort = TRUE)
@@ -225,7 +225,7 @@ In R, `%/%` does integer division and `%%` computes the remainder:
```{r}
1:10 %/% 3
1:10 %% 3
```
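As a quick sanity check (an added illustration, not from the original text), the two operators always recombine into the original value:
```{r}
# for positive integers, x equals (x %/% y) * y + (x %% y)
x <- 1:10
all(x == (x %/% 3) * 3 + (x %% 3))
```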
- Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into and `hour` and `minute`:
+ Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:
```{r}
flights |>
  mutate(
    hour = sched_dep_time %/% 100,
    minute = sched_dep_time %% 100,
    .keep = "used"
  )
@@ -273,12 +273,12 @@ starting <- 100
interest <- 1.05
money <- tibble(
-   year = 2000 + 1:50,
-   money = starting * interest^(1:50)
+   year = 1:50,
+   money = starting * interest ^ year
)
```
- If you plot this data, you'll get an exponential curve:
+ If you plot this data, you'll get an exponential curve showing how your money grows year by year at an interest rate of 5%:
```{r}
ggplot(money, aes(year, money)) +
  geom_line()
@@ -293,12 +293,12 @@ ggplot(money, aes(year, money)) +
scale_y_log10()
```
- This a straight line because a little algebra reveals that `log(money) = log(starting) + n * log(interest)`, which matches the pattern for a line, `y = m * x + b`.
+ This is a straight line because a little algebra reveals that `log10(money) = log10(interest) * year + log10(starting)`, which matches the pattern for a line, `y = m * x + b`.
This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.
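One way to check this (a sketch reusing the `money` data from above): fit a line on the log10 scale and compare its slope to `log10(interest)`.
```{r}
# the slope should be ~log10(1.05) and the intercept ~log10(starting)
coef(lm(log10(money) ~ year, data = money))
log10(1.05)
```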
If you're log-transforming your data with dplyr, you have a choice of three logarithms provided by base R: `log()` (the natural log, base e), `log2()` (base 2), and `log10()` (base 10).
We recommend using `log2()` or `log10()`.
- `log2()` is easy to interpret because difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g) 3 is 10\^3 = 1000.
+ `log2()` is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g.) 3 is 10\^3 = 1000.
The inverse of `log()` is `exp()`; to compute the inverse of `log2()` or `log10()` you'll need to use `2^` or `10^`.
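A minimal illustration of those inverses (an added example):
```{r}
# applying the matching inverse recovers the original value
exp(log(10))    # 10
2^log2(10)      # 10
10^log10(10)    # 10
```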
@@ -339,7 +339,7 @@ floor(x)
ceiling(x)
```
- These functions don't have a digits argument, so you can instead scale down, round, and then scale back up:
+ These functions don't have a `digits` argument, so you can instead scale down, round, and then scale back up:
```{r}
# Round down to nearest two digits
floor(x / 0.01) * 0.01
@@ -583,8 +583,8 @@ df |>
3. What time of day should you fly if you want to avoid delays as much as possible?
- 4. What does `flights |> group_by(dest() |> filter(row_number() < 4)` do?
-    What does `flights |> group_by(dest() |> filter(row_number(dep_delay) < 4)` do?
+ 4. What does `flights |> group_by(dest) |> filter(row_number() < 4)` do?
+    What does `flights |> group_by(dest) |> filter(row_number(dep_delay) < 4)` do?
5. For each destination, compute the total minutes of delay.
For each flight, compute the proportion of the total delay for its destination.
@@ -607,8 +607,7 @@ df |>
```
7. Look at each destination.
-    Can you find flights that are suspiciously fast?
-    (i.e. flights that represent a potential data entry error).
+    Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)?
Compute the air time of a flight relative to the shortest flight to that destination.
Which flights were most delayed in the air?
@@ -618,7 +617,7 @@ df |>
## Numeric summaries
Just using the counts, means, and sums that we've introduced already can get you a long way, but R provides many other useful summary functions.
- Here are a selection that you might find useful.
+ Here is a selection that you might find useful.
### Center
@@ -629,7 +628,7 @@ Depending on the shape of the distribution of the variable you're interested in,
For example, for symmetric distributions we generally report the mean, while for skewed distributions we usually report the median.
@fig-mean-vs-median compares the mean vs. the median when looking at the hourly departure delay.
- The median delay is always smaller than the mean delay because because flights sometimes leave multiple hours late, but never leave multiple hours early.
+ The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.
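A toy vector (illustrative values, not flight data) makes the asymmetry concrete:
```{r}
# one extreme late value drags the mean up but barely moves the median
delays <- c(-5, -2, 0, 3, 240)
mean(delays)
median(delays)
```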
```{r}
#| label: fig-mean-vs-median
@@ -666,7 +665,7 @@ For these reasons, the mode tends not to be used by statisticians and there's no
What if you're interested in locations other than the center?
`min()` and `max()` will give you the smallest and largest values.
- Another powerful tool is `quantile()` which is a generalization of the median: `quantile(x, 0.25)` will find the value of `x` that is greater than 25% of the values, `quantile(x, 0.5)` is equivalent to the median, and `quantile(x, 0.95)` will find a value that's greater than 95% of the values.
+ Another powerful tool is `quantile()` which is a generalization of the median: `quantile(x, 0.25)` will find the value of `x` that is greater than 25% of the values, `quantile(x, 0.5)` is equivalent to the median, and `quantile(x, 0.95)` will find the value that's greater than 95% of the values.
For the `flights` data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% most delayed flights, which can be quite extreme.
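A sketch of that idea (assuming the usual `flights` columns):
```{r}
# the 95% quantile of departure delay for each destination
flights |>
  group_by(dest) |>
  summarize(delay_q95 = quantile(dep_delay, 0.95, na.rm = TRUE))
```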
@@ -767,7 +766,7 @@ Finally, don't forget what you learned in @sec-sample-size: whenever creating nu
### Positions
- There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at specific position.
+ There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position.
You can do this with the base R `[` function, but we're not going to cover it in detail until @sec-subset-many, because it's a very powerful and general function.
For now we'll introduce three specialized functions that you can use to extract values at a specified position: `first(x)`, `last(x)`, and `nth(x, n)`.
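For example (a minimal sketch, grouping by day):
```{r}
# the first, third, and last departure time on each day
flights |>
  group_by(year, month, day) |>
  summarize(
    first_dep = first(dep_time),
    third_dep = nth(dep_time, 3),
    last_dep  = last(dep_time)
  )
```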