parent
2069938079
commit
bf474ecffc
33
numbers.qmd
|
@ -33,7 +33,7 @@ library(nycflights13)
|
||||||
## Making numbers
|
## Making numbers
|
||||||
|
|
||||||
In most cases, you'll get numbers already recorded in one of R's numeric types: integer or double.
|
In most cases, you'll get numbers already recorded in one of R's numeric types: integer or double.
|
||||||
In some cases, however, you'll encounter them as strings, possibly because you've created them by pivoting from column headers or something has gone wrong in your data import process.
|
In some cases, however, you'll encounter them as strings, possibly because you've created them by pivoting from column headers or because something has gone wrong in your data import process.
|
||||||
|
|
||||||
readr provides two useful functions for parsing strings into numbers: `parse_double()` and `parse_number()`.
|
readr provides two useful functions for parsing strings into numbers: `parse_double()` and `parse_number()`.
|
||||||
Use `parse_double()` when you have numbers that have been written as strings:
|
Use `parse_double()` when you have numbers that have been written as strings:
|
||||||
|
@ -62,7 +62,7 @@ flights |> count(dest)
|
||||||
|
|
||||||
(Despite the advice in @sec-workflow-style, we usually put `count()` on a single line because it's usually used at the console for a quick check that a calculation is working as expected.)
|
(Despite the advice in @sec-workflow-style, we usually put `count()` on a single line because it's usually used at the console for a quick check that a calculation is working as expected.)
|
||||||
|
|
||||||
If you want to see the most common values add `sort = TRUE`:
|
If you want to see the most common values, add `sort = TRUE`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
flights |> count(dest, sort = TRUE)
|
flights |> count(dest, sort = TRUE)
|
||||||
|
@ -225,7 +225,7 @@ In R, `%/%` does integer division and `%%` computes the remainder:
|
||||||
1:10 %% 3
|
1:10 %% 3
|
||||||
```
|
```
|
||||||
|
|
||||||
Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into and `hour` and `minute`:
|
Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
flights |>
|
flights |>
|
||||||
|
@ -273,12 +273,12 @@ starting <- 100
|
||||||
interest <- 1.05
|
interest <- 1.05
|
||||||
|
|
||||||
money <- tibble(
|
money <- tibble(
|
||||||
year = 2000 + 1:50,
|
year = 1:50,
|
||||||
money = starting * interest^(1:50)
|
money = starting * interest ^ year
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
If you plot this data, you'll get an exponential curve:
|
If you plot this data, you'll get an exponential curve showing how your money grows year by year at an interest rate of 1.05:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
ggplot(money, aes(year, money)) +
|
ggplot(money, aes(year, money)) +
|
||||||
|
@ -293,12 +293,12 @@ ggplot(money, aes(year, money)) +
|
||||||
scale_y_log10()
|
scale_y_log10()
|
||||||
```
|
```
|
||||||
|
|
||||||
This is a straight line because a little algebra reveals that `log(money) = log(starting) + n * log(interest)`, which matches the pattern for a line, `y = m * x + b`.
|
This is a straight line because a little algebra reveals that `log10(money) = log10(interest) * year + log10(starting)`, which matches the pattern for a line, `y = m * x + b`.
|
||||||
This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.
|
This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.
|
||||||
|
|
||||||
If you're log-transforming your data with dplyr you have a choice of three logarithms provided by base R: `log()` (the natural log, base e), `log2()` (base 2), and `log10()` (base 10).
|
If you're log-transforming your data with dplyr you have a choice of three logarithms provided by base R: `log()` (the natural log, base e), `log2()` (base 2), and `log10()` (base 10).
|
||||||
We recommend using `log2()` or `log10()`.
|
We recommend using `log2()` or `log10()`.
|
||||||
`log2()` is easy to interpret because difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g) 3 is 10\^3 = 1000.
|
`log2()` is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g.) 3 is 10\^3 = 1000.
|
||||||
|
|
||||||
The inverse of `log()` is `exp()`; to compute the inverse of `log2()` or `log10()` you'll need to use `2^` or `10^`.
|
The inverse of `log()` is `exp()`; to compute the inverse of `log2()` or `log10()` you'll need to use `2^` or `10^`.
|
||||||
|
|
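A quick base R check of the identity in the revised sentence, offered as a sketch rather than as part of the commit. It reuses `starting`, `interest`, and `year` from the hunk above but builds plain vectors instead of a tibble. The name `line` is introduced here purely for illustration.

```r
# Values taken from the hunk above; base R only, no tibble needed.
starting <- 100
interest <- 1.05
year <- 1:50
money <- starting * interest^year

# On the log10 scale, money is linear in year:
# slope = log10(interest), intercept = log10(starting).
line <- log10(interest) * year + log10(starting)
all.equal(log10(money), line)

# Back-transforming: 10^ undoes log10(), 2^ undoes log2(), exp() undoes log().
all.equal(10^log10(money), money)
all.equal(2^log2(money), money)
all.equal(exp(log(money)), money)
```

`all.equal()` is used instead of `==` because the comparison involves floating-point arithmetic.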
||||||
|
@ -339,7 +339,7 @@ floor(x)
|
||||||
ceiling(x)
|
ceiling(x)
|
||||||
```
|
```
|
||||||
|
|
||||||
These functions don't have a digits argument, so you can instead scale down, round, and then scale back up:
|
These functions don't have a `digits` argument, so you can instead scale down, round, and then scale back up:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
# Round down to nearest two digits
|
# Round down to nearest two digits
|
||||||
|
@ -583,8 +583,8 @@ df |>
|
||||||
|
|
||||||
3. What time of day should you fly if you want to avoid delays as much as possible?
|
3. What time of day should you fly if you want to avoid delays as much as possible?
|
||||||
|
|
||||||
4. What does `flights |> group_by(dest() |> filter(row_number() < 4)` do?
|
4. What does `flights |> group_by(dest) |> filter(row_number() < 4)` do?
|
||||||
What does `flights |> group_by(dest() |> filter(row_number(dep_delay) < 4)` do?
|
What does `flights |> group_by(dest) |> filter(row_number(dep_delay) < 4)` do?
|
||||||
|
|
||||||
5. For each destination, compute the total minutes of delay.
|
5. For each destination, compute the total minutes of delay.
|
||||||
For each flight, compute the proportion of the total delay for its destination.
|
For each flight, compute the proportion of the total delay for its destination.
|
||||||
|
@ -607,8 +607,7 @@ df |>
|
||||||
```
|
```
|
||||||
|
|
||||||
7. Look at each destination.
|
7. Look at each destination.
|
||||||
Can you find flights that are suspiciously fast?
|
Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)?
|
||||||
(i.e. flights that represent a potential data entry error).
|
|
||||||
Compute the air time of a flight relative to the shortest flight to that destination.
|
Compute the air time of a flight relative to the shortest flight to that destination.
|
||||||
Which flights were most delayed in the air?
|
Which flights were most delayed in the air?
|
||||||
|
|
||||||
|
@ -618,7 +617,7 @@ df |>
|
||||||
## Numeric summaries
|
## Numeric summaries
|
||||||
|
|
||||||
Just using the counts, means, and sums that we've introduced already can get you a long way, but R provides many other useful summary functions.
|
Just using the counts, means, and sums that we've introduced already can get you a long way, but R provides many other useful summary functions.
|
||||||
Here are a selection that you might find useful.
|
Here is a selection that you might find useful.
|
||||||
|
|
||||||
### Center
|
### Center
|
||||||
|
|
||||||
|
@ -629,7 +628,7 @@ Depending on the shape of the distribution of the variable you're interested in,
|
||||||
For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
|
For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
|
||||||
|
|
||||||
@fig-mean-vs-median compares the mean vs the median departure delay.
|
@fig-mean-vs-median compares the mean vs the median departure delay.
|
||||||
The median delay is always smaller than the mean delay because because flights sometimes leave multiple hours late, but never leave multiple hours early.
|
The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| label: fig-mean-vs-median
|
#| label: fig-mean-vs-median
|
||||||
|
@ -666,7 +665,7 @@ For these reasons, the mode tends not to be used by statisticians and there's no
|
||||||
|
|
||||||
What if you're interested in locations other than the center?
|
What if you're interested in locations other than the center?
|
||||||
`min()` and `max()` will give you the smallest and largest values.
|
`min()` and `max()` will give you the smallest and largest values.
|
||||||
Another powerful tool is `quantile()` which is a generalization of the median: `quantile(x, 0.25)` will find the value of `x` that is greater than 25% of the values, `quantile(x, 0.5)` is equivalent to the median, and `quantile(x, 0.95)` will find a value that's greater than 95% of the values.
|
Another powerful tool is `quantile()` which is a generalization of the median: `quantile(x, 0.25)` will find the value of `x` that is greater than 25% of the values, `quantile(x, 0.5)` is equivalent to the median, and `quantile(x, 0.95)` will find the value that's greater than 95% of the values.
|
||||||
|
|
||||||
For the `flights` data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.
|
For the `flights` data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.
|
||||||
|
|
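To make the `quantile()` description in this hunk concrete, here is a minimal base R sketch. The `delay` vector is hypothetical (it is not drawn from `nycflights13`) and includes one extreme value to show why the 95% quantile can be more informative than the maximum.

```r
# Hypothetical delays in minutes, for illustration only: mostly small, one extreme.
delay <- c(-5, -2, 0, 1, 3, 4, 6, 8, 12, 15, 20, 25, 30, 45, 60, 75, 90, 120, 180, 900)

min(delay)  # smallest value
max(delay)  # largest value, dominated by the single extreme delay

# quantile(x, 0.5) agrees with median(x) under the default settings.
q50 <- quantile(delay, 0.5, names = FALSE)
all.equal(q50, median(delay))

# The 95% quantile sits well below the extreme maximum.
q95 <- quantile(delay, 0.95, names = FALSE)
q95 < max(delay)
```

`names = FALSE` drops the percentage label that `quantile()` attaches by default, which keeps the comparisons simple.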
||||||
|
@ -767,7 +766,7 @@ Finally, don't forget what you learned in @sec-sample-size: whenever creating nu
|
||||||
|
|
||||||
### Positions
|
### Positions
|
||||||
|
|
||||||
There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at specific position.
|
There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position.
|
||||||
You can do this with the base R `[` function, but we're not going to cover it in detail until @sec-subset-many, because it's a very powerful and general function.
|
You can do this with the base R `[` function, but we're not going to cover it in detail until @sec-subset-many, because it's a very powerful and general function.
|
||||||
For now we'll introduce three specialized functions that you can use to extract values at a specified position: `first(x)`, `last(x)`, and `nth(x, n)`.
|
For now we'll introduce three specialized functions that you can use to extract values at a specified position: `first(x)`, `last(x)`, and `nth(x, n)`.
|
||||||
|
|
||||||
|
|
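The three position helpers named in the last hunk can be tried on a small vector. This is a sketch that assumes dplyr is installed, and `x` is a hypothetical vector introduced only for illustration.

```r
library(dplyr)

# Hypothetical vector, used only for illustration.
x <- c(10, 20, 30, 40)

first(x)   # value at position 1
last(x)    # value at the final position
nth(x, 2)  # value at position 2
```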