In this chapter, you'll learn useful tools for creating and manipulating numeric vectors.
We'll start by going into a little more detail about `count()` before diving into various numeric transformations.
You'll then learn about more general transformations that are often used with numeric vectors, but also work with other types.
Finally, you'll learn about a few more useful summaries, and we'll finish up with a comparison of function variants that have similar names and similar actions, but are each designed for a specific use case.
(Despite the advice in Chapter \@ref(code-style), I usually put `count()` on a single line because I'm usually using it at the console for a quick check that my calculation is working as expected.)
Base R provides many useful transformation functions that you can use with `mutate()`.
We'll come back to this distinction later in Section \@ref(variants), but the key property that they all possess is that the output is the same length as the input.
There's no way to list every possible function that you might use, so this section aims to give a selection of the most useful.
One category that I've deliberately omitted is the trigonometric functions; R provides all the trig functions that you might expect, but they're rarely needed for data science.
Generally, you want to recycle vectors of length 1, but R supports a rather more general rule where it will recycle any shorter length vector, usually (but not always) warning if the longer vector isn't a multiple of the shorter:
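A minimal sketch of these cases, using a made-up vector `x` (the values are only for illustration):

```{r}
x <- c(1, 2, 10, 20)  # made-up values for illustration

# A length-1 vector is recycled against every element.
x / 5

# A length-2 vector is recycled twice; there is no warning because
# 4 is a multiple of 2.
x * c(1, 2)

# A length-3 vector does not fit evenly into length 4, so R recycles
# it anyway and emits a warning.
x * c(1, 2, 3)
```

The silent recycling in the second case is the one to watch for, since it produces a result without any hint that the lengths differed.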
Modular arithmetic is the technical name for the type of math you did before you learned about real numbers, i.e. when you did division that yielded a whole number and a remainder.
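As a small sketch, base R's `%/%` does integer division and `%%` computes the remainder.
The value `517` below is a hypothetical time in HHMM format, chosen to resemble the `sched_dep_time` column in `flights`:

```{r}
# Integer division (%/%) gives the whole-number part and the
# remainder operator (%%) gives what is left over.
1:10 %/% 3
1:10 %% 3

# Hypothetical scheduled departure time in HHMM format.
sched_dep_time <- 517
hour <- sched_dep_time %/% 100
minute <- sched_dep_time %% 100
c(hour = hour, minute = minute)
```

Splitting a time like this gives you an hour that you can group by.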
We can combine that with the `mean(is.na(x))` trick from Section \@ref(logical-summaries) to see how the proportion of cancelled flights varies over the course of the day.
The results are shown in Figure \@ref(fig:prop-cancelled).
```{r prop-cancelled}
#| fig.cap: >
#| A line plot with scheduled departure hour on the x-axis, and proportion
#| of cancelled flights on the y-axis. Cancellations seem to accumulate
#| over the course of the day until 8pm; very late flights are much
#| less likely to be cancelled.
#| fig.alt: >
#| A line plot showing how proportion of cancelled flights changes over
#| the course of the day. The proportion starts low at around 0.5% at
#| 6am, then steadily increases over the course of the day until peaking
#| at 4% at 7pm. The proportion of cancelled flights then drops rapidly
For example, take compounding interest --- the amount of money you have at `year + 1` is the amount of money you had at `year` multiplied by the interest rate.
That gives a formula like `money = starting * interest ^ year`:
```{r}
starting <- 100
interest <- 1.05
money <- tibble(
  year = 2000 + 1:50,
  money = starting * interest^(1:50)
)
```
If you plot this data, you'll get a curve:
```{r}
ggplot(money, aes(year, money)) +
  geom_line()
```
Log transforming the y-axis gives a straight line:
```{r}
ggplot(money, aes(year, money)) +
  geom_line() +
  scale_y_log10()
```
We get a straight line because (after a little algebra) we get `log(money) = log(starting) + n * log(interest)`, where `n` is the number of years elapsed; this matches the pattern for a straight line, `y = m * x + b`, with slope `log(interest)` and intercept `log(starting)`.
This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's an underlying multiplicative relationship.
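One way to check the algebra numerically, as a rough sketch that reuses the same starting amount and interest rate and uses `n` for the number of years elapsed:

```{r}
starting <- 100
interest <- 1.05
n <- 1:50  # years elapsed

money <- starting * interest^n

# On the log scale, each extra year adds the same amount, log(interest),
# which is the constant slope of the straight line.
log_steps <- diff(log(money))
all.equal(log_steps, rep(log(interest), length(log_steps)))

# The intercept is log(starting): stepping back one year from n = 1
# recovers the starting amount on the log scale.
all.equal(log(money[1]) - log(interest), log(starting))
```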
If you're log-transforming your data with dplyr, instead of relying on ggplot2 to do it for you, you have a choice of three logarithms: `log()` (the natural log, base e), `log2()` (base 2), and `log10()` (base 10).
`log2()` is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving, whereas `log10()` is easy to back-transform because (e.g.) 3 is 10\^3 = 1000.
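A quick sketch of both properties, using made-up values:

```{r}
x <- c(1, 2, 4, 8)  # made-up values; each is double the previous one

# Each doubling on the original scale is a step of 1 on the log2 scale.
log2(x)
diff(log2(x))

# log10 is easy to back-transform: 10 raised to the logged value
# recovers the original.
log10(1000)
10^log10(1000)
```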
Base R provides `cumsum()`, `cumprod()`, `cummin()`, `cummax()` for running, or cumulative, sums, products, mins and maxes, and dplyr provides `cummean()` for cumulative means.
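A minimal sketch of the base R functions on made-up vectors.
The last line is a base R stand-in for a cumulative mean; current versions of `dplyr::cummean()` should give the same values, but that call is not run here:

```{r}
x <- 1:10

cumsum(x)                 # running total
cumprod(1:5)              # running product
cummax(c(1, 3, 2, 5, 4))  # largest value seen so far
cummin(c(5, 3, 4, 1, 2))  # smallest value seen so far

# Running total divided by the number of values seen so far.
cumsum(x) / seq_along(x)
```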
If `min_rank()` doesn't do what you need, look at the variants `dplyr::row_number()`, `dplyr::dense_rank()`, `dplyr::percent_rank()`, `dplyr::cume_dist()`, `dplyr::ntile()`, as well as base R's `rank()`.
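To see how tie handling differs between ranking methods, here is a sketch that uses only base R on a made-up vector.
The first call uses the tie method that `min_rank()` is documented to be equivalent to; the `match()` idiom is one assumed way to mimic a dense ranking, not dplyr's implementation:

```{r}
x <- c(1, 2, 2, 3, 4, NA)  # made-up values with a tie and a missing value

# Ties share the smallest rank and the next rank is skipped.
rank(x, ties.method = "min", na.last = "keep")

# A dense ranking leaves no gaps after ties.
match(x, sort(unique(x)))

# Base R's default averages the ranks of tied values.
rank(x, na.last = "keep")
```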
If your rows have a meaningful order, you can use base R's `[`, or dplyr's `first(x)`, `nth(x, 2)`, or `last(x)` to extract values at a certain position.
The chief advantage of `first()` and `nth()` over `[` is that you can set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
The chief advantage of `last()` over `[` is that you can write `last(x)` rather than `x[length(x)]`.
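A short sketch of the difference, assuming dplyr is installed and using a made-up two-element vector:

```{r}
library(dplyr)

x <- c(10, 20)  # made-up vector with only two elements

first(x)
last(x)  # same as x[length(x)]

# Position 3 does not exist: `[` returns NA, while nth() lets you
# choose what to return instead.
x[3]
nth(x, 3, default = 0)
```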
1. Find the 10 most delayed flights using a ranking function.
How do you want to handle ties?
Carefully read the documentation for `min_rank()`.
2. Which plane (`tailnum`) has the worst on-time record?
3. What time of day should you fly if you want to avoid delays as much as possible?
4. For each destination, compute the total minutes of delay.
For each flight, compute the proportion of the total delay for its destination.
5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.
6. Look at each destination.
Can you find flights that are suspiciously fast?
(i.e. flights that represent a potential data entry error).
Compute the air time of a flight relative to the shortest flight to that destination.
Which flights were most delayed in the air?
7. Find all destinations that are flown by at least two carriers.
Just using means, counts, and sums can get you a long way, but R provides many other useful summary functions.
### Center
We've used `mean(x)`, but `median(x)` is also useful.
The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
```{r}
flights |>
  group_by(month) |>
  summarise(
    med_arr_delay = median(arr_delay, na.rm = TRUE),
    med_dep_delay = median(dep_delay, na.rm = TRUE)
  )
```
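To see how the two measures of center can disagree, here is a small sketch on a made-up vector that contains one outlier:

```{r}
x <- c(1, 2, 3, 4, 100)  # made-up values; 100 is an outlier

mean(x)    # sum(x) / length(x), pulled upwards by the outlier
median(x)  # middle value, unaffected by how extreme the outlier is
```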
Don't forget what you learned in Section \@ref(sample-size): whenever creating numerical summaries, it's a good idea to include the number of observations in each group.
The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust alternatives to the standard deviation, `sd(x)`, that may be more useful if you have outliers.
`IQR(x)` is `quantile(x, 0.75) - quantile(x, 0.25)`, the range that contains the middle 50% of the data.
`mad()` is derived similarly to `sd()`, but instead of being based on the average of the squared distances from the mean, it's based on the median of the absolute differences from the median.
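A rough sketch comparing the three measures of spread on a made-up vector with one outlier.
Note that `mad()` by default multiplies the result by a constant (about 1.4826) so that it is comparable to `sd()` for normally distributed data; setting `constant = 1` gives the raw median of the absolute differences:

```{r}
x <- c(1, 2, 3, 4, 100)  # made-up values; 100 is an outlier

sd(x)   # inflated by the outlier
IQR(x)  # width of the middle 50% of the data
unname(quantile(x, 0.75) - quantile(x, 0.25))

# Raw median absolute deviation, computed two ways.
mad(x, constant = 1)
median(abs(x - median(x)))
```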
As the names suggest, the summary functions are typically paired with `summarise()`, but they can also be usefully paired with `mutate()`, particularly when you want to do some sort of group standardization.
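As one possible sketch of group standardization, the code below computes a z-score within each group.
The tibble `df` and its columns `g`, `x`, and `z` are made up for illustration, and dplyr is assumed to be installed:

```{r}
library(dplyr)

# Made-up data: two groups with different centers and spreads.
df <- tibble(
  g = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 10, 20, 30)
)

# mean() and sd() each return one value per group, which mutate()
# recycles to every row of that group.
standardized <- df |>
  group_by(g) |>
  mutate(z = (x - mean(x)) / sd(x)) |>
  ungroup()

standardized
```

After standardizing, the two groups are on the same scale even though their raw values differ by a factor of ten.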
- Summary functions take a vector and always return a length 1 vector. Typically used with `summarise()`.
- Cumulative functions take a vector and return a vector of the same length. Used with `mutate()`.
- Paired functions take a pair of vectors and return a vector of the same length (using the recycling rules if the vectors aren't the same length). Used with `mutate()`.
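
The three variants can be compared side by side on made-up vectors; this sketch uses base R's sum and minimum families as examples:

```{r}
x <- c(3, 1, 2)  # made-up values
y <- c(2, 2, 2)

# Summary: one value for the whole vector.
sum(x)
min(x)

# Cumulative: one value per element, using everything seen so far.
cumsum(x)
cummin(x)

# Paired: one value per element, combining x and y position by position.
x + y
pmin(x, y)
```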