# Numeric vectors {#numbers}

```{r, results = "asis", echo = FALSE}
status("polishing")
```

## Introduction

In this chapter, you'll learn useful tools for creating and manipulating numeric vectors.
We'll start by going into a little more detail about `count()` before diving into various numeric transformations.
You'll then learn about more general transformations that can be applied to other types of vector, but are often used with numeric vectors.
Then you'll learn about a few more useful summaries before we finish up with a comparison of function variants that have similar names and similar actions, but are each designed for a specific use case.

### Prerequisites

This chapter mostly uses functions from base R, which are available without loading any packages.
But we still need the tidyverse because we'll use these base R functions inside of tidyverse functions like `mutate()` and `filter()`.
Like in the last chapter, we'll use real examples from nycflights13, as well as toy examples made with `c()` and `tribble()`.

```{r setup, message = FALSE}
library(tidyverse)
library(nycflights13)
```

## Counts

It's surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with `count()`.
This function is great for quick exploration and checks during analysis:

```{r}
flights |> count(dest)
```

(Despite the advice in Chapter \@ref(code-style), I usually put `count()` on a single line because I'm usually using it at the console for a quick check that my calculation is working as expected.)

If you want to see the most common values, add `sort = TRUE`:

```{r}
flights |> count(dest, sort = TRUE)
```

And remember that if you want to see all the values, you can use `|> View()` or `|> print(n = Inf)`.

You can perform the same computation "by hand" with `group_by()`, `summarise()` and `n()`.
This is useful because it allows you to compute other summaries at the same time:

```{r}
flights |>
  group_by(dest) |>
  summarise(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  )
```

`n()` is a special summary function that doesn't take any arguments and instead accesses information about the "current" group.
This means that it only works inside dplyr verbs:

```{r, error = TRUE}
n()
```

There are a couple of variants of `n()` that you might find useful:

- `n_distinct(x)` counts the number of distinct (unique) values of one or more variables.
  For example, we could figure out which destinations are served by the most carriers:

  ```{r}
  flights |>
    group_by(dest) |>
    summarise(
      carriers = n_distinct(carrier)
    ) |>
    arrange(desc(carriers))
  ```

- A weighted count is a sum.
  For example, you could "count" the number of miles each plane flew:

  ```{r}
  flights |>
    group_by(tailnum) |>
    summarise(miles = sum(distance))
  ```

  Weighted counts are a common problem, so `count()` has a `wt` argument that does the same thing:

  ```{r}
  flights |> count(tailnum, wt = distance)
  ```

- You can count missing values by combining `sum()` and `is.na()`.
  In the flights dataset this represents flights that are cancelled:

  ```{r}
  flights |>
    group_by(dest) |>
    summarise(n_cancelled = sum(is.na(dep_time)))
  ```

### Exercises

1.  How can you use `count()` to count the number of rows with a missing value for a given variable?
2.  Expand the following calls to `count()` to instead use `group_by()`, `summarise()`, and `arrange()`:
    1.  `flights |> count(dest, sort = TRUE)`
    2.  `flights |> count(tailnum, wt = distance)`

## Numeric transformations

Transformation functions work well with `mutate()` because their output is the same length as the input.
The vast majority of transformation functions are already built into base R.
It's impractical to list them all so this section will show the most useful.
As an example, while R provides all the trigonometric functions that you might dream of, I don't list them here because they're rarely needed for data science.

### Arithmetic and recycling rules

We introduced the basics of arithmetic (`+`, `-`, `*`, `/`, `^`) in Chapter \@ref(workflow-basics) and have used them a bunch since.
These functions don't need a huge amount of explanation because they do what you learned in grade school.
But we need to briefly talk about the **recycling rules** which determine what happens when the left and right hand sides have different lengths.
This is important for operations like `flights |> mutate(air_time = air_time / 60)` because there are 336,776 numbers on the left of `/` but only one on the right.

R handles mismatched lengths by **recycling**, or repeating, the short vector.
We can see this in operation more easily if we create some vectors outside of a data frame:

```{r}
x <- c(1, 2, 10, 20)
x / 5
# is shorthand for
x / c(5, 5, 5, 5)
```

Generally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector.
It'll usually (but not always) warn you if the length of the longer vector isn't a multiple of the length of the shorter:

```{r}
x * c(1, 2)
x * c(1, 2, 3)
```

These recycling rules are also applied to logical comparisons (`==`, `<`, `<=`, `>`, `>=`, `!=`) and can lead to a surprising result if you accidentally use `==` instead of `%in%` and the data frame has an unfortunate number of rows.
For example, take this code which attempts to find all flights in January and February:

```{r}
flights |>
  filter(month == c(1, 2))
```

The code runs without error, but it doesn't return what you want.
Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February.
And unfortunately there's no warning because `flights` has an even number of rows.

To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values.
Unfortunately that doesn't help here, or in many other cases, because the key computation is performed by the base R function `==`, not `filter()`.

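To actually find all flights in January and February, use `%in%`, which tests whether each value on the left appears anywhere in the vector on the right, so no recycling is involved:

```{r}
flights |>
  filter(month %in% c(1, 2))
```
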
### Minimum and maximum

The arithmetic functions work with pairs of variables.
Two closely related functions are `pmin()` and `pmax()`, which when given two or more variables will return the smallest or largest value in each row:

```{r}
df <- tribble(
  ~x, ~y,
  1, 3,
  5, 2,
  7, NA,
)

df |>
  mutate(
    min = pmin(x, y, na.rm = TRUE),
    max = pmax(x, y, na.rm = TRUE)
  )
```

These are different to the summary functions `min()` and `max()` which take multiple observations and return a single value.
We'll come back to those in Section \@ref(min-max-summary).

### Modular arithmetic

Modular arithmetic is the technical name for the type of math you did before you learned about real numbers, i.e. division that yields a whole number and a remainder.
In R, `%/%` does integer division and `%%` computes the remainder:

```{r}
1:10 %/% 3
1:10 %% 3
```

Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:

```{r}
flights |>
  mutate(
    hour = sched_dep_time %/% 100,
    minute = sched_dep_time %% 100,
    .keep = "used"
  )
```

We can combine that with the `mean(is.na(x))` trick from Section \@ref(logical-summaries) to see how the proportion of cancelled flights varies over the course of the day.
The results are shown in Figure \@ref(fig:prop-cancelled).

```{r prop-cancelled}
#| fig.cap: >
#|   A line plot with scheduled departure hour on the x-axis, and proportion
#|   of cancelled flights on the y-axis. Cancellations seem to accumulate
#|   over the course of the day until 8pm; very late flights are much
#|   less likely to be cancelled.
#| fig.alt: >
#|   A line plot showing how proportion of cancelled flights changes over
#|   the course of the day. The proportion starts low at around 0.5% at
#|   6am, then steadily increases over the course of the day until peaking
#|   at 4% at 7pm. The proportion of cancelled flights then drops rapidly
#|   getting down to around 1% by midnight.
flights |>
  group_by(hour = sched_dep_time %/% 100) |>
  summarise(prop_cancelled = mean(is.na(dep_time)), n = n()) |>
  filter(hour > 1) |>
  ggplot(aes(hour, prop_cancelled)) +
  geom_line(colour = "grey50") +
  geom_point(aes(size = n))
```

### Logarithms

Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
They also convert exponential growth to linear growth.
For example, take compounding interest --- the amount of money you have at `year + 1` is the amount of money you had at `year` multiplied by the interest rate.
That gives a formula like `money = starting * interest ^ year`:

```{r}
starting <- 100
interest <- 1.05

money <- tibble(
  year = 2000 + 1:50,
  money = starting * interest^(1:50)
)
```

If you plot this data, you'll get an exponential curve:

```{r}
ggplot(money, aes(year, money)) +
  geom_line()
```

Log transforming the y-axis gives a straight line:

```{r}
ggplot(money, aes(year, money)) +
  geom_line() +
  scale_y_log10()
```

This is a straight line because a little algebra reveals that `log(money) = log(starting) + year * log(interest)`, which matches the pattern for a line, `y = m * x + b`.
This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.

If you're log-transforming your data with dplyr you have a choice of three logarithms provided by base R: `log()` (the natural log, base e), `log2()` (base 2), and `log10()` (base 10).
I recommend using `log2()` or `log10()`.
`log2()` is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g.) 3 on the log scale is 10\^3 = 1000 on the original scale.

The inverse of `log()` is `exp()`; to compute the inverse of `log2()` or `log10()` you'll need to use `2^` or `10^`.

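For example, applying a log and then its inverse recovers the original value:

```{r}
log10(1000)
10^log10(1000)
log2(16)
2^log2(16)
```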
### Rounding

Use `round(x)` to round a number to the nearest integer:

```{r}
round(123.456)
```

You can control the precision of the rounding with the second argument, `digits`.
`round(x, digits)` rounds to the nearest `10^-digits`, so `digits = 2` will round to the nearest 0.01.
This definition is useful because it implies `round(x, -3)` will round to the nearest thousand, which indeed it does:

```{r}
round(123.456, 2)  # two digits
round(123.456, 1)  # one digit
round(123.456, -1) # round to nearest ten
round(123.456, -2) # round to nearest hundred
```

There's one weirdness with `round()` that seems surprising at first glance:

```{r}
round(c(1.5, 2.5))
```

`round()` uses what's known as "round half to even" or Banker's rounding: if a number is halfway between two integers, it will be rounded to the **even** integer.
This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.

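You can see the pattern more clearly by rounding a longer run of halves; each one goes to the nearest even integer:

```{r}
round(c(0.5, 1.5, 2.5, 3.5, 4.5))
```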
`round()` is paired with `floor()` which always rounds down and `ceiling()` which always rounds up:

```{r}
x <- 123.456

floor(x)
ceiling(x)
```

These functions don't have a `digits` argument, so you can instead scale down, round, and then scale back up:

```{r}
# Round down to nearest two digits
floor(x / 0.01) * 0.01
# Round up to nearest two digits
ceiling(x / 0.01) * 0.01
```

You can use the same technique if you want to `round()` to a multiple of some other number:

```{r}
# Round to nearest multiple of 4
round(x / 4) * 4

# Round to nearest 0.25
round(x / 0.25) * 0.25
```

### Cumulative and rolling aggregates

Base R provides `cumsum()`, `cumprod()`, `cummin()`, `cummax()` for running, or cumulative, sums, products, mins and maxes.
dplyr provides `cummean()` for cumulative means.
Cumulative sums tend to come up the most in practice:

```{r}
x <- 1:10
cumsum(x)
# dplyr's cummean() gives the running mean
cummean(x)
```

If you need more complex rolling or sliding aggregates, try the [slider](https://davisvaughan.github.io/slider/) package by Davis Vaughan.
The following example illustrates some of its features.

```{r}
library(slider)

# Same as a cumulative sum
slide_vec(x, sum, .before = Inf)
# Sum the current element and the one before it
slide_vec(x, sum, .before = 1)
# Sum the current element and the two before and after it
slide_vec(x, sum, .before = 2, .after = 2)
# Only compute if the window is complete
slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
```

### Exercises

1.  Explain in words what each line of the code used to generate Figure \@ref(fig:prop-cancelled) does.

## General transformations

The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.

### Missing values {#missing-values-numbers}

You can fill in missing values with dplyr's `coalesce()`:

```{r}
x <- c(1, NA, 5, NA, 10)
coalesce(x, 0)
```

`coalesce()` is vectorised, so you can find the non-missing values from a pair of vectors:

```{r}
y <- c(2, 3, 4, NA, 5)
coalesce(x, y)
```

### Ranks

dplyr provides a number of ranking functions inspired by SQL, but you should always start with `dplyr::min_rank()`.
It uses the typical method for dealing with ties, e.g. 1st, 2nd, 2nd, 4th.

```{r}
x <- c(1, 2, 2, 3, 4, NA)
min_rank(x)
```

Note that the smallest values get the lowest ranks; use `desc(x)` to give the largest values the smallest ranks:

```{r}
min_rank(desc(x))
```

If `min_rank()` doesn't do what you need, look at the variants `dplyr::row_number()`, `dplyr::dense_rank()`, `dplyr::percent_rank()`, and `dplyr::cume_dist()`.
See the documentation for details.

```{r}
df <- data.frame(x = x)
df |> mutate(
  row_number = row_number(x),
  dense_rank = dense_rank(x),
  percent_rank = percent_rank(x),
  cume_dist = cume_dist(x)
)
```

You can achieve many of the same results by picking the appropriate `ties.method` argument to base R's `rank()`; you'll probably also want to set `na.last = "keep"` to keep `NA`s as `NA`.

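For example, this base R call reproduces `min_rank()`:

```{r}
x <- c(1, 2, 2, 3, 4, NA)
rank(x, ties.method = "min", na.last = "keep")
```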
`row_number()` can also be used without a variable when you're inside a dplyr verb, in which case it'll give the number of the "current" row.
When combined with `%%` and `%/%` this can be a useful tool for dividing data into similarly sized groups:

```{r}
flights |>
  mutate(
    row = row_number(),
    three_groups = (row - 1) %% 3,
    three_in_each_group = (row - 1) %/% 3,
    .keep = "none"
  )
```

### Offsets

`dplyr::lead()` and `dplyr::lag()` allow you to refer to the values just before or just after the "current" value.
They return a vector of the same length, padded with `NA`s at the start or end.

```{r}
x <- c(2, 5, 11, 11, 19, 35)
lag(x)
lag(x, 2)
lead(x)
```

- `x - lag(x)` gives you the difference between the current and previous value.

  ```{r}
  x - lag(x)
  ```

- `x == lag(x)` tells you when the current value changes.
  This is often useful combined with the cumulative tricks described in Section \@ref(cumulative-tricks).

  ```{r}
  x == lag(x)
  ```

### Exercises

1.  Find the 10 most delayed flights using a ranking function.
    How do you want to handle ties?
    Carefully read the documentation for `min_rank()`.

2.  Which plane (`tailnum`) has the worst on-time record?

3.  What time of day should you fly if you want to avoid delays as much as possible?

4.  What does `flights |> group_by(dest) |> filter(row_number() < 4)` do?
    What does `flights |> group_by(dest) |> filter(row_number(dep_delay) < 4)` do?

5.  For each destination, compute the total minutes of delay.
    For each flight, compute the proportion of the total delay for its destination.

6.  Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
    Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.

7.  Look at each destination.
    Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)?
    Compute the air time of a flight relative to the shortest flight to that destination.
    Which flights were most delayed in the air?

8.  Find all destinations that are flown by at least two carriers.
    Use that information to rank the carriers.

## Summaries

Just using means, counts, and sums can get you a long way, but R provides many other useful summary functions.
Here is a selection that you might find useful.

### Center

We've mostly used `mean(x)` so far, but `median(x)` is also useful.
The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
This makes it more robust to unusual values.

```{r}
flights |>
  group_by(month) |>
  summarise(
    med_arr_delay = median(arr_delay, na.rm = TRUE),
    med_dep_delay = median(dep_delay, na.rm = TRUE)
  )
```

Don't forget what you learned in Section \@ref(sample-size): whenever creating numerical summaries, it's a good idea to include the number of observations in each group.

### Minimum, maximum, and quantiles {#min-max-summary}

Quantiles are a generalization of the median.
For example, `quantile(x, 0.25)` will find a value of `x` that is greater than 25% of the values, and less than the remaining 75%.
`min()` and `max()` are like the 0% and 100% quantiles: they're the smallest and biggest numbers.

```{r}
# When do the first and last flights leave each day?
flights |>
  group_by(year, month, day) |>
  summarise(
    first = min(dep_time, na.rm = TRUE),
    last = max(dep_time, na.rm = TRUE)
  )
```

Using the median and the 95% quantile is common in performance monitoring.
`median()` shows you what the (bare) majority of people experience, and the 95% quantile shows you the worst case, excluding the most extreme 5%.

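For example, here's a sketch of that style of summary applied to departure delays (the column choices are illustrative):

```{r}
flights |>
  group_by(dest) |>
  summarise(
    median_delay = median(dep_delay, na.rm = TRUE),
    delay_p95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    n = n()
  )
```
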
### Spread

The root mean squared deviation, or standard deviation, `sd(x)`, is the standard measure of spread.

```{r}
# Why is distance to some destinations more variable than to others?
flights |>
  group_by(origin, dest) |>
  summarise(distance_sd = sd(distance), n = n()) |>
  filter(distance_sd > 0)

# Did it move?
flights |>
  filter(dest == "EGE") |>
  select(time_hour, dest, distance, origin) |>
  ggplot(aes(time_hour, distance, colour = origin)) +
  geom_point()
```

<https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport> --- a seasonal airport.
Nothing on Wikipedia suggests a move in 2013.

The interquartile range `IQR(x)` and median absolute deviation `mad(x)` are robust equivalents that may be more useful if you have outliers.
The IQR is `quantile(x, 0.75) - quantile(x, 0.25)`.
`mad()` is derived similarly to `sd()`, but instead of being the average of the squared distances from the mean, it's the median of the absolute differences from the median.

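A quick check of that definition, which also shows why the IQR is robust: the outlier of 100 doesn't affect it at all:

```{r}
x <- c(1, 3, 5, 7, 100)
IQR(x)
quantile(x, 0.75) - quantile(x, 0.25)
```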
### Positions

Base R provides a powerful tool for extracting subsets of vectors called `[`.
This book doesn't cover `[` until Section \@ref(vector-subsetting) so for now we'll introduce three specialized functions that are useful inside of `summarise()` if you want to extract values at a specified position: `first()`, `last()`, and `nth()`.

For example, we can find the first and last departure for each day:

```{r}
flights |>
  group_by(year, month, day) |>
  summarise(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )
```

Compared to `[`, these functions allow you to set a `default` value for when the requested position doesn't exist (e.g. you're trying to get the 3rd element from a group that only has two elements), and they provide an `order_by` argument.

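For example, here's a sketch asking for the third departure of each day, where groups with fewer than three departures get the `default` instead:

```{r}
flights |>
  group_by(year, month, day) |>
  summarise(third_dep = nth(dep_time, 3, default = NA))
```
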
Extracting values at positions is complementary to filtering on ranks.
Filtering gives you all variables, with each observation in a separate row:

```{r}
flights |>
  group_by(year, month, day) |>
  mutate(r = min_rank(desc(sched_dep_time))) |>
  filter(r %in% c(1, max(r)))
```

### With `mutate()`

As the names suggest, the summary functions are typically paired with `summarise()`, but they can also be usefully paired with `mutate()`, particularly when you want to do some sort of group standardization.

- `x / sum(x)` calculates the proportion of a total.
- `(x - mean(x)) / sd(x)` computes a Z-score (standardised to mean 0 and sd 1).
- `x / x[1]` computes an index based on the first observation.

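For example, here's a sketch of the Z-score pattern applied per destination:

```{r}
flights |>
  group_by(dest) |>
  mutate(
    delay_z = (dep_delay - mean(dep_delay, na.rm = TRUE)) / sd(dep_delay, na.rm = TRUE),
    .keep = "used"
  )
```
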
### Exercises

1.  Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they're not really continuous numbers.
    Convert them to a more convenient representation of number of minutes since midnight.

2.  What trigonometric functions does R provide?

3.  Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
    Consider the following scenarios:

    - A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
    - A flight is always 10 minutes late.
    - A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
    - 99% of the time a flight is on time.
      1% of the time it's 2 hours late.

    Which is more important: arrival delay or departure delay?

## Variants

We've seen a few variants of related functions:

| Summary | Cumulative | Paired |
|---------|------------|--------|
| `sum`   | `cumsum`   | `+`    |
| `prod`  | `cumprod`  | `*`    |
| `all`   | `cumall`   | `&`    |
| `any`   | `cumany`   | `\|`   |
| `min`   | `cummin`   | `pmin` |
| `max`   | `cummax`   | `pmax` |

- Summary functions take a vector and always return a vector of length 1. Typically used with `summarise()`.
- Cumulative functions take a vector and return a vector of the same length. Used with `mutate()`.
- Paired functions take a pair of vectors and return a vector of the same length (using the recycling rules if the vectors aren't the same length). Used with `mutate()`.

```{r}
x <- c(1, 2, 3, 5)
sum(x)
cumsum(x)
x + 10
```