# Vector tools

## Introduction

`%in%`

## Counts

- Counts: You've seen `n()`, which takes no arguments and returns the size of the current group. To count the number of non-missing values, use `sum(!is.na(x))`. To count the number of distinct (unique) values, use `n_distinct(x)`.

  ```{r}
  # Which destinations have the most carriers?
  not_cancelled %>%
    group_by(dest) %>%
    summarise(carriers = n_distinct(carrier)) %>%
    arrange(desc(carriers))
  ```

  Counts are so useful that dplyr provides a simple helper if all you want is a count:

  ```{r}
  not_cancelled %>%
    count(dest)
  ```

  Just like with `group_by()`, you can also provide multiple variables to `count()`.

  ```{r}
  not_cancelled %>%
    count(carrier, dest)
  ```

  You can optionally provide a weight variable. For example, you could use this to "count" (sum) the total number of miles a plane flew:

  ```{r}
  not_cancelled %>%
    count(tailnum, wt = distance)
  ```

## Window functions

- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`). They are most useful in conjunction with `group_by()`, which you'll learn about shortly.

  ```{r}
  (x <- 1:10)
  lag(x)
  lead(x)
  ```

- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). By default, the smallest values get the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.

  ```{r}
  y <- c(1, 2, 2, NA, 3, 4)
  min_rank(y)
  min_rank(desc(y))
  ```

  If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`. See their help pages for more details.

  ```{r}
  row_number(y)
  dense_rank(y)
  percent_rank(y)
  cume_dist(y)
  ```

- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
  These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:

  ```{r}
  not_cancelled %>%
    group_by(year, month, day) %>%
    summarise(
      first_dep = first(dep_time),
      last_dep = last(dep_time)
    )
  ```

  These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:

  ```{r}
  not_cancelled %>%
    group_by(year, month, day) %>%
    mutate(r = min_rank(desc(dep_time))) %>%
    filter(r %in% range(r))
  ```
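
The default value that `first()`, `nth()`, and `last()` accept is easiest to see on a group that is too short. The sketch below is a minimal illustration, not part of the flights analysis: `flights_small` and its values are made up for the example, and the column names `third_dep` and `third_or0` are hypothetical.

```{r}
library(dplyr)

# Hypothetical toy data: day "a" has three departures, day "b" only two.
flights_small <- tibble(
  day      = c("a", "a", "a", "b", "b"),
  dep_time = c(517, 533, 542, 554, 555)
)

by_day <- flights_small %>%
  group_by(day) %>%
  summarise(
    first_dep = first(dep_time),
    third_dep = nth(dep_time, 3),               # missing when there is no 3rd value
    third_or0 = nth(dep_time, 3, default = 0),  # fall back to 0 instead
    last_dep  = last(dep_time)
  )

by_day
```

For day "b", `third_dep` is missing while `third_or0` takes the supplied default. Choose a default that cannot be mistaken for a real value in your data; `0` is used here only to keep the example short.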