To finish off the programming section, we're going to give you a quick tour of the most important base R functions that we don't otherwise discuss in the book.
We teach the tidyverse in this book because tidyverse packages share a common design philosophy, increasing the consistency across functions, and making each new function or package a little easier to learn and use.
It's not possible to use the tidyverse without using base R, so we've actually already taught you a **lot** of base R functions: from `library()` to load packages, to `sum()` and `mean()` for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like `+`, `-`, `/`, `*`, `|`, `&`, and `!`.
What we haven't focused on so far is base R workflows, so we will highlight a few of those in this chapter.
`[` is used to extract sub-components from vectors and data frames, and is called like `x[i]` or `x[i, j]`.
In this section, we'll introduce you to the power of `[`, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames.
We'll then help you cement that knowledge by showing how various dplyr verbs are special cases of `[`.
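As a quick refresher, here are the main ways you can use `[` with a vector (the vector `x` here is just an illustration):

```{r}
x <- c(10, 3, NA, 5)

x[c(1, 3)]    # select elements by (positive) position
x[-1]         # drop elements by negative position
x[!is.na(x)]  # keep elements where a logical vector is TRUE
```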
There are quite a few different ways[^base-r-1] that you can use `[` with a data frame, but the most important way is to select rows and columns independently with `df[rows, cols]`. Here `rows` and `cols` are vectors as described above.
For example, `df[rows, ]` and `df[, cols]` select just rows or just columns, using the empty subset to preserve the other dimension.
[^base-r-1]: Read <https://adv-r.hadley.nz/subsetting.html#subset-multiple> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.
Here are a couple of examples:
```{r}
library(tidyverse)

df <- tibble(
x = 1:3,
y = c("a", "e", "f"),
z = runif(3)
)
# Select first row and second column
df[1, 2]
# Select all rows and columns x and y
df[, c("x" , "y")]
# Select rows where `x` is greater than 1 and all columns
df[df$x > 1, ]
```
We'll come back to `$` shortly, but you should be able to guess what `df$x` does from the context: it extracts the `x` variable from `df`.
We need to use it here because `[` doesn't use tidy evaluation, so you need to be explicit about the source of the `x` variable.
There's an important difference between tibbles and data frames when it comes to `[`.
In most places, you can use "tibble" and "data frame" interchangeably, so when we want to draw particular attention to R's built-in data frame, we'll write `data.frame`.
If `df` is a `data.frame`, then `df[, cols]` will return a vector if `cols` selects a single column and a data frame if it selects more than one column.
In contrast, if `df` is a tibble, then `[` will always return a tibble.
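Here's a small sketch of that difference (the one-column data frame and tibble are just examples):

```{r}
df1 <- data.frame(x = 1:3)
df1[, "x"]  # a single column from a data.frame simplifies to a vector

tb1 <- tibble(x = 1:3)
tb1[, "x"]  # a tibble always stays a tibble
```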
In this section, we'll show you how to use `[[` and `$` to pull columns out of data frames, discuss a couple more differences between `data.frames` and tibbles, and emphasize some important differences between `[` and `[[` when used with lists.
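As a quick sketch before we get to those differences, here's how `[[` and `$` pull a single column out of the `df` tibble from above:

```{r}
df[["x"]]  # extract a column by name
df[[1]]    # or by position
df$x       # $ is a convenient shorthand for [[ with a name
```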
There are a couple of important differences between tibbles and base `data.frame`s when it comes to `$`.
Data frames match the prefix of any variable names (so-called **partial matching**) and don't complain if a column doesn't exist:
```{r}
df <- data.frame(x1 = 1)
df$x   # partial matching: silently returns the value of x1
df$z   # no column called z, so this silently returns NULL
```
Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn't exist:
```{r}
tb <- tibble(x1 = 1)
tb$x   # no exact match for x, so this returns NULL with a warning
tb$z   # same here: NULL, plus a warning about the unknown column
```
For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
Let's now turn to base R's **apply** family, the closest base equivalent to the map functions you learned about in @sec-iteration.
In this context, apply and map are synonyms, because another way of saying "map a function over each element of a vector" is "apply a function over each element of a vector".
Here we'll give you a quick overview of this family so you can recognize them in the wild.
The most important member of this family is `lapply()`, which is very similar to `purrr::map()`[^base-r-3].
In fact, because we haven't used any of `map()`'s more advanced features, you can replace every `map()` call in @sec-iteration with `lapply()`.
[^base-r-3]: It just lacks convenient features like progress bars and reporting which element caused the problem if there's an error.
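For example, this sketch shows `lapply()` and `map()` producing the same result (the list `x` is just an illustration):

```{r}
x <- list(1:3, 4:6)

lapply(x, mean)
map(x, mean)
```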
There's no exact base R equivalent to `across()` but you can get close by using `[` with `lapply()`.
This works because under the hood, data frames are lists of columns, so calling `lapply()` on a data frame applies the function to each column.
```{r}
df <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
# First find numeric columns
num_cols <- sapply(df, is.numeric)
num_cols
# Then transform each column with lapply() then replace the original values
df[, num_cols] <- lapply(df[, num_cols, drop = FALSE], \(x) x * 2)
df
```
The code above uses a new function, `sapply()`.
It's similar to `lapply()` but it always tries to simplify the result, hence the `s` in its name, here producing a logical vector instead of a list.
We don't recommend using it for programming, because the simplification can fail and give you an unexpected type, but it's usually fine for interactive use.
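Here's a small sketch of how that unpredictability can show up (the inputs are deliberately contrived):

```{r}
sapply(list(1:2, 3:4), \(x) x * 2)  # every result has length 2, so you get a matrix
sapply(list(1:2, 3), \(x) x * 2)    # lengths differ, so you get a list instead
```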
purrr has a similar function called `map_vec()` that we didn't mention in @sec-iteration.
Base R provides a stricter version of `sapply()` called `vapply()`, short for **v**ector apply.
It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input.
For example, we could replace the `sapply()` call above with this `vapply()` where we specify that we expect `is.numeric()` to return a logical vector of length 1:
```{r}
vapply(df, is.numeric, logical(1))
```
The distinction between `sapply()` and `vapply()` is really important when they're inside a function (because it makes a big difference to the function's robustness to unusual inputs), but it doesn't usually matter in data analysis.
Another important member of the apply family is `tapply()`, which computes a single grouped summary.
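For example, here's a quick sketch using ggplot2's `diamonds` data (loaded as part of the tidyverse) to compute the mean price for each cut:

```{r}
# Returns a named vector with one element per level of cut
tapply(diamonds$price, diamonds$cut, mean)
```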
Unfortunately, `tapply()` returns its results in a named vector, which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame.
(It's certainly possible to avoid this and just work with free-floating vectors, but in our experience that only delays the work.)
If you want to see how you might use `tapply()` or other base techniques to perform other grouped summaries, Hadley has collected a few techniques [in a gist](https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec).
The final member of the apply family is the titular `apply()`, which works with matrices and arrays.
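As a quick sketch, here's `apply()` computing row and column sums of a small example matrix:

```{r}
m <- matrix(1:6, nrow = 2)

apply(m, 1, sum)  # MARGIN = 1: apply sum() to each row
apply(m, 2, sum)  # MARGIN = 2: apply sum() to each column
```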
Another base tool for iteration is the humble `for` loop.
Its most straightforward use is to achieve the same effect as `walk()`: calling some function with a side-effect on each element of a list.
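For example, here's a minimal sketch that prints each element of a list, the `for` loop analogue of `walk(x, print)`:

```{r}
x <- list(1, "a", TRUE)

for (element in x) {
  print(element)
}
```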
Things get a little trickier if you want to save the output of the `for` loop, for example when reading all of the Excel files in a directory like we did in @sec-iteration.
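With purrr, that looked something like the following sketch (the directory here is just a placeholder, so point it at wherever your Excel files actually live):

```{r}
# "data/excel" is a placeholder path; dir() returns the matching file paths
paths <- dir("data/excel", pattern = "\\.xlsx$", full.names = TRUE)
files <- map(paths, readxl::read_excel)
```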
There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront.
In this case, we're going to want a list the same length as `paths`, which we can create with `vector()`:
```{r}
files <- vector("list", length(paths))
```
Then, instead of iterating over the elements of `paths`, we'll iterate over their indices, using `seq_along()` to generate one index for each element of `paths`.
A sketch of that loop, continuing from the placeholder `paths` above, looks like this:
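```{r}
# Use the index i to read path i and store the result in element i of files
for (i in seq_along(paths)) {
  files[[i]] <- readxl::read_excel(paths[[i]])
}
```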
Finally, a quick word about plotting.
Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look.
However, base R plotting functions can still be useful because they're so concise --- it takes very little typing to do a basic exploratory plot.
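For example, here are two classic exploratory plots, sketched with ggplot2's `diamonds` data:

```{r}
hist(diamonds$carat)                  # histogram of a single numeric vector
plot(diamonds$carat, diamonds$price)  # scatterplot of two numeric vectors
```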
In this chapter, we've shown you a selection of base R functions useful for subsetting and iteration.
Compared to approaches discussed elsewhere in the book, these functions have more of a "vector" flavor than a "data frame" flavor, because they typically take individual vectors rather than a data frame and some column specification.