New chapter on important base R functions (#1113)

This commit is contained in:
Hadley Wickham 2022-11-04 10:29:04 -05:00 committed by GitHub
parent 07aaa45d01
commit a586ec7ea8
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
7 changed files with 543 additions and 810 deletions


@ -74,9 +74,8 @@ book:
- part: program.qmd
chapters:
- functions.qmd
- vectors.qmd
- tibble.qmd
- iteration.qmd
- base-R.qmd
- part: communicate.qmd
chapters:

537
base-R.qmd Normal file

@ -0,0 +1,537 @@
# A field guide to base R
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
To finish off the programming section, we're going to give you a quick tour of the most important base R functions that we don't otherwise discuss in the book.
These tools are particularly useful as you do more programming and will help you read code that you'll encounter in the wild.
This is a good place to remind you that the tidyverse is not the only way to solve data science problems.
We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use.
It's not possible to use the tidyverse without using base R, so we've actually already taught you a **lot** of base R functions: from `library()` to load packages, to `sum()` and `mean()` for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like `+`, `-`, `/`, `*`, `|`, `&`, and `!`.
What we haven't focused on so far is base R workflows, so we will highlight a few of those in this chapter.
After you read this book you'll learn other approaches to the same problems using base R, data.table, and other packages.
You'll certainly encounter these other approaches when you start reading R code written by other people, particularly if you're using StackOverflow.
It's 100% okay to write code that uses a mix of approaches, and don't let anyone tell you otherwise!
In this chapter, we'll focus on four big topics: subsetting with `[`, subsetting with `[[` and `$`, the apply family of functions, and for loops.
To finish off, we'll briefly discuss two important plotting functions.
### Prerequisites
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
## Selecting multiple elements with `[`
`[` is used to extract sub-components from vectors and data frames, and is called like `x[i]` or `x[i, j]`.
In this section, we'll introduce you to the power of `[`, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames.
We'll then help you cement that knowledge by showing how various dplyr verbs are special cases of `[`.
### Subsetting vectors
There are five main types of things that you can subset a vector with, i.e. that can be the `i` in `x[i]`:
1. **A vector of positive integers**.
Subsetting with positive integers keeps the elements at those positions:
```{r}
x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
```
By repeating a position, you can actually make a longer output than input, making the term "subsetting" a bit of a misnomer.
```{r}
x[c(1, 1, 5, 5, 5, 2)]
```
2. **A vector of negative integers**.
Negative values drop the elements at the specified positions:
```{r}
x[c(-1, -3, -5)]
```
3. **A logical vector**.
Subsetting with a logical vector keeps all values corresponding to a `TRUE` value.
This is most often useful in conjunction with the comparison functions.
```{r}
x <- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
x[!is.na(x)]
# All even (or missing!) values of x
x[x %% 2 == 0]
```
Note that, unlike `filter()`, `NA` indices will be included in the output as `NA`s.
4. **A character vector**.
If you have a named vector, you can subset it with a character vector:
```{r}
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
```
As with subsetting with positive integers, you can use a character vector to duplicate individual entries.
5. **Nothing**.
The final type of subsetting is nothing, `x[]`, which returns the complete `x`.
This is not useful for subsetting vectors, but as we'll see shortly, it is useful when subsetting 2d structures like tibbles.
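As a small sketch of the last two cases (the vector `named` is invented here for illustration), repeating a name duplicates the matching element, and an empty subset returns everything:

```{r}
named <- c(abc = 1, def = 2, xyz = 5)
# Repeating a name repeats the corresponding element
named[c("abc", "abc", "xyz")]
# An empty subset returns every element, names included
named[]
```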
### Subsetting data frames
There are quite a few different ways[^base-r-1] that you can use `[` with a data frame, but the most important way is to select rows and columns independently with `df[rows, cols]`. Here `rows` and `cols` are vectors as described above.
For example, `df[rows, ]` and `df[, cols]` select just rows or just columns, using the empty subset to preserve the other dimension.
[^base-r-1]: Read <https://adv-r.hadley.nz/subsetting.html#subset-multiple> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.
Here are a couple of examples:
```{r}
df <- tibble(
x = 1:3,
y = c("a", "e", "f"),
z = runif(3)
)
# Select first row and second column
df[1, 2]
# Select all rows and columns x and y
df[, c("x", "y")]
# Select rows where `x` is greater than 1 and all columns
df[df$x > 1, ]
```
We'll come back to `$` shortly, but you should be able to guess what `df$x` does from the context: it extracts the `x` variable from `df`.
We need to use it here because `[` doesn't use tidy evaluation, so you need to be explicit about the source of the `x` variable.
There's an important difference between tibbles and data frames when it comes to `[`.
In this book we've mostly used tibbles, which *are* data frames, but they tweak some older behaviors to make your life a little easier.
In most places, you can use tibbles and data frames interchangeably, so when we want to draw particular attention to R's built-in data frame, we'll write `data.frame`s.
So if `df` is a `data.frame`, then `df[, cols]` will return a vector if `cols` selects a single column and a data frame if it selects more than one column.
If `df` is a tibble, then `[` will always return a tibble.
```{r}
df1 <- data.frame(x = 1:3)
df1[, "x"]
df2 <- tibble(x = 1:3)
df2[, "x"]
```
One way to avoid this ambiguity with `data.frame`s is to explicitly specify `drop = FALSE`:
```{r}
df1[, "x", drop = FALSE]
```
### dplyr equivalents
A number of dplyr verbs are special cases of `[`:
- `filter()` is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:
```{r}
#| results: false
df <- tibble(
x = c(2, 3, 1, 1, NA),
y = letters[1:5],
z = runif(5)
)
df |> filter(x > 1)
# same as
df[!is.na(df$x) & df$x > 1, ]
```
Another common technique in the wild is to use `which()` for its side-effect of dropping missing values: `df[which(df$x > 1), ]`.
- `arrange()` is equivalent to subsetting the rows with an integer vector, usually created with `order()`:
```{r}
#| results: false
df |> arrange(x, y)
# same as
df[order(df$x, df$y), ]
```
You can use `order(decreasing = TRUE)` to sort all columns in descending order or `-rank(col)` to sort individual columns in decreasing order.
- Both `select()` and `relocate()` are similar to subsetting the columns with a character vector:
```{r}
#| results: false
df |> select(x, z)
# same as
df[, c("x", "z")]
```
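The `which()` idiom and the descending sort mentioned in the list above can be sketched as follows. The data frame `scores` is a made-up example, and it's a base `data.frame` so the chunk doesn't depend on any packages:

```{r}
scores <- data.frame(x = c(2, 3, 1, 1, NA), y = letters[1:5])
# A bare logical index keeps the NA position as a row of NAs
scores[scores$x > 1, ]
# which() returns only the positions that are TRUE, so the NA is dropped
scores[which(scores$x > 1), ]
# Sort by x in descending order; missing values go last by default
scores[order(scores$x, decreasing = TRUE), ]
```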
Base R also provides a function that combines the features of `filter()` and `select()`[^base-r-2] called `subset()`:
[^base-r-2]: But it doesn't handle grouped data frames differently and it doesn't support selection helper functions like `starts_with()`.
```{r}
df |>
filter(x > 1) |>
select(y, z)
# same as
df |> subset(x > 1, c(y, z))
```
This function was the inspiration for much of dplyr's syntax.
### Exercises
1. Create functions that take a vector as input and return:
a. The elements at even numbered positions.
b. Every element except the last value.
c. Only even values (and no missing values).
2. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
Read the documentation for `which()` and do some experiments to figure it out.
## Selecting a single element with `$` and `[[`
`[`, which selects many elements, is paired with `[[` and `$`, which extract a single element.
In this section, we'll show you how to use `[[` and `$` to pull columns out of data frames, discuss a couple more differences between `data.frame`s and tibbles, and emphasize some important differences between `[` and `[[` when used with lists.
### Data frames
`[[` and `$` can be used like `pull()` to extract columns out of a data frame.
`[[` can access by position or by name, and `$` is specialized for access by name:
```{r}
tb <- tibble(
x = 1:4,
y = c(10, 4, 1, 21)
)
# by position
tb[[1]]
# by name
tb[["x"]]
tb$x
```
They can also be used to create new columns, the base R equivalent of `mutate()`:
```{r}
tb$z <- tb$x + tb$y
tb
```
There are a number of other base approaches to creating new columns, including `transform()`, `with()`, and `within()`.
Hadley collected a few examples at <https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf>.
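To give a flavor of those three functions, here is a minimal sketch on a made-up data frame `dat` (this is our own illustration, not taken from the gist):

```{r}
dat <- data.frame(x = 1:4, y = c(10, 4, 1, 21))
# transform() returns a modified copy with the new column added
transform(dat, z = x + y)
# within() runs the assignment inside the data frame and returns a copy
within(dat, z <- x + y)
# with() evaluates an expression using the columns and returns its value,
# which you can then assign yourself
dat$z <- with(dat, x + y)
dat
```

Note that `transform()` and `within()` leave `dat` untouched unless you assign their result.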
Using `$` directly is convenient when performing quick summaries.
For example, if you just want to find the size of the biggest diamond or the possible values of `cut`, there's no need to use `summarise()`:
```{r}
max(diamonds$carat)
levels(diamonds$cut)
```
### Tibbles
There are a couple of important differences between tibbles and base `data.frame`s when it comes to `$`.
Data frames match the prefix of any variable names (so-called **partial matching**) and don't complain if a column doesn't exist:
```{r}
df <- data.frame(x1 = 1)
df$x
df$z
```
Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn't exist:
```{r}
tb <- tibble(x1 = 1)
tb$x
tb$z
```
For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
### Lists
`[[` and `$` are also really important for working with lists, and it's important to understand how they differ from `[`.
Let's illustrate the differences with a list named `l`:
```{r}
l <- list(
a = 1:3,
b = "a string",
c = pi,
d = list(-1, -5)
)
```
- `[` extracts a sub-list.
It doesn't matter how many elements you extract, the result will always be a list.
```{r}
str(l[1:2])
str(l[4])
```
Like with vectors, you can subset with a logical, integer, or character vector.
- `[[` and `$` extract a single component from a list.
They remove a level of hierarchy from the list.
```{r}
str(l[[1]])
str(l[[4]])
str(l$a)
```
The difference between `[` and `[[` is particularly important for lists because `[[` drills down into the list while `[` returns a new, smaller list.
To help you remember the difference, take a look at the unusual pepper shaker shown in @fig-pepper-1.
If this pepper shaker is your list `pepper`, then `pepper[1]` is a pepper shaker containing a single pepper packet, as in @fig-pepper-2.
`pepper[2]` would look the same, but would contain the second packet.
`pepper[1:2]` would be a pepper shaker containing two pepper packets.
`pepper[[1]]` would extract the pepper packet itself, as in @fig-pepper-3.
```{r}
#| label: fig-pepper-1
#| echo: false
#| out-width: "25%"
#| fig-cap: >
#| A pepper shaker that Hadley once found in his hotel room.
#| fig-alt: >
#| A photo of a glass pepper shaker. Instead of the pepper shaker
#| containing pepper, it contains many packets of pepper.
knitr::include_graphics("images/pepper.jpg")
```
```{r}
#| label: fig-pepper-2
#| echo: false
#| out-width: "25%"
#| fig-cap: >
#| `pepper[1]`
#| fig-alt: >
#| A photo of the glass pepper shaker containing just one packet of
#| pepper.
knitr::include_graphics("images/pepper-1.jpg")
```
```{r}
#| label: fig-pepper-3
#| echo: false
#| out-width: "25%"
#| fig-cap: >
#| `pepper[[1]]`
#| fig-alt: A photo of a single packet of pepper.
knitr::include_graphics("images/pepper-2.jpg")
```
This same principle applies when you use 1d `[` with a data frame:
```{r}
df <- tibble(x = 1:3, y = 3:5)
# returns a one-column data frame
df["x"]
# returns the contents of x
df[["x"]]
```
### Exercises
1. What happens when you use `[[` with a positive integer that's bigger than the length of the vector?
What happens when you subset with a name that doesn't exist?
2. What would `pepper[[1]][1]` be?
What about `pepper[[1]][[1]]`?
## Apply family
In @sec-iteration, you learned tidyverse techniques for iteration like `dplyr::across()` and the map family of functions.
In this section, you'll learn about their base equivalents, the **apply family**.
In this context "apply" and "map" are synonyms because another way of saying "map a function over each element of a vector" is "apply a function over each element of a vector".
Here we'll give you a quick overview of this family so you can recognize them in the wild.
The most important member of this family is `lapply()`, which is very similar to `purrr::map()`[^base-r-3].
In fact, because we haven't used any of `map()`'s more advanced features, you can replace every `map()` call in @sec-iteration with `lapply()`.
[^base-r-3]: It just lacks convenient features like progress bars and reporting which element caused the problem if there's an error.
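As a minimal sketch (the list `values` is invented for illustration), `lapply()` calls a function on each element and always returns a list:

```{r}
values <- list(a = 1:3, b = c(10, 20), c = 5)
# One call to mean() per element; names are carried over to the result
means <- lapply(values, mean)
str(means)
# purrr::map(values, mean) would produce the same list
```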
There's no exact base R equivalent to `across()` but you can get close by using `[` with `lapply()`.
This works because under the hood, data frames are lists of columns, so calling `lapply()` on a data frame applies the function to each column.
```{r}
df <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
# First find numeric columns
num_cols <- sapply(df, is.numeric)
num_cols
# Then transform each column with lapply() then replace the original values
df[, num_cols] <- lapply(df[, num_cols, drop = FALSE], \(x) x * 2)
df
```
The code above uses a new function, `sapply()`.
It's similar to `lapply()` but it always tries to simplify the result, hence the `s` in its name, here producing a logical vector instead of a list.
We don't recommend using it for programming, because the simplification can fail and give you an unexpected type, but it's usually fine for interactive use.
purrr has a similar function called `map_vec()` that we didn't mention in @sec-iteration.
Base R provides a stricter version of `sapply()` called `vapply()`, short for **v**ector apply.
It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input.
For example, we could replace the `sapply()` call above with this `vapply()` where we specify that we expect `is.numeric()` to return a logical vector of length 1:
```{r}
vapply(df, is.numeric, logical(1))
```
The distinction between `sapply()` and `vapply()` is really important when they're inside a function (because it makes a big difference to the function's robustness to unusual inputs), but it doesn't usually matter in data analysis.
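To see why this matters inside functions, here is a small sketch with made-up inputs. The output type of `sapply()` depends on what it is given, while `vapply()` either returns the declared type or fails:

```{r}
# sapply() changes its output type depending on the input
sapply(list(1:3, 4:6), range)  # a matrix, because each result has length 2
sapply(list(), length)         # an empty list, not an integer vector
# vapply() returns the declared type, even for empty input
vapply(list(), length, integer(1))
# and it signals an error if a result doesn't match the template
try(vapply(list(1:3, "a"), \(x) x[[1]], numeric(1)))
```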
Another important member of the apply family is `tapply()` which computes a single grouped summary:
```{r}
diamonds |>
group_by(cut) |>
summarise(price = mean(price))
tapply(diamonds$price, diamonds$cut, mean)
```
Unfortunately `tapply()` returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it's certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work).
If you want to see how you might use `tapply()` or other base techniques to perform other grouped summaries, Hadley has collected a few techniques [in a gist](https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec).
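The "gymnastics" might look something like this. This is a rough sketch on a made-up data frame `sales`, not one of the techniques from the gist:

```{r}
sales <- data.frame(
  g = c("a", "b", "a", "b", "c"),
  v = c(1, 2, 3, 4, 5)
)
group_means <- tapply(sales$v, sales$g, mean)
group_means
# Rebuild a data frame from the names and the values
data.frame(g = names(group_means), v = as.vector(group_means))
# aggregate() is another base option that returns a data frame directly
aggregate(v ~ g, data = sales, FUN = mean)
```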
The final member of the apply family is the titular `apply()`, which works with matrices and arrays.
In particular, watch out for `apply(df, 2, something)`, which is a slow and potentially dangerous way of doing `lapply(df, something)`.
This rarely comes up in data science because we usually work with data frames and not matrices.
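One way to see the danger, sketched with a made-up data frame `mixed`: `apply()` converts its input to a matrix before doing anything else, so a single character column turns every column into character.

```{r}
mixed <- data.frame(a = 1:2, b = c("x", "y"))
# apply() works on as.matrix(mixed), which is a character matrix
apply(mixed, 2, class)
# lapply() works column by column, so each column keeps its own type
lapply(mixed, class)
```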
## For loops
For loops are the fundamental building block of iteration that both the apply and map families use under the hood.
For loops are a powerful and general tool that is important to learn as you become a more experienced R programmer.
The basic structure of a for loop looks like this:
```{r}
#| eval: false
for (element in vector) {
# do something with element
}
```
The most straightforward use of `for` loops is to achieve the same effect as `walk()`: call some function with a side-effect on each element of a list.
For example, in @sec-save-database instead of using `walk()`:
```{r}
#| eval: false
paths |> walk(append_file)
```
We could have used a for loop:
```{r}
#| eval: false
for (path in paths) {
append_file(path)
}
```
Things get a little trickier if you want to save the output of the for loop, for example reading all of the Excel files in a directory like we did in @sec-iteration:
```{r}
paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
files <- map(paths, readxl::read_excel)
```
There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront.
In this case, we're going to want a list the same length as `paths`, which we can create with `vector()`:
```{r}
files <- vector("list", length(paths))
```
Then instead of iterating over the elements of `paths`, we'll iterate over their indices, using `seq_along()` to generate one index for each element of `paths`:
```{r}
seq_along(paths)
```
Using the indices is important because it allows us to link each position in the input with the corresponding position in the output:
```{r}
for (i in seq_along(paths)) {
files[[i]] <- readxl::read_excel(paths[[i]])
}
```
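The same pre-allocate-and-fill pattern works without any files. The following sketch uses a made-up numeric vector `xs`, and also shows why `seq_along()` is usually preferred over `1:length(x)`:

```{r}
xs <- c(1, 4, 9)
# Pre-allocate the output, then fill it position by position
roots <- vector("double", length(xs))
for (i in seq_along(xs)) {
  roots[[i]] <- sqrt(xs[[i]])
}
roots
# For empty input, seq_along() gives a zero-length sequence, so a loop
# over it never runs; 1:length() counts down from 1 to 0 instead
seq_along(c())
1:length(c())
```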
To combine the list of tibbles into a single tibble you can use `do.call()` + `rbind()`:
```{r}
do.call(rbind, files)
```
Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:
```{r}
out <- NULL
for (path in paths) {
out <- rbind(out, readxl::read_excel(path))
}
```
We recommend avoiding this pattern because it can become very slow when the vector is very long.
This is the source of the persistent canard that `for` loops are slow: they're not, but iteratively growing a vector is.
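Both strategies give the same answer; the difference is only in how much copying happens along the way. Here is a sketch with a made-up list of small data frames, `pieces`:

```{r}
pieces <- list(
  data.frame(id = 1:2, value = c(0.5, 1.5)),
  data.frame(id = 3:4, value = c(2.5, 3.5))
)
# Growing: each rbind() copies everything accumulated so far
grown <- NULL
for (piece in pieces) {
  grown <- rbind(grown, piece)
}
# Collecting first, then combining once at the end
combined <- do.call(rbind, pieces)
all.equal(grown, combined)
```

With two pieces the cost is negligible, but with many pieces the repeated copying in the first approach can add up.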
## Plots
Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look.
However, base R plotting functions can still be useful because they're so concise --- it's very little typing to do a basic exploratory plot.
There are two main types of base plot you'll see in the wild: scatterplots and histograms, produced with `plot()` and `hist()` respectively.
Here's a quick example from the diamonds dataset:
```{r}
#| dev: png
hist(diamonds$carat)
plot(diamonds$carat, diamonds$price)
```
Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using `$` or some other technique.
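As a sketch of one such "other technique", `with()` lets you refer to columns by name inside the plotting call. We use the built-in `mtcars` data here so that the chunk needs no packages:

```{r}
# Pull a column out with $
hist(mtcars$mpg)
# Or let with() look up the columns for you
with(mtcars, plot(wt, mpg))
```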
## Summary
In this chapter, we've shown you a selection of base R functions useful for subsetting and iteration.
Compared to approaches discussed elsewhere in the book, these functions tend to have more of a "vector" flavor than a "data frame" flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification.
This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.
This chapter concludes the programming section of the book.
You've made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can *program* in R.
We hope these chapters have sparked your interest in programming and that you're looking forward to learning more outside of this book.


@ -47,8 +47,7 @@ flights
If you've used R before, you might notice that this data frame prints a little differently to other data frames you've seen.
That's because it's a **tibble**, a special type of data frame used by the tidyverse to avoid some common gotchas.
The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
To see everything, use `View(flights)` to open the dataset in the RStudio viewer.
We'll come back to other important differences in @sec-tibbles.
To see everything you can use `print(flights, width = Inf)` to show everything in the console, but it's generally more convenient to instead use `View(flights)` to open the dataset in the scrollable RStudio viewer.
You might have noticed the short abbreviations that follow each column name.
These tell you the type of each variable: `<int>` is short for integer, `<dbl>` is short for double (aka real numbers), `<chr>` for character (aka strings), and `<dttm>` for date-time.


@ -255,14 +255,13 @@ This makes the plot easier to read because the colors of the line at the far rig
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
#| fig-alt:
#| - >
#| fig-alt: >
#| A line plot with age on the x-axis and proportion on the y-axis.
#| There is one line for each category of marital status: no answer,
#| never married, separated, divorced, widowed, and married. It is
#| a little hard to read the plot because the order of the legend is
#| unrelated to the lines on the plot.
#| - >
#|
#| Rearranging the legend makes the plot easier to read because the
#| legend colours now match the order of the lines on the far right
#| of the plot. You can see some unsurprising patterns: the proportion


@ -19,7 +19,6 @@ You've already learned a number of special purpose tools for iteration:
Now it's time to learn some more general tools.
Tools for iteration can quickly become very abstract, but in this chapter we'll keep things concrete to make it as easy as possible to learn the basics.
We're going to focus on three related tools for three related tasks: modifying multiple columns, reading multiple files, and saving multiple objects.
We'll conclude with a brief discussion of `for`-loops, an important iteration technique that we deliberately don't cover here, and provide a few pointers for learning more.
### Prerequisites
@ -938,24 +937,11 @@ unlink(paths)
1. Imagine you have a table of student data containing (amongst other variables) `school_name` and `student_id`. Sketch out what code you'd write if you want to save all the information for each student in a file called `{student_id}.csv` in the `{school}` directory.
## For loops
Before we finish up this chapter, we have a duty to mention another important technique for iteration in R, the `for` loop.
`for` loops are a powerful and general tool that you definitely need to learn as you become a more experienced R programmer.
But we skip them here because, as you've seen, you can solve a whole bunch of useful problems just by learning `across()`, `map()`, and `walk2()`.
If you'd like to learn more about for loops, <https://adv-r.hadley.nz/control-flow.html#loops> is one place to start.
Some people will tell you to avoid `for` loops because they are slow.
They're wrong!
(Well at least they're rather out of date, as `for` loops haven't been slow for many years.) The chief benefit of using functions like `map()` is not speed, but clarity: once you've mastered the basic idea, they make your code easier to write and to read.
## Summary
In this chapter you learned iteration tools to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs.
But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing any number of problems.
Once you've mastered the techniques in this chapter, we highly recommend learning more by reading <https://purrr.tidyverse.org> and the [Functionals chapter](https://adv-r.hadley.nz/functionals.html) of *Advanced R*.
This chapter concludes the programming section of the book.
You've now learned the basics of programming in R.
You know now the data types that underpin all of the objects you work with, and have two powerful techniques (functions and iteration) for reducing the duplication in your code.
We hope you've got a taste for how programming can help your analyses, and you've made a solid start on your journey to become not just a data scientist who uses R, but a data scientist who can program in R.
If you know much about iteration in other languages you might be surprised that we didn't discuss the `for` loop.
That comes up in the next chapter where we'll discuss some important base R functions that we don't otherwise use in the book but are important to know about.


@ -1,174 +0,0 @@
# Tibbles {#sec-tibbles}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
## Introduction
Throughout this book we work with "tibbles" instead of R's traditional `data.frame`.
Tibbles *are* data frames, but they tweak some older behaviors to make your life a little easier.
R is an old language, and some things that were useful 10 or 20 years ago now get in your way.
It's difficult to change base R without breaking existing code, so most innovation occurs in packages.
Here we will describe the **tibble** package, which provides opinionated data frames that make working in the tidyverse a little easier.
In most places, we use the term tibble and data frame interchangeably; when we want to draw particular attention to R's built-in data frame, we'll call them `data.frame`s.
### Prerequisites
In this chapter we'll explore the **tibble** package, part of the core tidyverse.
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
## Tibbles vs. data.frame
There are two main differences in the usage of a tibble vs. a classic `data.frame`: printing and subsetting.
If these differences cause problems when working with older packages, you can turn a tibble back to a regular data frame with `as.data.frame()`.
### Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
This makes it much easier to work with large data.
In addition to its name, each column reports its type, a nice feature inspired by `str()`:
```{r}
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
```
Where possible, tibbles also use color to draw your eye to important differences.
One of the most important distinctions is between the string `"NA"` and the missing value, `NA`:
```{r}
tibble(x = c("NA", NA))
```
Tibbles are designed to avoid overwhelming your console when you print large data frames.
But sometimes you need more output than the default display.
There are a few options that can help.
First, you can explicitly `print()` the data frame and control the number of rows (`n`) and the `width` of the display.
`width = Inf` will display all columns:
```{r}
library(nycflights13)
flights |>
print(n = 10, width = Inf)
```
You can also control the default print behavior by setting options:
- `options(tibble.print_max = n, tibble.print_min = m)`: if more than `n` rows, print only `m` rows.
Use `options(tibble.print_min = Inf)` to always show all rows.
- Use `options(tibble.width = Inf)` to always print all columns, regardless of the width of the screen.
You can see a complete list of options by looking at the package help with `package?tibble`.
A final option is to use RStudio's built-in data viewer to get a scrollable view of the complete dataset.
This is also often useful at the end of a long chain of manipulations.
```{r}
#| eval: false
flights |> View()
```
### Extracting variables
So far all the tools you've learned have worked with complete data frames.
If you want to pull out a single variable, you can use `dplyr::pull()`:
```{r}
tb <- tibble(
id = LETTERS[1:5],
x1 = 1:5,
y1 = 6:10
)
tb |> pull(x1) # by name
tb |> pull(1) # by position
```
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector, which you'll learn about in @sec-vectors.
```{r}
tb |> pull(x1, name = id)
```
You can also use the base R tools `$` and `[[`.
`[[` can extract by name or position; `$` only extracts by name but is a little less typing.
```{r}
# Extract by name
tb$x1
tb[["x1"]]
# Extract by position
tb[[1]]
```
Compared to a `data.frame`, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
```{r}
# Tibbles complain a lot:
tb$x
tb$z
# Data frames use partial matching and don't complain if a column doesn't exist
df <- as.data.frame(tb)
df$x
df$z
```
For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
### Subsetting
Lastly, there are some important differences when using `[`.
With `data.frame`s, `[` sometimes returns a `data.frame`, and sometimes returns a vector, which is a common source of bugs.
With tibbles, `[` always returns another tibble.
This can sometimes cause problems when working with older code.
If you hit one of those functions, just use `as.data.frame()` to turn your tibble back to a `data.frame`.
### Exercises
1. How can you tell if an object is a tibble?
(Hint: try printing `mtcars`, which is a regular `data.frame`).
2. Compare and contrast the following operations on a `data.frame` and equivalent tibble.
What is different?
Why might the default `data.frame` behaviors cause you frustration?
```{r}
#| eval: false
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
df[, c("abc", "xyz")]
```
3. If you have the name of a variable stored in an object, e.g. `var <- "mpg"`, how can you extract the reference variable from a tibble?
4. What does `tibble::enframe()` do?
When might you use it?
5. What option controls how many additional column names are printed at the footer of a tibble?
## Summary
If this chapter leaves you wanting to learn more about tibbles, you might enjoy `vignette("tibble")`.


@ -1,613 +0,0 @@
# Vectors {#sec-vectors}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
## Introduction
So far we've talked about individual data types like numbers, strings, factors, tibbles and more.
Now it's time to learn more about how they fit together into a holistic structure.
This offers relatively little immediate benefit, but it's a necessary foundation for building your programming knowledge.
In this chapter we'll explore the **vector** data type, the type that underlies pretty much all objects that we use to store data in R.
### Prerequisites
The focus of this chapter is on base R data structures, so it isn't essential to load any packages.
We will, however, use a handful of functions from the **purrr** package to avoid some inconsistencies in base R.
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
## Vectors
There are two fundamental types of vectors:
1. **Atomic** vectors, of which there are six types: **logical**, **integer**, **double**, **character**, **complex**, and **raw**.
Integer and double vectors are collectively known as **numeric** vectors.
Raw and complex are rarely used during data analysis, so we won't discuss them here.
2. **Lists**, which are sometimes called recursive vectors because lists can contain other lists.
The chief difference between atomic vectors and lists is that atomic vectors are **homogeneous** (every element is the same type), while lists can be **heterogeneous** (every element can be a different type).
@fig-datatypes summarizes the interrelationships.
```{r}
#| label: fig-datatypes
#| echo: false
#| out-width: ~
#| fig-cap: >
#| The hierarchy of R's vector types.
#| fig-alt: >
#| A diagram that uses nested sets to show how R's vector types
#| are related. There are two types at the top level: vectors and
#| NULL. Inside vectors there are two types: atomic and list.
#| Inside atomic there are three types: logical, numeric, and
#| character. Inside numeric there are two types: integer, and
#| double.
knitr::include_graphics("diagrams/data-structures.png", dpi = 270)
```
### Properties
Every vector has two key properties:
1. Its **type**, which is one of logical, integer, double, character, list, etc.
You can determine this with `typeof()`.
```{r}
typeof(letters)
typeof(1:10)
typeof(2.5)
```
Sometimes you want to do different things based on the type of vector.
One option is to use `typeof()`.
Another is to use a test function which returns a `TRUE` or `FALSE`.
Base R provides many functions like `is.vector()` and `is.atomic()`, but they often return surprising results.
Instead, it's safer to use the `is_*` functions provided by purrr, which correspond exactly to @fig-datatypes.
2. Its **length**, which you can determine with `length()`.
```{r}
x <- list("a", "b", 1:10)
length(x)
```
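To illustrate why the base R test functions can be surprising, here is a small sketch that uses only base R (the object names `f` and `l` are ours, chosen for illustration). It compares `typeof()` with `is.vector()` and `is.atomic()` on a factor and a list.

```{r}
f <- factor(c("a", "b"))
l <- list(1, "a")

typeof(f)     # "integer": factors are stored as integers
is.atomic(f)  # TRUE
is.vector(f)  # FALSE: the object has attributes other than names
is.vector(l)  # TRUE: lists count as vectors here
is.atomic(l)  # FALSE
```

Note that `is.vector()` is really a test about attributes, not about whether the object is a vector in the sense used in this chapter.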
Vectors can also contain arbitrary additional metadata in the form of attributes.
These attributes are used to create **S3 vectors** which build on additional behavior.
You've seen three S3 vectors in this book: factors, dates, and date-times.
We'll come back to those in @sec-s3-vectors.
### Atomic vectors
While technically speaking there are six types of atomic vector, in practice we only worry about three: logical vectors, numeric vectors, and character vectors.
- Logical vectors were the subject of @sec-logicals. They're the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`.
- Numeric vectors were the subject of @sec-numbers. Numeric vectors can either be integers or doubles. We lump them together in this book because there are few important differences when doing data analysis. The one important difference was discussed in @sec-fp-comparison: doubles are fundamentally approximations because they are floating point numbers that cannot always be precisely represented with a fixed amount of memory.
- Character vectors were the subject of @sec-strings. They're the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain any amount of data.
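The difference between integers and doubles, and the approximate nature of doubles, can be seen in a short base R sketch. The variable name `root_two_squared` is ours; `all.equal()` is one way to compare doubles with a tolerance.

```{r}
typeof(1L)  # "integer"
typeof(1)   # "double"

# sqrt(2) can't be stored exactly, so squaring it doesn't give exactly 2
root_two_squared <- sqrt(2)^2
root_two_squared == 2                   # FALSE
isTRUE(all.equal(root_two_squared, 2))  # TRUE: equal within a small tolerance
```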
### Lists {#sec-lists}
Lists are a step up in complexity from atomic vectors, because lists can contain other lists.
This makes them suitable for representing hierarchical or tree-like structures, as you saw in @sec-rectangling.
You create a list with `list()`.
Unlike atomic vectors, lists can contain a mix of objects:
```{r}
y <- list("a", 1L, 1.5, TRUE)
str(y)
```
Lists can even contain other lists!
```{r}
z <- list(list(1, 2), list(3, 4))
str(z)
```
### Missing values and `NULL`
Note that each type of atomic vector has its own missing value:
```{r}
NA # logical
NA_integer_ # integer
NA_real_ # double
NA_character_ # character
```
This is usually unimportant because `NA` will almost always be automatically converted to the correct type.
There's one other related object: `NULL`.
`NULL` is often used to represent the absence of a vector (as opposed to `NA` which is used to represent the absence of a value in a vector).
`NULL` typically behaves like a vector of length 0.
`NULL` is sort of the equivalent of a missing value inside a list.
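A brief base R sketch of these points: `NA` takes on the type of the vector it lives in, while `NULL` behaves like a zero-length vector and disappears when combined with `c()`, but is kept as an element by `list()`.

```{r}
typeof(NA)          # "logical"
typeof(c(1.5, NA))  # "double": NA is converted to NA_real_
typeof(c("a", NA))  # "character": NA is converted to NA_character_

length(NULL)              # 0
length(c(1, NULL, 2))     # 2: NULL vanishes inside c()
length(list(1, NULL, 2))  # 3: a list can hold NULL as an element
```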
### Names
All types of vectors can be named.
You can name them during creation with `c()` or `list()`:
```{r}
x <- c(x = 1, y = 2, z = 4)
x
```
It's important to notice this display, because it can be surprising at first.
`str()` is always a great tool to check the object is structured as you expect.
```{r}
str(x)
```
You can also name a vector after the fact with `purrr::set_names()`:
```{r}
x <- list(1, 2, 3)
x |>
set_names(c("a", "b", "c")) |>
str()
```
You can also pass `set_names()` a function.
This is particularly useful if you have a character vector.
And we'll see an important use for it in @sec-data-in-the-path.
```{r}
x <- c("a", "b", "c")
x |> set_names(str_to_upper)
```
Named vectors are most useful for subsetting, described in @sec-vector-subsetting.
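As a preview of why names help with subsetting, here is a minimal sketch of a named vector used as a lookup table. The names `fruit_lookup` and `codes` are made up for this example.

```{r}
fruit_lookup <- c(a = "apple", b = "banana", c = "cherry")
codes <- c("b", "a", "b")

# Subsetting by name translates each code to its value
fruit_lookup[codes]

# Drop the names if you only want the values
unname(fruit_lookup[codes])
```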
### Coercion
There are two ways to convert, or coerce, one type of vector to another:
1. Explicit coercion happens when you call a function like `as.logical()`, `as.integer()`, `as.double()`, or `as.character()`.
Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place.
For example, you may need to tweak your readr `col_types` specification.
2. Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector.
For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.
Explicit coercion is used relatively rarely and is largely easy to understand, so we'll focus on implicit coercion here.
Just beware of using the explicit coercion functions on lists; if you need to get a list into a simple vector, put it inside a data frame and use the tools from @sec-rectangling.
```{r}
as.character(list(1, 2, 3))
as.character(list(1, list(2, list(3))))
```
You've already seen the most important type of implicit coercion: using a logical vector in a numeric context.
In this case `TRUE` is converted to `1` and `FALSE` converted to `0`.
That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:
```{r}
x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y) # how many are greater than 10?
mean(y) # what proportion are greater than 10?
```
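It's also important to understand what happens when you try to create a vector containing multiple types with `c()`: the result is coerced to the most flexible type, following the order logical \< integer \< double \< character \< list.
This is generally rather too flexible, because `c()` converts silently instead of signaling a possible mistake.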
```{r}
typeof(c(TRUE, 1L))
typeof(c(1L, 1.5))
typeof(c(1.5, "a"))
```
### Exercises
1. Carefully read the documentation of `is.vector()`.
What does it actually test for?
Why does `is.atomic()` not agree with the definition of atomic vectors above?
2. Describe the difference between `is.finite(x)` and `!is.infinite(x)`.
3. A logical vector can take 3 possible values.
How many possible values can an integer vector take?
How many possible values can a double take?
Use Google to do some research.
4. Brainstorm at least four functions that allow you to convert a double to an integer.
How do they differ?
Be precise.
5. What functions from the readr package allow you to turn a string into logical, integer, and double vector?
6. Compare and contrast `setNames()` with `purrr::set_names()`.
7. Draw the following lists as nested sets:
a. `list(a, b, list(c, d), list(e, f))`
b. `list(list(list(list(list(list(a))))))`
## Subsetting {#sec-vector-subsetting}
There are three subsetting tools in base R: `[`, `[[`, and `$`.
`[` selects a vector; `[[` selects a single value; and `$` selects a single value by name.
We'll see how they apply to atomic vectors and lists, and then how they combine to provide an alternative to `filter()` and `select()` for working with data frames.
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists and vectors.
For example, take these three lists:
```{r}
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```
We'll draw them as follows:
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("diagrams/lists-structure.png")
```
There are three principles:
1. Lists have rounded corners.
Atomic vectors have square corners.
2. Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.
3. The orientation of the children (i.e. rows or columns) isn't important, so we'll pick a row or column orientation to either save space or illustrate an important property in the example.
To learn more about the applications of subsetting, read the "Subsetting" chapter of *Advanced R*: <http://adv-r.had.co.nz/Subsetting.html#applications>.
### Atomic vectors
`[` is the subsetting function, and is called like `x[a]`.
There are four types of things that you can subset a vector with:
1. A numeric vector containing only integers.
The integers must either be all positive, all negative, or zero.
Subsetting with positive integers keeps the elements at those positions:
```{r}
x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
```
By repeating a position, you can actually make a longer output than input.
(This makes subsetting a bit of a misnomer).
```{r}
x[c(1, 1, 5, 5, 5, 2)]
```
Negative values drop the elements at the specified positions:
```{r}
x[c(-1, -3, -5)]
```
It's an error to mix positive and negative values:
```{r}
#| error: true
x[c(1, -1)]
```
The error message mentions subsetting with zero, which returns no values:
```{r}
x[0]
```
This is not useful very often, but it can be helpful if you want to create unusual data structures to test your functions with.
2. Subsetting with a logical vector keeps all values corresponding to a `TRUE` value.
This is most often useful in conjunction with the comparison functions.
```{r}
x <- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
x[!is.na(x)]
# All even (or missing!) values of x
x[x %% 2 == 0]
```
3. If you have a named vector, you can subset it with a character vector:
```{r}
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
```
Like with positive integers, you can also use a character vector to duplicate individual entries.
4. The simplest type of subsetting is nothing, `x[]`, which returns the complete `x`.
This is not useful for subsetting vectors, but as we'll see shortly, it is useful when subsetting 2d structures like tibbles.
There is an important variation of `[` called `[[`.
`[[` only ever extracts a single element, and always drops names.
It's a good idea to use it whenever you want to make it clear that you're extracting a single item, as in a for loop.
The distinction between `[` and `[[` is most important for lists, as we'll see shortly.
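A small base R sketch of the difference on an atomic vector (the vector `scores` is invented for illustration): `[` keeps the name, `[[` drops it, and `[[` refuses to extract more than one element.

```{r}
scores <- c(math = 90, art = 75)

scores[1]           # a named vector of length 1
scores[[1]]         # just the value, without the name
names(scores[1])    # "math"
names(scores[[1]])  # NULL

# Asking `[[` for two elements is an error; try() captures it here
inherits(try(scores[[1:2]], silent = TRUE), "try-error")
```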
### Lists
There are three ways to subset a list, which we'll illustrate with a list named `a`:
```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```
- `[` extracts a sub-list.
The result will always be a list.
```{r}
str(a[1:2])
str(a[4])
```
Like with vectors, you can subset with a logical, integer, or character vector.
- `[[` extracts a single component from a list.
It removes a level of hierarchy from the list.
```{r}
str(a[[1]])
str(a[[4]])
```
- `$` is a shorthand for extracting named elements of a list.
It works similarly to `[[` except that you don't need to use quotes.
```{r}
a$a
a[["a"]]
```
The distinction between `[` and `[[` is really important for lists, because `[[` drills down into the list while `[` returns a new, smaller list.
Compare the code and output above with the visual representation in @fig-lists-subsetting.
```{r}
#| label: fig-lists-subsetting
#| echo: false
#| out-width: "75%"
#| fig-cap: >
#| Subsetting a list, visually.
knitr::include_graphics("diagrams/lists-subsetting.png")
```
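The "drilling down" behavior can also be checked in code. This sketch recreates the list `a` from above so that it is self-contained, then contrasts `[` with `[[` on the nested element `d`.

```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

str(a["d"])    # a list of length 1 that contains the list d
str(a[["d"]])  # the list d itself, of length 2

# Chain `[[` (or `$`) to reach values inside the nested list
a[["d"]][[1]]
a$d[[2]]
```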
The difference between `[` and `[[` is very important, but it's easy to get confused.
To help you remember, let me show you an unusual pepper shaker in @fig-pepper-1.
If this pepper shaker is your list `pepper`, then, `pepper[1]` is a pepper shaker containing a single pepper packet, as in @fig-pepper-2.
`pepper[2]` would look the same, but would contain the second packet.
`pepper[1:2]` would be a pepper shaker containing two pepper packets.
`pepper[[1]]` would extract the pepper packet itself, as in @fig-pepper-3.
```{r}
#| label: fig-pepper-1
#| echo: false
#| out-width: "25%"
#| fig-cap: >
#| A pepper shaker that Hadley once found in his hotel room.
#| fig-alt: >
#| A photo of a glass pepper shaker. Instead of the pepper shaker
#| containing pepper, it contains many packets of pepper.
knitr::include_graphics("images/pepper.jpg")
```
```{r}
#| label: fig-pepper-2
#| echo: false
#| out-width: "25%"
#| fig-cap: >
#| `pepper[1]`
#| fig-alt: >
#| A photo of the glass pepper shaker containing just one packet of
#| pepper.
knitr::include_graphics("images/pepper-1.jpg")
```
```{r}
#| label: fig-pepper-3
#| echo: false
#| out-width: "25%"
#| fig-cap: >
#| `pepper[[1]]`
#| fig-alt: A single packet of pepper.
knitr::include_graphics("images/pepper-2.jpg")
```
### Data frames
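Subsetting a data frame with a single index (1d subsetting) behaves like subsetting a list of columns.
Subsetting with two indices (2d subsetting), as in `df[rows, cols]`, behaves like a combination of subsetting rows and columns.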
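Here is a minimal sketch of both styles using a base `data.frame` (the data frame `df_demo` is made up for illustration). Tibbles follow the same general pattern, though some details of their behavior differ from plain data frames.

```{r}
df_demo <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))

# 1d subsetting treats the data frame as a list of columns
df_demo["x"]    # a data frame with one column
df_demo[["x"]]  # the column itself, an integer vector
df_demo$y       # shorthand for df_demo[["y"]]

# 2d subsetting: rows first, then columns
df_demo[df_demo$x > 2, ]  # select rows, a bit like filter()
df_demo[, c("y", "x")]    # select and reorder columns, a bit like select()
```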
### Exercises
1. Create functions that take a vector as input and return:
a. The last value. Should you use `[` or `[[`?
b. The elements at even numbered positions.
c. Every element except the last value.
d. Only even numbers (and no missing values).
2. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
3. What happens when you subset with a positive integer that's bigger than the length of the vector?
What happens when you subset with a name that doesn't exist?
4. What happens if you subset a tibble as if you're subsetting a list?
What are the key differences between a list and a tibble?
## Attributes and S3 vectors {#sec-s3-vectors}
Any vector can contain arbitrary additional metadata through its **attributes**.
You can think of attributes as a named list of vectors that can be attached to any object.
You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
```{r}
x <- 1:10
attr(x, "greeting")
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)
```
There are three very important attributes that are used to implement fundamental parts of R:
1. **Names** are used to name the elements of a vector.
2. **Dimensions** (dims, for short) make a vector behave like a matrix or array.
3. **Class** is used to implement the S3 object oriented system.
You've seen names above, and we won't cover dimensions because we don't use matrices in this book.
Class is what turns a plain vector into an S3 vector; the three S3 vectors you've seen in this book are built as follows:
- Factors (`factor`) are built on top of integer vectors.
- Dates (`Date`) are built on top of double vectors.
- Date-times (`POSIXct`) are built on top of double vectors.
### Class
It remains to describe the class, which controls how **generic functions** work.
Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input.
A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it in *Advanced R* at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
Here's what a typical generic function looks like:
```{r}
as.Date
```
The call to `UseMethod()` means that this is a generic function, and it will call a specific **method**, a function, based on the class of the first argument.
(All methods are functions; not all functions are methods).
You can list all the methods for a generic with `methods()`:
```{r}
methods("as.Date")
```
For example, if `x` is a character vector, `as.Date()` will call `as.Date.character()`; if it's a factor, it'll call `as.Date.factor()`.
You can see the specific implementation of a method with `getS3method()`:
```{r}
getS3method("as.Date", "default")
getS3method("as.Date", "numeric")
```
The most important S3 generic is `print()`: it controls how the object is printed when you type its name at the console.
Other important generics are the subsetting functions `[`, `[[`, and `$`.
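To make the relationship between generics and methods concrete, here is a minimal sketch that defines a generic from scratch. The generic `describe()` and its methods are hypothetical, invented for this example; they are not part of base R or any package.

```{r}
# A hypothetical generic: UseMethod() dispatches on the class of `x`
describe <- function(x, ...) {
  UseMethod("describe")
}

# Fallback method, used when no class-specific method exists
describe.default <- function(x, ...) {
  paste(typeof(x), "vector of length", length(x))
}

# Method used when `x` has class "factor"
describe.factor <- function(x, ...) {
  paste("factor with", nlevels(x), "levels")
}

describe(1:3)                   # dispatches to describe.default()
describe(factor(c("a", "b")))   # dispatches to describe.factor()
```

Methods are named `generic.class`, which is why `as.Date.character()` is the method that `as.Date()` calls for character vectors.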
### Factors
Factors are designed to represent categorical data that can take a fixed set of possible values.
Factors are built on top of integers, and have a levels attribute:
```{r}
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
typeof(x)
attributes(x)
```
### Dates and date-times
Dates in R are numeric vectors that represent the number of days since 1 January 1970.
```{r}
x <- as.Date("1971-01-01")
unclass(x)
typeof(x)
attributes(x)
```
Date-times are numeric vectors with class `POSIXct` that represent the number of seconds since 1 January 1970.
(In case you were wondering, "POSIXct" stands for "Portable Operating System Interface", calendar time.)
```{r}
x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
typeof(x)
attributes(x)
```
The `tzone` attribute is optional.
It controls how the time is printed, not what absolute time it refers to.
```{r}
attr(x, "tzone") <- "US/Pacific"
x
attr(x, "tzone") <- "US/Eastern"
x
```
There is another type of date-time called POSIXlt.
These are built on top of named lists:
```{r}
y <- as.POSIXlt(x)
typeof(y)
attributes(y)
```
POSIXlts are rare inside the tidyverse.
They do crop up in base R, because they are needed to extract specific components of a date, like the year or month.
Since lubridate provides helpers for you to do this instead, you don't need them.
POSIXct values are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a regular date-time with `lubridate::as_datetime()`.
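The sketch below shows what the list structure of a POSIXlt looks like and how to convert it. To stay free of package dependencies it uses base R's `as.POSIXct()` for the conversion rather than lubridate; the variable names `lt` and `ct` are ours.

```{r}
lt <- as.POSIXlt("2020-03-15 12:30:00", tz = "UTC")
typeof(lt)      # "list": one component per date-time field
lt$year + 1900  # years are stored as an offset from 1900
lt$mon + 1      # months are stored starting from 0

# Convert to POSIXct, which is a double under the hood
ct <- as.POSIXct(lt)
typeof(ct)  # "double"
class(ct)   # "POSIXct" "POSIXt"
```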
## Other types
### Tibbles
Tibbles are augmented lists: they have class "tbl_df" + "tbl" + "data.frame", and `names` (column) and `row.names` attributes:
```{r}
tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
attributes(tb)
```
The difference between a tibble and a list is that all the elements of a data frame must be vectors with the same length.
All functions that work with tibbles enforce this constraint.
Traditional `data.frame`s have a very similar structure:
```{r}
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
attributes(df)
```
The main difference is the class.
The class of tibble includes "data.frame" which means tibbles inherit the regular data frame behaviour by default.
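You can check the "list plus class" structure directly. This sketch uses a base `data.frame` named `df_plain` (a name chosen for illustration) so that it runs without any packages.

```{r}
df_plain <- data.frame(x = 1:5, y = 5:1)

is.list(df_plain)                 # TRUE: a data frame is a list underneath
length(df_plain)                  # 2: one element per column
inherits(df_plain, "data.frame")  # TRUE

# Removing the class reveals the underlying named list
str(unclass(df_plain))
```

`inherits()` is the general way to ask whether an object has a given class anywhere in its class vector.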
### Exercises
1. What does `hms::hms(3600)` return?
How does it print?
What primitive type is the augmented vector built on top of?
What attributes does it use?
2. Try and make a tibble that has columns with different lengths.
What happens?