Proofing vectors

This commit is contained in:
hadley 2016-08-18 16:12:42 -05:00
parent eecf629110
commit d50aa09b80
2 changed files with 115 additions and 104 deletions

Binary image file changed (7.1 KiB before, 75 KiB after).


@ -1,37 +1,51 @@
```{r include=FALSE, cache=FALSE}
set.seed(1014)
options(digits = 3)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
cache = TRUE,
out.width = "70%",
fig.align = 'center',
fig.width = 6,
fig.asp = 0.618, # 1 / phi
fig.show = "hold"
)
options(dplyr.print_min = 6, dplyr.print_max = 6)
```
# Vectors
## Introduction
So far this book has focussed on tibbles and packages that work with them. But as you start to write your own functions, and dig deeper into R, you need to learn about vectors, the objects that underlie tibbles. If you've learned R in a more traditional way, you're probably already familiar with vectors, as most R resources start with vectors and work their way up to tibbles. I think it's better to start with tibbles because they're immediately useful, and then work your way down to the underlying components.
Vectors are particularly important as most of the functions you will write will work with vectors. It is possible to write functions that work with tibbles (like ggplot2, dplyr, and tidyr), but the tools you need to write such functions are currently idiosyncratic and immature. I am working on a better approach, <https://github.com/hadley/lazyeval>, but it will not be ready in time for the publication of the book. Even when complete, you'll still need to understand vectors; it'll just make it easier to write a user-friendly layer on top.
### Prerequisites
The focus of this chapter is on base R data structures, so it isn't essential to load any packages. We will, however, use a handful of functions from the __purrr__ package to avoid some inconsistencies in base R.
```{r}
library(purrr)
```
## Vector basics
There are two types of vectors:
1. __Atomic__ vectors, of which there are six types:
__logical__, __integer__, __double__, __character__, __complex__, and
__raw__. Integer and double vectors are collectively known as
__numeric__ vectors.
1. __Lists__, which are sometimes called recursive vectors because lists can
contain other lists.
The chief difference between atomic vectors and lists is that atomic vectors are __homogeneous__, while lists can be __heterogeneous__. There's one other related object: `NULL`. `NULL` is often used to represent the absence of a vector (as opposed to `NA`, which is used to represent the absence of a value in a vector). `NULL` typically behaves like a vector of length 0. Figure \@ref(fig:datatypes) summarises the interrelationships.
```{r datatypes, echo = FALSE, out.width = "50%", fig.cap = "The hierarchy of R's vector types"}
knitr::include_graphics("diagrams/data-structures-overview.png")
```
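For instance, a minimal sketch of the difference between `NULL` and `NA`:

```{r}
length(NULL)         # NULL behaves like a vector of length 0
length(NA)           # NA is a value, so this is a vector of length 1
c(1, 2, NULL)        # NULL contributes nothing when combined with c()
c(1, 2, NA)          # NA is kept as a missing value
```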
@ -53,11 +67,11 @@ Every vector has two key properties:
Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create __augmented vectors__, which add additional behaviour. There are four important types of augmented vector:
* Factors are built on top of integer vectors.
* Dates and date-times are built on top of numeric vectors.
* Data frames and tibbles are built on top of lists.
This chapter will introduce you to these important vectors from simplest to most complicated. You'll start with atomic vectors, then build up to lists, and finish off with augmented vectors.
## Important types of atomic vector
@ -68,6 +82,8 @@ The four most important types of atomic vector are logical, integer, double, and
Logical vectors are the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`. Logical vectors are usually constructed with comparison operators, as described in [comparisons]. You can also create them by hand with `c()`:
```{r}
1:10 %% 3 == 0
c(TRUE, TRUE, FALSE, NA)
```
@ -81,53 +97,47 @@ typeof(1L)
1.5L
```
The distinction between integers and doubles is not usually important, but there are two important differences that you should be aware of:
1. Doubles are approximations. Doubles represent floating point numbers that
can not always be precisely represented with a fixed amount of memory.
This means that you should consider all doubles to be approximations.
For example, what is the square of the square root of two?
```{r}
x <- sqrt(2) ^ 2
x
x - 2
```
This behaviour is common when working with floating point numbers: most
calculations include some approximation error. Instead of comparing floating
point numbers using `==`, you should use `dplyr::near()` which allows for
some numerical tolerance.
1. Integers have one special value: `NA`, while doubles have four:
`NA`, `NaN`, `Inf` and `-Inf`. The last three can all arise during division:
```{r}
c(-1, 0, 1) / 0
```
Avoid using `==` to check for these other special values. Instead use the
helper functions `is.finite()`, `is.infinite()`, and `is.nan()`:
| | 0 | Inf | NA | NaN |
|------------------|-----|-----|-----|-----|
| `is.finite()` | x | | | |
| `is.infinite()` | | x | | |
| `is.na()` | | | x | x |
| `is.nan()` | | | | x |
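To make the two points above concrete, here is a short sketch (it assumes dplyr is installed, as the text above suggests):

```{r}
x <- sqrt(2) ^ 2
x == 2               # FALSE, because of floating point error
dplyr::near(x, 2)    # TRUE, near() allows for some numerical tolerance

y <- c(-1, 0, 1) / 0
is.finite(y)
is.infinite(y)
is.nan(y)
```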
### Character
Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.
You've already learned a lot about working with strings in [strings]. Here I wanted to mention one important feature of the underlying string implementation: R uses a global string pool. This means that each unique string is only stored in memory once, and every use of the string points to that representation. This reduces the amount of memory needed by duplicated strings. You can see this behaviour in practice with `pryr::object_size()`:
```{r}
x <- "This is a reasonably long string."
@ -141,7 +151,7 @@ pryr::object_size(y)
### Missing values
Note that each type of atomic vector has its own missing value:
```{r}
NA # logical
@ -150,7 +160,7 @@ NA_real_ # double
NA_character_ # character
```
Normally you don't need to know about these different types because you can always use `NA` and it will be converted to the correct type using the implicit coercion rules described next. However, there are some functions that are strict about their inputs, so it's useful to have this knowledge sitting in your back pocket so you can be specific when needed.
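For example, a small sketch of `NA` adapting to the type of the vector it is used in:

```{r}
typeof(c(TRUE, NA))   # logical
typeof(c(1L, NA))     # integer
typeof(c(1.5, NA))    # double
typeof(c("a", NA))    # character
```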
### Exercises
@ -166,23 +176,22 @@ Normally, you don't need to know about these different types because you can alw
integer. How do they differ? Be precise.
1. What functions from the readr package allow you to turn a string
into logical, integer, and double vectors?
## Using atomic vectors
Now that you understand the different types of atomic vector, it's useful to review some of the important tools for working with them. These include:
1. How to convert from one type to another, and when that happens
automatically.
1. How to tell if an object is a specific type of vector.
1. What happens when you work with vectors of different lengths.
1. How to name the elements of a vector.
1. How to pull out elements of interest.
### Coercion
@ -200,9 +209,9 @@ There are two ways to convert, or coerce, one type of vector to another:
vector with a numeric summary function, or when you use a double vector
where an integer vector is expected.
Because explicit coercion is used relatively rarely, and is largely easy to understand, I'll focus on implicit coercion here.
You've already seen the most important type of implicit coercion: using a logical vector in a numeric context. In this case `TRUE` is converted to `1` and `FALSE` is converted to `0`. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:
```{r}
x <- sample(20, 100, replace = TRUE)
@ -211,7 +220,7 @@ sum(y) # how many are greater than 10?
mean(y) # what proportion are greater than 10?
```
You may see some code (typically older) that relies on implicit coercion in the opposite direction, from integer to logical:
```{r, eval = FALSE}
if (length(x)) {
@ -229,13 +238,11 @@ typeof(c(1L, 1.5))
typeof(c(1.5, "a"))
```
An atomic vector can not have a mix of different types because the type is a property of the complete vector, not the individual elements. If you need to mix multiple types in the same vector, you should use a list, which you'll learn about shortly.
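As a small sketch of that distinction (lists are covered in more detail below):

```{r}
typeof(c(TRUE, 1L, "a"))     # everything is coerced to a single type: character
typeof(list(TRUE, 1L, "a"))  # a list, so each element keeps its own type
```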
### Test functions
Sometimes you want to do different things based on the type of vector. One option is to use `typeof()`. Another is to use a test function which returns a `TRUE` or `FALSE`. Base R provides many functions like `is.vector()` and `is.atomic()`, but they often return surprising results. Instead, it's safer to use the `is_*` functions provided by purrr, which are summarised in the table below.
| | lgl | int | dbl | chr | list |
|------------------|-----|-----|-----|-----|------|
@ -248,11 +255,11 @@ Base R provides many functions like `is.vector()` and `is.atomic()`, but they of
| `is_list()` | | | | | x |
| `is_vector()` | x | x | x | x | x |
Each predicate also comes with a "scalar" version, like `is_scalar_atomic()`, which checks that the length is 1. This is useful, for example, if you want to check that an argument to your function is a single logical value.
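For example, a brief sketch using `is_logical()` from the table above together with the scalar variant `is_scalar_atomic()` (purrr is already loaded above):

```{r}
is_logical(c(TRUE, FALSE, NA))
is_scalar_atomic(c(TRUE, FALSE, NA))  # FALSE: length is 3, not 1
is_scalar_atomic(TRUE)                # TRUE: an atomic vector of length 1
```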
### Scalars and recycling rules
As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors. This is called vector __recycling__, because the shorter vector is repeated, or recycled, to the same length as the longer vector.
This is generally most useful when you are mixing vectors and "scalars". I put scalars in quotes because R doesn't actually have scalars: instead, a single number is a vector of length 1. Because there are no scalars, most built-in functions are __vectorised__, meaning that they will operate on a vector of numbers. That's why, for example, this code works:
@ -261,7 +268,7 @@ sample(10) + 100
runif(10) > 0.5
```
In R, basic mathematical operations work with vectors. That means that you should never need to perform explicit iteration when performing simple mathematical computations.
It's intuitive what should happen if you add two vectors of the same length, or a vector and a "scalar", but what happens if you add two vectors of different lengths?
@ -269,16 +276,20 @@ It's intuitive what should happen if you add two vectors of the same length, or
1:10 + 1:2
```
Here, R will expand the shorter vector to the same length as the longer, so-called recycling. This is silent except when the length of the longer is not an integer multiple of the length of the shorter:
```{r}
1:10 + 1:3
```
While vector recycling can be used to create very succinct, clever code, it can also silently conceal problems. For this reason, the vectorised functions in tidyverse will throw errors when you recycle anything other than a scalar. If you do want to recycle, you'll need to do it yourself with `rep()`:
```{r, error = TRUE}
tibble::tibble(x = 1:4, y = 1:2)
tibble::tibble(x = 1:4, y = rep(1:2, 2))
tibble::tibble(x = 1:4, y = rep(1:2, each = 2))
```
### Naming vectors
@ -299,7 +310,7 @@ Named vectors are most useful for subsetting, described next.
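For example, a minimal sketch of two ways to name a vector: during creation with `c()`, or after the fact with `set_names()` from purrr (loaded above):

```{r}
c(x = 1, y = 2, z = 4)
set_names(1:3, c("a", "b", "c"))
```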
### Subsetting {#vector-subsetting}
So far we've used `dplyr::filter()` to filter the rows in a tibble. `filter()` only works with tibbles, so we'll need a new tool for vectors: `[`. `[` is the subsetting function, and is called like `x[a]`. There are four types of things that you can subset a vector with:
1. A numeric vector containing only integers. The integers must either be all
positive, all negative, or zero.
@ -336,12 +347,12 @@ So far we've used `dplyr::filter()` to filter the rows in a data frame. `filter(
x[0]
```
This is not useful very often, but it can be helpful if you want to create
unusual data structures to test your functions with.
1. Subsetting with a logical vector keeps all values corresponding to a
`TRUE` value. This is most often useful in conjunction with the
comparison functions.
```{r}
x <- c(10, 3, NA, 5, 8, 1, NA)
@ -371,15 +382,20 @@ So far we've used `dplyr::filter()` to filter the rows in a data frame. `filter(
all the columns, and `x[, -1]` selects all rows and all columns except
the first.
To learn more about the applications of subsetting, read the "Subsetting" chapter of _Advanced R_: <http://adv-r.had.co.nz/Subsetting.html#applications>.
There is an important variation of `[` called `[[`. `[[` only ever extracts a single element, and always drops names. It's a good idea to use it whenever you want to make it clear that you're extracting a single item, as in a for loop. The distinction between `[` and `[[` is most important for lists, as we'll see shortly.
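A minimal sketch of the difference on a named vector:

```{r}
x <- c(a = 1, b = 2, c = 3)
x["a"]    # `[` returns a length-1 vector and keeps the name
x[["a"]]  # `[[` extracts the element itself and drops the name
```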
### Exercises
1. What does `mean(is.na(x))` tell you about a vector `x`? What about
`sum(!is.finite(x))`?
1. Carefully read the documentation of `is.vector()`. What does it actually
test for? Why does `is.atomic()` not agree with the definition of
atomic vectors above?
1. Compare and contrast `setNames()` with `purrr::set_names()`.
1. Create functions that take a vector as input and return:
@ -490,15 +506,15 @@ a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
a[["a"]]
```
The distinction between `[` and `[[` is really important for lists, because `[[` drills down into the list while `[` returns a new, smaller list. Compare the code and output above with the visual representation in Figure \@ref(fig:lists-subsetting).
```{r lists-subsetting, echo = FALSE, out.width = "75%", fig.cap = "Subsetting a list, visually."}
knitr::include_graphics("diagrams/lists-subsetting.png")
```
### Lists of condiments
The difference between `[` and `[[` is very important, but it's easy to get confused. To help you remember, let me show you an unusual pepper shaker.
```{r, echo = FALSE, out.width = "25%"}
knitr::include_graphics("images/pepper.jpg")
@ -531,12 +547,13 @@ knitr::include_graphics("images/pepper-3.jpg")
1. `list(a, b, list(c, d), list(e, f))`
1. `list(list(list(list(list(list(a))))))`
1. What happens if you subset a tibble as if you're subsetting a list?
What are the key differences between a list and a tibble?
## Attributes
Any vector can contain arbitrary additional metadata through its __attributes__. You can think of attributes as a named list of vectors that can be attached to any object.
You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
```{r}
x <- 1:10
@ -552,7 +569,7 @@ There are three very important attributes that are used to implement fundamental
1. __Dimensions__ (dims, for short) make a vector behave like a matrix or array.
1. __Class__ is used to implement the S3 object oriented system.
You've seen names above, and we won't cover dimensions because we don't use matrices in this book. It remains to describe the class, which controls how __generic functions__ work. Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input. A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it in _Advanced R_ at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
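As a hedged sketch of how the class attribute drives method dispatch (the function names here are invented purely for illustration):

```{r}
# a made-up generic with a default method and a factor method
describe <- function(x) UseMethod("describe")
describe.default <- function(x) "some other kind of object"
describe.factor <- function(x) paste("a factor with levels:", paste(levels(x), collapse = ", "))

describe(1:3)
describe(factor(c("a", "b")))
```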
Here's what a typical generic function looks like:
@ -579,7 +596,7 @@ The most important S3 generic is `print()`: it controls how the object is printe
## Augmented vectors
Atomic vectors and lists are the building blocks for other important vector types like factors and dates. I call these __augmented vectors__, because they are vectors with additional __attributes__, including class. Because augmented vectors have a class, they behave differently to the atomic vector on which they are built. In this book, we make use of four important augmented vectors:
* Factors.
* Date-times and times.
@ -597,18 +614,9 @@ typeof(x)
attributes(x)
```
### Dates and date-times
Dates in R are numeric vectors that represent the number of days since 1 January 1970.
```{r}
x <- as.Date("1971-01-01")
@ -618,7 +626,7 @@ typeof(x)
attributes(x)
```
Date-times are numeric vectors with class `POSIXct` that represent the number of seconds since 1 January 1970. (In case you were wondering, "POSIXct" stands for "Portable Operating System Interface", calendar time.)
```{r}
x <- lubridate::ymd_hm("1970-01-01 01:00")
@ -628,7 +636,7 @@ typeof(x)
attributes(x)
```
The `tzone` attribute is optional. It controls how the time is printed, not what absolute time it refers to.
```{r}
attr(x, "tzone") <- "US/Pacific"
@ -646,7 +654,7 @@ typeof(y)
attributes(y)
```
POSIXlts are rare inside the tidyverse. They do crop up in base R, because they are needed to extract specific components of a date, like the year or month. Since lubridate provides helpers for you to do this instead, you don't need them. POSIXcts are always easier to work with, so if you find you have a POSIXlt, you should convert it back to a regular date-time with `lubridate::as_datetime()` (or base R's `as.POSIXct()`).
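A small sketch of that conversion (assuming lubridate is installed):

```{r}
y <- as.POSIXlt("1970-01-01 01:00", tz = "UTC")
class(y)
class(lubridate::as_datetime(y))  # back to POSIXct
```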
### Tibbles
@ -658,7 +666,7 @@ typeof(tb)
attributes(tb)
```
The difference between a tibble and a list is that all the elements of a tibble must be vectors with the same length. All functions that work with tibbles enforce this constraint.
Traditional data.frames have a very similar structure:
@ -678,3 +686,6 @@ The main difference is the class. The class of tibble includes "data.frame" whic
1. Try and make a tibble that has columns with different lengths. What
happens?
1. Based on the definition above, is it ok to have a list as a
column of a tibble?