More about vectors

hadley 2016-04-18 08:45:27 -05:00
parent 67516034f7
commit 8b7f709a57
3 changed files with 143 additions and 73 deletions


@ -1,10 +1,15 @@
# Data structures
# Vectors
```{r setup, include = FALSE}
library(purrr)
library(dplyr)
```
Often, when you write a function it will work with a single vector (or a handful of vectors), rather than a data frame. So far we've focussed on tools, like dplyr, that work with data frames, and have talked little about vector. Now it's time to dive deep and learn how you can work with vectors to build your own functions to automate common problems.
So far this book has focussed on data frames and packages that work with them. But as you start to write your own functions, and dig deeper into R, you need to learn about vectors, the objects that underpin data frames. If you've learned R in a more traditional way, you're probably familiar with vectors already, as most R resources start with vectors and work their way up to data frames. I think it's better to start with data frames because they're immediately useful, and then work your way down to the underlying components.
Vectors are particularly important as it's easier to learn to write functions that work with vectors, rather than data frames. The technology that lets ggplot2, tidyr, dplyr etc. work with data frames is considerably more complex and not currently standardised. While I'm currently working on a new standard that will make life much easier, it's unlikely to be ready in time for this book.
## Vector overview
There are two types of vectors:
@ -13,9 +18,9 @@ There are two types of vectors:
__raw__. Integer and double vectors are collectively known as
__numeric__ vectors.
1. __Lists__, which sometimes called recursive vectors, because lists can
1. __Lists__, which are sometimes called recursive vectors, because lists can
contain other lists. This is the chief difference between atomic vectors
and lists.
and lists: atomic vectors are homogeneous, lists can be heterogeneous.
There's a somewhat related object: `NULL`. It's often used to represent the absence of a vector (as opposed to `NA` which is used to represent the absence of a value in a vector). `NULL` typically behaves like a vector of length 0.
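A quick sketch of the difference, using only base R: `NULL` has length 0 and vanishes when combined with other values, while `NA` is a placeholder that still occupies a position in the vector.

```{r}
length(NULL)     # 0: there is no vector at all
length(NA)       # 1: a vector holding one missing value

c(1, NULL, 2)    # NULL contributes no elements
c(1, NA, 2)      # NA keeps its position

is.null(NULL)
```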
@ -43,15 +48,15 @@ Every vector has two key properties:
Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create __augmented vectors__ which build on additional behaviour. There are four important types of augmented vector:
* __Factors__ and __dates__ are built on top of integers.
* __Date times__ (POSIXct) are built on of doubles.
* Data frames and __tibbles__ are built on top of lists.
* Factors and dates are built on top of integer vectors.
* Date times (POSIXct) are built on top of double vectors.
* Data frames and tibbles are built on top of lists.
This chapter will introduce you to these important vectors from simplest to most complicated. You'll start with atomic vectors, then build up to lists, and finally learn about augmented vectors.
## Types of atomic vector
## Important types of atomic vector
The four most important types of atomic vector are logical, integer, double, and character. Raw and complex are rarely used during a data analysis, so I don't discuss them here.
The four most important types of atomic vector are logical, integer, double, and character. Raw and complex are rarely used during a data analysis, so I won't discuss them here.
Each type of atomic vector has its own missing value:
@ -62,15 +67,7 @@ NA_real_ # double
NA_character_ # character
```
Normally, you don't need to know about these different types because you can always use `NA` it will be converted to the correct type. However, there are some functions that are strict about their inputs, so it's useful to have this knowledge sitting in your back pocket so you can use a specific type of missing value when needed.
Note that R does not have "scalars". In R, a single number is a vector of length 1. The impacts of this are mostly on how functions work. Because there are no scalars, most built-in functions are __vectorised__, meaning that they will operate on a vector of numbers. That's why, for example, this code works:
```{r}
1:10 + 2:11
```
In R, basic mathematical operations work with vectors, not scalars like in most programming languages. This means that you should never need to write an explicit for loop when performing simple computations on vectors.
Normally, you don't need to know about these different types because you can always use `NA` and it will be converted to the correct type. However, there are some functions that are strict about their inputs, so it's useful to have this knowledge sitting in your back pocket so you can be specific when needed.
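Here is a small sketch of both behaviours. The choice of `vapply()` as the "strict" function is my own illustration: it insists that every result matches the declared type, so a plain (logical) `NA` is rejected where `NA_character_` is accepted.

```{r}
typeof(NA)            # plain NA is a logical value
typeof(c(1.5, NA))    # ...but it is converted to match its neighbours
typeof(c("a", NA))

# vapply() checks the type of every result against character(1)
strict <- tryCatch(
  vapply(1:3, function(i) NA, character(1)),
  error = function(e) "error"
)
strict

# Being specific about the missing value satisfies the check
vapply(1:3, function(i) NA_character_, character(1))
```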
### Logical
@ -82,18 +79,22 @@ c(TRUE, TRUE, FALSE, NA)
### Numeric
Integer and double vectors are known collectively as numeric vectors and most of the time the distinction is not important, so we'll discuss them together.
In R, numbers are doubles by default. To make an integer, place a `L` after the number:
Integer and double vectors are known collectively as numeric vectors. In R, numbers are doubles by default. To make an integer, place an `L` after the number:
```{r}
typeof(1)
typeof(1L)
1.5L
```
There are two important differences between integers and doubles: doubles are approximations, and they have three extra special values.
Most of the time the distinction between integers and doubles is not important. However, there are two important differences that you need to be aware of:
Doubles represent floating point numbers that can not always be precisely represented with a fixed amount of memory. This means that you should consider all doubles to be approximations, and you should never test for equality. For example, what is square of the square root of two?
1. Doubles are approximations.
1. Integers have one special value, `NA_integer_`, while doubles have four:
`NA_real_`, `NaN`, `Inf` and `-Inf`.
Doubles represent floating point numbers that can not always be precisely represented with a fixed amount of memory. This means that you should consider all doubles to be approximations. For example, what is the square of the square root of two?
```{r}
x <- sqrt(2) ^ 2
@ -116,7 +117,7 @@ dplyr::near(x, 2)
Doubles also have three special values in addition to `NA`:
```{r}
c(-1, 0, 1) / 0
c(NA, -1, 0, 1) / 0
```
Avoid using `==` to check for these other special values. Instead use the helper functions `is.finite()`, `is.infinite()`, and `is.nan()`:
@ -132,17 +133,15 @@ Note that `is.finite(x)` is not the same as `!is.infinite(x)`.
### Character
Character vectors are the most complex of atomic vectors, because each element of a character vector is a string, and a string can contain an arbitrary amount of data. Strings are such an important data type, they have their own chapter: [strings].
Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data. Strings are such an important data type, they have their own chapter: [strings].
Here I wanted to mention one important feature of the underlying string implementation: it uses a global string pool. This means that each unique string is only stored in memory once, and every use of the string points to that representation. This reduces the amount of memory needed by duplicated strings.
You can see this behaviour in practice by using `pryr::object_size()`:
Here I want to mention one important feature of the underlying string implementation: R uses a global string pool. This means that each unique string is only stored in memory once, and every use of the string points to that representation. This reduces the amount of memory needed by duplicated strings. You can see this behaviour in practice with `pryr::object_size()`:
```{r}
x <- "This is a reasonably long string."
pryr::object_size(x)
y <- rep(x, 1000)
y <- rep(x, 1000)
pryr::object_size(y)
```
@ -155,38 +154,47 @@ pryr::object_size(y)
1. A logical vector can take 3 possible values. How many possible
values can an integer vector take?
1. List four functions that allow you to convert a double to an
integer. How do they differ?
1. Brainstorm at least four functions that allow you to convert a double to an
integer. How do they differ? Be precise.
1. What functions from the readr package allow you to turn a string
into a logical, integer, or double vector?
## Using atomic vectors
Now that you understand the different types of atomic vector, it's useful to review some of the important tools for working with them:
Now that you understand the different types of atomic vector, it's useful to review some of the important tools for working with them. These include:
1. The implicit coercion rules which govern what happens when, for example,
you use a logical vector in a numeric context.
1. Tools to test if a function input is a specific type of vector.
1. R's recycling rules which govern what happens when you attempt to work
with vectors of different lengths.
1. Naming the elements of a vector.
1. The coercion rules
1. Testing if an input is of a given type
1. How to create named vectors.
1. Subsetting a vector to pull out elements of interest.
### Coercion
There are two ways to convert, or coerce, one type of vector to another:
1. Implicit coercion happens when you use a vector in a specific context
that expects a certain type of vector. For example, when you use a logical
vector with a numeric summary function, or when you use a double vector
where an integer vector is expected.
1. Explicit coercion happens when you call a function like `as.logical()`,
`as.integer()`, `as.double()`, and `as.character()`. Whenever you find
`as.integer()`, `as.double()`, or `as.character()`. Whenever you find
yourself using explicit coercion, you should always check whether you can
make the fix upstream, so that the vector never had the wrong type in
the first place. For example, you may need to tweak your readr
`col_types` specification.
1. Implicit coercion happens when you use a vector in a specific context
that expects a certain type of vector. For example, when you use a logical
vector with a numeric summary function, or when you use a double vector
where an integer vector is expected.
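As a rough illustration of explicit coercion (the input values are made up for this sketch), note how values that can't be converted become `NA`, which is often a hint that the real fix belongs upstream:

```{r}
x <- c("1", "2", "three")

as.integer(x)                           # "three" can't be parsed: NA, with a warning
as.logical(c("TRUE", "false", "yes"))   # "yes" is not recognised, so it becomes NA
as.character(c(1.5, 2))
```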
Because explicit coercion is used relatively rarely, it's more important to understand implicit coercion. The most important implicit coercion is logical to numeric. When used in a numeric context: `TRUE` is converted to `1`, `FALSE` converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues.
Because explicit coercion is used relatively rarely (and is largely easy to understand), it's more important to understand implicit coercion.
The most important type of implicit coercion is using a logical vector in a numeric context. In this case `TRUE` is converted to `1` and `FALSE` is converted to `0`. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues:
```{r}
x <- sample(20, 100, replace = TRUE)
@ -195,7 +203,17 @@ sum(y) # how many are greater than 10?
mean(y) # what proportion are greater than 10?
```
It's also important to understand what happens when you try and create a vector containing multiple types with `c()`: the most complex type always wins. The type is a property of the complete vector, not the individual elements, so there's no way to have an atomic vector which is a mix of different types. If you need to mix multiple types in the same vector, you should use a list, which you'll learn about shortly.
You may see some code (typically older) that relies on the implicit coercion in the opposite direction, from integer to logical:
```{r, eval = FALSE}
if (length(x)) {
# do something
}
```
In this case, 0 is converted to `FALSE` and everything else is converted to `TRUE`. I think this makes it harder to understand your code, and I don't recommend it. Instead, be explicit: `length(x) > 0`.
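A minimal sketch of the explicit alternative, which compares the length directly rather than relying on integer-to-logical coercion. The function name `describe()` is hypothetical and only serves the example:

```{r}
# Hypothetical helper: states the condition being tested in plain sight
describe <- function(x) {
  if (length(x) > 0) {
    "has elements"
  } else {
    "empty"
  }
}

describe(integer())
describe(1:3)
```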
It's also important to understand what happens when you try to create a vector containing multiple types with `c()`: the most complex type always wins.
```{r}
str(c(TRUE, 1L))
@ -203,9 +221,13 @@ str(c(1L, 1.5))
str(c(1.5, "a"))
```
An atomic vector can not have a mix of different types because the type is a property of the complete vector, not of the individual elements. If you need to mix multiple types in the same vector, you should use a list, which you'll learn about shortly.
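To make the contrast concrete, here is a short sketch: `c()` forces everything to a single type, while `list()` lets each element keep its own.

```{r}
mixed <- c(1, "a", TRUE)
typeof(mixed)    # one type for the whole vector
mixed            # every element is now a string

kept <- list(1, "a", TRUE)
str(kept)        # each element keeps its own type
```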
### Test functions
It's also useful to be able to test what type of thing you have in an unknown object. Base R provides many functions like `is.vector()` and `is.atomic()`, but they often don't do what you expect. Instead, it's safer to use the `is_*` functions provided by purrr, which are summarised in the table below.
Sometimes you want to do different things based on the type of vector you get. One option is to use `typeof()`. Another is to use a test function which returns a `TRUE` or `FALSE` (broadly, functions that return a single logical value are often called __predicate__ functions).
Base R provides many functions like `is.vector()` and `is.atomic()`, but they are often surprising. Instead, it's safer to use the `is_*` functions provided by purrr, which are summarised in the table below.
| | lgl | int | dbl | chr | list |
|------------------|-----|-----|-----|-----|------|
@ -220,6 +242,39 @@ It's also useful to be able to test what type of thing you have in an unknown ob
Each predicate also comes with a "scalar" version, which checks that the length is 1. This is useful if you want to check (for example) that the inputs to your function are as you expect.
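The sketch below shows, in base R only, the kind of surprise I mean, and what a "scalar" check amounts to. The attribute name `"units"` and the helper `check_scalar_double()` are made up for illustration; the helper is a hand-rolled stand-in, not purrr's implementation.

```{r}
x <- 1:3
attr(x, "units") <- "cm"   # an arbitrary attribute, chosen for illustration

typeof(x)                  # still an integer vector
is.atomic(x)
is.vector(x)               # FALSE: any attribute other than names disqualifies it
is.vector(list(1, "a"))    # TRUE: lists count as vectors here

# Hypothetical helper: the right type AND exactly one element
check_scalar_double <- function(x) is.double(x) && length(x) == 1
check_scalar_double(1.5)
check_scalar_double(c(1.5, 2))
check_scalar_double(1L)
```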
### Scalars and recycling rules
As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors. This is called vector "recycling", because the shorter vector is repeated, or __recycled__, to be the same length as the longer vector.
This is generally most useful when you are mixing vectors and "scalars". But note that R does not actually have scalars. In R, a single number is a vector of length 1. Because there are no scalars, most built-in functions are __vectorised__, meaning that they will operate on a vector of numbers. That's why, for example, this code works:
```{r}
sample(10) + 100
runif(10) > 0.5
```
In R, basic mathematical operations work with vectors, not scalars like in most programming languages. This means that you should never need to perform explicit iteration (either with a loop or a map function) when performing simple mathematical computations.
It's intuitive what should happen if you add two vectors of the same length, or a vector and a "scalar", but what happens if you add two vectors of different lengths?
```{r}
1:10 + 1:2
```
Here, R will expand the shortest vector to the same length as the longest, so-called __recycling__. This is silent except in the case where the length of the longer is not an integer multiple of the length of the shorter:
```{r}
1:10 + 1:3
```
While vector recycling can be used to create very succinct, clever code, it can also silently conceal problems. For this reason, the vectorised functions in dplyr, purrr, etc will throw errors when you recycle anything other than a scalar.
```{r, error = TRUE}
data.frame(x = 1:4, y = 1:2)
dplyr::data_frame(x = 1:4, y = 1:2)
purrr::map2(1:4, 1:2, `+`)
```
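If you do want recycling, one option is to spell it out yourself with `rep()`. This is a sketch of that idea rather than a rule imposed by any package: it produces the same result as the implicit version, but the intent is visible in the code.

```{r}
implicit <- 1:10 + 1:2
explicit <- 1:10 + rep(1:2, times = 5)
identical(implicit, explicit)

rep(1:2, times = 5)   # repeat the whole vector
rep(1:2, each = 5)    # repeat each element in turn
```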
### Naming vectors
All types of vectors can be named. You can either name them during creation with `c()`:
@ -238,22 +293,14 @@ Named vectors are most useful for subsetting, described next.
### Subsetting
Before we continue on to a richer data structure, the list, we need to take a brief detour to talk about subsetting vectors. So far, we've focussed on data frames, which are most easily subset with `dplyr::filter()`. `filter()`, however, does not work with vectors, so we need to learn a new tool: `[`.
So far we've used `dplyr::filter()` to filter the rows in a data frame. `filter()`, however, does not work with vectors, so we need to learn a new tool: `[`. `[` is the subsetting function, and is called like `x[a]`. We're not going to cover 2d and higher data structures here, but the idea generalises in a straightforward way: `x[a, b]` for 2d, `x[a, b, c]` for 3d, and so on.
`[` is the subsetting function, and is called like `x[a]`. We're not going to cover data structures that are 2d or higher in detail, but the idea generalised to `x[a, b]`, `x[a, b, c]` and so on. When working with individual vectors, it's important to understand how `[` works and how you can use it to extract elements of interest.
There are four types of thing that you can subset a vector with:
There are three four types of thing you can use to subset a vector:
1. The simplest type of subsetting is nothing, `x[]`, which returns the
complete `x`. This is not useful for subsetting vectors, but it is useful
when subsetting matrices (and other high dimensional structures) because
it lets you select all the rows or all the columns, by leaving that
index blank.
1. A numeric vector. If you subset with a numeric vector, it must either
be all positive, all negative, or zero.
1. A numeric vector containing only integers. The integers must either be all
positive, all negative, or zero.
Subsetting with a positive vector keeps the elements at those positions:
Subsetting with positive integers keeps the elements at those positions:
```{r}
x <- c("one", "two", "three", "four", "five")
@ -273,7 +320,7 @@ There are three four types of thing you can use to subset a vector:
x[c(-1, -3, -5)]
```
It's an error to mix position and negative values:
It's an error to mix positive and negative values:
```{r, error = TRUE}
x[c(1, -1)]
@ -310,6 +357,14 @@ There are three four types of thing you can use to subset a vector:
Like with positive integers, you can also use a character vector to
duplicate individual entries.
1. The simplest type of subsetting is nothing, `x[]`, which returns the
complete `x`. This is not useful for subsetting vectors, but it is useful
when subsetting matrices (and other high dimensional structures) because
it lets you select all the rows or all the columns, by leaving that
index blank. For example, if `x` is 2d, `x[1, ]` selects the first row and
all the columns, and `x[, -1]` selects all rows and all columns except
the first.
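For instance, with a small matrix (the values are arbitrary and only there to make the positions easy to follow):

```{r}
m <- matrix(1:6, nrow = 2)
m

m[1, ]     # first row, all columns
m[, -1]    # all rows, every column except the first
m[]        # nothing selected: the whole matrix comes back
```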
I'd recommend reading <http://adv-r.had.co.nz/Subsetting.html#applications> to learn more about how you can use subsetting to achieve various goals. If you are working with data frames, you can typically use a dplyr function to achieve these goals, but the techniques are useful to know about when you are writing your own functions.
There is an important variation of `[` called `[[`. `[[` only ever extracts a single element, and always drops names. It's a good idea to use it whenever you want to make it clear that you're extracting one thing, as in a for loop. The distinction between `[` and `[[` is most important for lists, as we'll see shortly.
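A short sketch of the difference on a named atomic vector:

```{r}
x <- c(a = 1, b = 2, c = 3)

x["b"]            # a named vector of length 1
x[["b"]]          # just the value, with the name dropped
x[c("a", "b")]    # `[` can return several elements
```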
@ -317,7 +372,8 @@ There is an important variation of `[` called `[[`. `[[` only ever extracts a si
### Exercises
1. Carefully read the documentation of `is.vector()`. What does it actually
test for?
test for? Why does `is.atomic()` not agree with the definition of
atomic vectors above?
1. Create functions that take a vector as input and return:
@ -335,7 +391,7 @@ There is an important variation of `[` called `[[`. `[[` only ever extracts a si
## Recursive vectors (lists)
Lists are a fundamentally richer than atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with `list()`:
Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with `list()`:
```{r}
x <- list(1, 2, 3)
@ -359,7 +415,7 @@ z <- list(list(1, 2), list(3, 4))
str(z)
```
`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.
`str()` is very helpful when looking at lists because it focusses on the **str**ucture, not the contents.
### Visualising lists
@ -456,16 +512,19 @@ knitr::include_graphics("images/pepper-3.jpg")
### Exercises
1. Draw the following lists as nested sets.
1. Draw the following lists as nested sets:
1. Generate the lists corresponding to these nested set diagrams.
1. `list(a, b, list(c, d), list(e, f))`
1. `list(list(list(list(list(list(a))))))`
1. What happens if you subset a data frame as if you're subsetting a list?
What are the key differences between a list and a data frame?
## Augmented vectors
There are four important types of vector that are built on top of atomic vectors: factors, dates, date times, and data frames. I call these augmented vectors, because they are atomic vectors with additional __attributes__. Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
Atomic vectors and lists are the building blocks for four other important vector types: factors, dates, date times, and data frames. I call these __augmented vectors__, because they are vectors with additional __attributes__.
Attributes are a way of adding arbitrary additional metadata to a vector. You can think of attributes as a named list of vectors that can be attached to any object. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
```{r}
x <- 1:10
@ -477,17 +536,21 @@ attributes(x)
There are three very important attributes that are used to implement fundamental parts of R:
* "names" are used to name the elements of a vector.
* "names" are used to name the elements of a vector.
* "dim" (short for dimensions) makes a vector behave like a matrix or array.
* "class" is used to implement the S3 object oriented system.
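To see the first two in action, here is a small sketch: setting names labels the elements, and setting the `dim` attribute is all it takes for an ordinary vector to start behaving like a matrix.

```{r}
y <- c(10, 20, 30)
names(y) <- c("a", "b", "c")
attributes(y)

x <- 1:6
dim(x) <- c(2, 3)    # 2 rows, 3 columns
x
attributes(x)
is.matrix(x)
```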
Class is particularly important because it changes what __generic functions__ do with the object. Generic functions are key to OO in R. Here's what a typical generic function looks like:
### S3
Class is particularly important because it changes what __generic functions__ do with the object. Generic functions are key to object oriented programming in R, and are what make augmented vectors behave differently to the vector they are built on top of. A detailed discussion of the S3 object oriented system is beyond the scope of this book, but you can read more about it at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
Here's what a typical generic function looks like:
```{r}
as.Date
```
The call to "UseMethod" means that this is a generic function, and it will call a specific __method__, based on the class of the first argument. You can list all the methods for a generic with `methods()`:
The call to "UseMethod" means that this is a generic function, and it will call a specific __method__, a function, based on the class of the first argument. (All methods are functions; not all functions are methods.) You can list all the methods for a generic with `methods()`:
```{r}
methods("as.Date")
@ -502,8 +565,6 @@ getS3method("as.Date", "numeric")
The most important S3 generic is `print()`: it controls how the object is printed when you type its name on the console. Other important generics are the subsetting functions `[`, `[[`, and `$`.
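As a rough sketch of how this works, you can give an object a class and then supply your own `print()` method for it. The class name `my_record` and its contents are invented for this example:

```{r}
# A list with a made-up class attribute
r <- structure(list(value = 42), class = "my_record")

# A print method: print() dispatches here for objects of class "my_record"
print.my_record <- function(x, ...) {
  cat("<my_record> value: ", x$value, "\n", sep = "")
  invisible(x)
}

r             # typing the name calls print(), which finds print.my_record()
unclass(r)    # without the class, it prints like the plain list underneath
```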
A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
### Factors
Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:
@ -514,9 +575,9 @@ typeof(x)
attributes(x)
```
Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \<sigh\>](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is the modelling context. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow"
Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \<sigh\>](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is modelling. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow".
The packages in this book keep characters as is, but you will need to deal with them if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first. Often there will be `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn back into a factor.
The packages in this book keep characters as is, but you will need to deal with factors if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first place. Often there will be a `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn it back into a character vector.
```{r}
x <- factor(letters[1:5])
@ -546,9 +607,18 @@ typeof(x)
attributes(x)
```
The `tzone` is optional, and only controls the way the date is printed not what it means.
The `tzone` is optional. It controls how the time is printed, not what absolute time it refers to.
There is another type of datetimes called POSIXlt. These are built on top of named lists.
```{r}
attr(x, "tzone") <- "US/Pacific"
x
attr(x, "tzone") <- "US/Eastern"
x
```
There is another type of date-time called POSIXlt. These are built on top of named lists:
```{r}
y <- as.POSIXlt(x)

Binary file not shown.