More about data structures

This commit is contained in:
hadley 2016-04-06 09:02:11 -05:00
parent 93cb8b5f73
commit 39e23c46de
2 changed files with 192 additions and 74 deletions

@ -1,11 +1,10 @@
# Data structures
```{r, include = FALSE}
```{r setup, include = FALSE}
library(purrr)
library(dplyr)
```
As you start to write more functions, and as you want your functions to work with more types of inputs, it's useful to have some grounding in the underlying data structures that R is built on. This chapter will dive deeper into the objects that you've already used, helping you better understand how things work.
Often, when you write a function it will work with a single vector (or a handful of vectors), rather than a data frame. So far we've focussed on tools, like dplyr, that work with data frames, and have talked little about vectors. Now it's time to dive deep and learn how you can work with vectors to build your own functions to automate common problems.
This chapter focusses on __vectors__, the most important family of objects in R. They are the most important because you work with them most frequently in a data analysis. You will use other types of objects, like functions and environments, but by and large you don't need to understand the details of these data types. If you are interested in learning the precise details, you'll need to learn about R's underlying C API, which is beyond the scope of this book; <http://adv-r.had.co.nz/C-interface.html#c-data-structures> is a good place to start.
@ -20,7 +19,7 @@ There are two types of vectors:
contain other lists. This is the chief difference between atomic vectors
and lists.
The structure of the vector types is summarised in the following diagram
The structure of the vector types is summarised in the following diagram:
```{r, echo = FALSE}
knitr::include_graphics("diagrams/data-structures-overview.png")
@ -50,31 +49,10 @@ Vectors can also contain arbitrary additional metadata in the form of attributes
This chapter will introduce you to these important vector types from simplest to most complicated. You'll start with atomic vectors, then build up to lists, and finally learn about augmented vectors.
## Atomic vectors
## Atomic vector theory
The four most important types of atomic vector are logical, integer, double, and character. Integer and double are known collectively as numeric vectors and most of the time the distinction is not important, so we'll discuss them together. The two remaining types, complex and raw, are beyond the scope of this book because they are rarely needed to do data analysis. The following sections describe each type in turn.
Note that R does not have "scalars". In R, a single number is a vector of length 1. The impacts of this are mostly on how functions work. Because there are no scalars, most built-in functions are __vectorised__, meaning that they will operate on a vector of numbers. That's why, for example, this code works:
```{r}
1:10 + 2:11
```
In R, basic mathematical operations work with vectors, not scalars like in most programming languages. This means that you should never need to write an explicit for loop when performing simple computations on vectors.
Each type of atomic vector has its own missing value:
```{r}
NA # logical
NA_integer_ # integer
NA_real_ # double
NA_character_ # character
```
Normally, you don't need to know about these different types because you can always use `NA` and it will be converted to the correct type. However, there are some functions that are strict about their inputs, so it's useful to have this knowledge sitting in your back pocket so you can use a specific type of missing value when needed.
You can convert from one type of atomic vector to another with `as.logical()`, `as.integer()`, `as.double()`, and `as.character()`. However, before doing so, you should carefully consider whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr `col_types` specification.
### Logical
Logical vectors are the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`. Logical vectors are usually constructed with comparison operators, as described in [comparisons]. You can also create them by hand with `c()`:
@ -91,6 +69,7 @@ y <- x > 10
sum(y) # how many are greater than 10?
mean(y) # what proportion are greater than 10?
```
### Numeric
Numeric vectors include both integers and doubles (real numbers). In R, numbers are doubles by default. To make an integer, place an `L` after the number:
@ -161,12 +140,190 @@ pryr::object_size(y)
1. Read the source code for `dplyr::near()`. How does it work?
## Atomic vector practice
### Scalars
Note that R does not have "scalars". In R, a single number is a vector of length 1. The impacts of this are mostly on how functions work. Because there are no scalars, most built-in functions are __vectorised__, meaning that they will operate on a vector of numbers. That's why, for example, this code works:
```{r}
1:10 + 2:11
```
In R, basic mathematical operations work with vectors, not scalars like in most programming languages. This means that you should never need to write an explicit for loop when performing simple computations on vectors.
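To make the contrast concrete, here is a small sketch that computes the same element-wise sum twice: once with an explicit for loop and once with the vectorised `+`. The variable names (`loop_sum`, `vec_sum`) are made up for illustration.

```{r}
x <- 1:10
y <- 2:11

# Explicit loop: preallocate the output, then fill it element by element
loop_sum <- vector("integer", length(x))
for (i in seq_along(x)) {
  loop_sum[[i]] <- x[[i]] + y[[i]]
}

# Vectorised form: one expression operates on every element
vec_sum <- x + y

identical(loop_sum, vec_sum)
```

Both approaches give the same answer, but the vectorised version is shorter and leaves less room for indexing mistakes.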
Each type of atomic vector has its own missing value:
```{r}
NA # logical
NA_integer_ # integer
NA_real_ # double
NA_character_ # character
```
Normally, you don't need to know about these different types because you can always use `NA` and it will be converted to the correct type. However, there are some functions that are strict about their inputs, so it's useful to have this knowledge sitting in your back pocket so you can use a specific type of missing value when needed.
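The following sketch shows both halves of that story using only base R. `vapply()` is used here as one example of a strict function; it is my choice for illustration, not necessarily the function you'll run into, and `strict_ok` and `strict_err` are made-up names.

```{r}
# Each missing value has its own type
typeof(NA)
typeof(NA_integer_)
typeof(NA_real_)
typeof(NA_character_)

# A plain NA is converted to match the rest of the vector
typeof(c(NA, 1L))
typeof(c(NA, "a"))

# vapply() is strict: a logical NA is rejected where a character is required
strict_ok <- vapply(1:2, function(i) NA_character_, character(1))
strict_err <- tryCatch(
  vapply(1:2, function(i) NA, character(1)),
  error = function(e) "error"
)
strict_err
```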
### Coercion
You can convert from one type of atomic vector to another with `as.logical()`, `as.integer()`, `as.double()`, and `as.character()`. However, before doing so, you should carefully consider whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr `col_types` specification.
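A few quick conversions illustrate why fixing the type upstream is usually safer: explicit coercion can silently lose information. This is only a sketch with made-up inputs.

```{r}
as.integer(c(1.9, -1.9))               # truncates towards zero
as.logical(c("TRUE", "false", "yes"))  # unrecognised strings become NA
as.character(c(TRUE, NA))

# A failed conversion yields NA with a warning, which is easy to miss
res <- suppressWarnings(as.integer(c("1", "2", "three")))
res
```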
### Named vectors
All types of vectors can be named. You can either name them during creation with `c()`:
```{r}
c(x = 1, y = 2, z = 4)
```
Or after the fact with `purrr::set_names()`:
```{r}
1:3 %>% set_names(c("a", "b", "c"))
```
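If you'd rather not depend on purrr, base R offers two equivalents. This sketch uses `names<-` and `stats::setNames()`; which one you reach for is largely a matter of taste.

```{r}
x <- 1:3
names(x) <- c("a", "b", "c")
x
names(x)

# setNames() returns a named copy, so it fits inside a larger expression
y <- setNames(1:3, c("a", "b", "c"))
identical(x, y)
```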
### Test functions
| | lgl | int | dbl | chr | list | null |
|------------------|-----|-----|-----|-----|------|------|
| `is_logical()` | x | | | | | |
| `is_integer()` | | x | | | | |
| `is_double()` | | | x | | | |
| `is_numeric()` | | x | x | | | |
| `is_character()` | | | | x | | |
| `is_atomic()` | x | x | x | x | | |
| `is_list()` | | | | | x | |
| `is_vector()` | x | x | x | x | x | |
| `is_null()` | | | | | | x |
Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less surprising:
```{r}
is.atomic(NULL)
is_atomic(NULL)
is.vector(factor("a"))
is_vector(factor("a"))
```
I recommend using these instead of the base functions.
Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.
```{r}
y <- factor(c("a", "b", "c"))
is_integer(y)
is_scalar_integer(y)
is_bare_integer(y)
```
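To see why the three checks above can disagree for the same object, it helps to look at the pieces they inspect. The sketch below uses only base R to pull apart the type, the class, and the length of the factor.

```{r}
y <- factor(c("a", "b", "c"))
typeof(y)       # the underlying storage type
class(y)        # the S3 class layered on top
attributes(y)   # the metadata that makes it a factor
length(y)       # more than one element, so not a "scalar"
is.integer(y)   # base R takes the class into account and reports FALSE
```

A factor is stored as an integer vector, so a check that looks only at the type sees an integer, while a check that also requires length 1 or the absence of a class does not.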
### Exercises
1. Carefully read the documentation of `is.vector()`. What does it actually
test for?
### Subsetting
Before we continue on to a richer data structure, the list, we need to take a brief detour to talk about subsetting vectors. So far, we've focussed on data frames, which are most easily subset with `dplyr::filter()`. `filter()`, however, does not work with vectors, so we need to learn a new tool: `[`.
`[` is the subsetting function, and is called like `x[a]`. We're not going to cover data structures that are 2d or higher in detail, but the idea generalises to `x[a, b]`, `x[a, b, c]` and so on. When working with individual vectors, it's important to understand how `[` works and how you can use it to extract elements of interest.
There are four types of thing you can use to subset a vector:
1. The simplest type of subsetting is nothing, `x[]`, which returns the
complete `x`. This is not useful for subsetting vectors, but it is useful
when subsetting matrices (and other high dimensional structures) because
it lets you select all the rows or all the columns, by leaving that
index blank.
1. A numeric vector. If you subset with a numeric vector, it must either
be all positive, all negative, or zero.
Subsetting with a positive vector keeps the elements at those positions:
```{r}
x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
```
By repeating a position, you can actually make a longer output than
input:
```{r}
x[c(1, 1, 5, 5, 5, 2)]
```
Negative values drop the elements at the specified positions:
```{r}
x[c(-1, -3, -5)]
```
It's an error to mix positive and negative values:
```{r, error = TRUE}
x[c(1, -1)]
```
The error message mentions subsetting with zero, which returns no values:
```{r}
x[0]
```
This is not generally useful, but can be helpful if you want to create
unusual data structures with which to test your functions.
1. Subsetting with a logical vector keeps all values corresponding to a
`TRUE` value. This is most often useful in conjunction with a function
that creates a logical vector.
```{r, eval = FALSE}
# All non-missing values of x
x[!is.na(x)]
# All even values of x
x[x %% 2 == 0]
```
1. If you have a named vector, you can subset it with a character vector.
```{r}
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
```
Like with positive integers, you can also use a character vector to
duplicate individual entries.
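The logical subsetting examples above are not evaluated, so here is a runnable version with a made-up numeric vector that includes missing values. It also shows one wrinkle: a missing value in the logical index produces a missing value in the output.

```{r}
x <- c(10, 3, NA, 5, 8, 1, NA)

# All non-missing values of x
x[!is.na(x)]

# All even values of x; NA %% 2 == 0 is NA, so the missing values come along
x[x %% 2 == 0]

# which() keeps only the positions where the condition is TRUE
x[which(x %% 2 == 0)]
```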
I'd recommend reading <http://adv-r.had.co.nz/Subsetting.html#applications> to learn more about how you can use subsetting to achieve various goals. If you are working with data frames, you can typically use a dplyr function to achieve these goals, but the techniques are useful to know about when you are writing your own functions.
There is an important variation of `[` called `[[`. `[[` only ever extracts a single element, and always drops names. It's a good idea to use it whenever you want to make it clear that you're extracting one thing, as in a for loop. The distinction between `[` and `[[` is most important for lists, as we'll see shortly.
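A short sketch of the difference on a named atomic vector. The `tryCatch()` wrapper is only there so the example runs to completion when `[[` is asked for more than one element, and `multi` is a made-up name.

```{r}
x <- c(abc = 1, def = 2, xyz = 5)

x["def"]     # a length-1 vector that keeps its name
x[["def"]]   # the bare value, with the name dropped

# `[[` refuses to return more than one element
multi <- tryCatch(x[[c(1, 2)]], error = function(e) "error")
multi
```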
### Exercises
1. Create functions that take a vector as input and return:
1. The last value. Should you use `[` or `[[`?
1. The elements at even numbered positions.
1. Every element except the last value.
1. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
1. What happens when you subset with a positive integer that's bigger
than the length of the vector? What happens when you subset with a
name that doesn't exist?
## Null
`NULL` is a special object that represents a "generic" vector of length 0. It is often used to represent the absence of a vector (as opposed to `NA`, which is used to represent the absence of a value in a vector). `NULL` is a singleton: there is only ever one of it.
Subsetting `NULL` always gives you `NULL`.
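These properties are easy to check at the console. The following sketch uses only base R:

```{r}
length(NULL)
is.null(c())        # c() with no arguments returns NULL
NULL[1]
NULL[["a"]]

# NULL is an absent vector; NA is a missing value inside a vector
length(NA)
c(1, NULL, 2)       # NULL contributes nothing when combined
```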
## Recursive vectors (lists)
Lists are the data structure R uses for hierarchical objects. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
You create a list with `list()`:
Lists are fundamentally richer than atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with `list()`:
```{r}
x <- list(1, 2, 3)
@ -202,7 +359,7 @@ x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```
I draw them as follows:
I'll draw them as follows:
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-structure.png")
@ -232,8 +389,8 @@ a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a[4])
```
Like subsetting vectors, you can use an integer vector to select by
position, or a character vector to select by name.
Like with vectors, you can subset with a logical, integer, or character
vector.
* `[[` extracts a single component from a list. It removes a level of
hierarchy from the list.
@ -251,7 +408,7 @@ a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
a[["b"]]
```
Or visually:
The distinction between `[` and `[[` is really important for lists, because `[[` drills down into the list while `[` returns a new, smaller list. Compare the code and output above with the visual representation below.
```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-subsetting.png")
@ -294,7 +451,6 @@ knitr::include_graphics("images/pepper-3.jpg")
1. What happens if you subset a data frame as if you're subsetting a list?
What are the key differences between a list and a data frame?
## Augmented vectors
There are four important types of vector that are built on top of atomic vectors: factors, dates, date times, and data frames. I call these augmented vectors, because they are atomic vectors with additional __attributes__. Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
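As a minimal sketch of how `attr()` and `attributes()` fit together, the attribute names below (`greeting`, `farewell`) are made up purely for illustration:

```{r}
x <- 1:3
attr(x, "greeting")               # NULL: not set yet
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)

# A factor is an integer vector plus `levels` and `class` attributes
f <- factor(c("a", "b"))
names(attributes(f))
```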
@ -411,43 +567,3 @@ attributes(df2)
```
These extra components give tibbles the helpful behaviours defined in [tibbles].
## Predicates
| | lgl | int | dbl | chr | list | null |
|------------------|-----|-----|-----|-----|------|------|
| `is_logical()` | x | | | | | |
| `is_integer()` | | x | | | | |
| `is_double()` | | | x | | | |
| `is_numeric()` | | x | x | | | |
| `is_character()` | | | | x | | |
| `is_atomic()` | x | x | x | x | | |
| `is_list()` | | | | | x | |
| `is_vector()` | x | x | x | x | x | |
| `is_null()` | | | | | | x |
Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less surprising:
```{r}
is.atomic(NULL)
is_atomic(NULL)
is.vector(factor("a"))
is_vector(factor("a"))
```
I recommend using these instead of the base functions.
Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.
```{r}
y <- factor(c("a", "b", "c"))
is_integer(y)
is_scalar_integer(y)
is_bare_integer(y)
```
### Exercises
1. Carefully read the documentation of `is.vector()`. What does it actually
test for?

@ -104,6 +104,8 @@ Every for loop has three components:
the work. It's run repeatedly, each time with a different value for `i`.
The first iteration will run `output[[1]] <- median(df[[1]])`,
the second will run `output[[2]] <- median(df[[2]])`, and so on.
If you haven't seen `x[[i]]` before, it extracts the `i`th element from
`x`. You'll learn more about it in [subsetting].
That's all there is to the for loop! Now is a good time to practice creating some basic (and not so basic) for loops using the exercises below. Then we'll move on to some variations of the for loop that help you solve other problems that will crop up in practice.