Restructuring vectors
This commit is contained in:
parent
399aa42a14
commit
3141e6e7dc
|
@ -84,7 +84,7 @@ flights |>
|
||||||
filter(daytime & approx_ontime)
|
filter(daytime & approx_ontime)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Floating point comparison
|
### Floating point comparison {#sec-fp-comparison}
|
||||||
|
|
||||||
Beware of using `==` with numbers.
|
Beware of using `==` with numbers.
|
||||||
For example, it looks like this vector contains the numbers 1 and 2:
|
For example, it looks like this vector contains the numbers 1 and 2:
|
||||||
|
@ -432,8 +432,7 @@ There are two important tools for this: `if_else()` and `case_when()`.
|
||||||
### `if_else()`
|
### `if_else()`
|
||||||
|
|
||||||
If you want to use one value when a condition is true and another value when it's `FALSE`, you can use `dplyr::if_else()`[^logicals-4].
|
If you want to use one value when a condition is true and another value when it's `FALSE`, you can use `dplyr::if_else()`[^logicals-4].
|
||||||
You'll always use the first three arguments of `if_else()`.
|
You'll always use the first three arguments of `if_else()`. The first argument, `condition`, is a logical vector, the second, `true`, gives the output when the condition is true, and the third, `false`, gives the output if the condition is false.
|
||||||
The first argument, `condition`, is a logical vector, the second, `true`, gives the output when the condition is true, and the third, `false`, gives the output if the condition is false.
|
|
||||||
|
|
||||||
[^logicals-4]: dplyr's `if_else()` is very similar to base R's `ifelse()`.
|
[^logicals-4]: dplyr's `if_else()` is very similar to base R's `ifelse()`.
|
||||||
There are two main advantages of `if_else()` over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error if your variables have incompatible types.
|
There are two main advantages of `if_else()` over `ifelse()`: you can choose what should happen to missing values, and `if_else()` is much more likely to give you a meaningful error if your variables have incompatible types.
|
||||||
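As a small illustration of the three arguments described above, here is a sketch that assumes dplyr is installed. The vector `x` and the labels are made up for the example; `missing` is the optional fourth argument that controls what happens to `NA`s.

```r
library(dplyr)

# Hypothetical input: two non-positive numbers, one positive, one missing
x <- c(-3, 0, 2, NA)

# condition = x > 0, true = "positive", false = "not positive";
# missing = "unknown" is used wherever the condition is NA
sign_label <- if_else(x > 0, "positive", "not positive", missing = "unknown")
sign_label
```

Without `missing`, the last element would stay `NA`.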
|
|
436
vectors.qmd
|
@ -8,10 +8,10 @@ source("_common.R")
|
||||||
|
|
||||||
## Introduction
|
## Introduction
|
||||||
|
|
||||||
So far this book has focussed on tibbles and packages that work with them.
|
So far we've talked about individual data types like numbers, strings, factors, tibbles and more.
|
||||||
But as you start to write your own functions, and dig deeper into R, you need to learn about vectors, the objects that underlie tibbles.
|
Now it's time to learn more about how they fit together into a holistic structure.
|
||||||
If you've learned R in a more traditional way, you're probably already familiar with vectors, as most R resources start with vectors and work their way up to tibbles.
|
|
||||||
We think it's better to start with tibbles because they're immediately useful, and then work your way down to the underlying components.
|
In this chapter we'll explore the **vector** data type, the type that underlies pretty much all objects that we use to store data in R.
|
||||||
|
|
||||||
### Prerequisites
|
### Prerequisites
|
||||||
|
|
||||||
|
@ -27,38 +27,55 @@ library(tidyverse)
|
||||||
|
|
||||||
## Vector basics
|
## Vector basics
|
||||||
|
|
||||||
There are two types of vectors:
|
There are two fundamental types of vectors:
|
||||||
|
|
||||||
1. **Atomic** vectors, of which there are six types: **logical**, **integer**, **double**, **character**, **complex**, and **raw**.
|
1. **Atomic** vectors, of which there are six types: **logical**, **integer**, **double**, **character**, **complex**, and **raw**.
|
||||||
Integer and double vectors are collectively known as **numeric** vectors.
|
Integer and double vectors are collectively known as **numeric** vectors.
|
||||||
|
|
||||||
2. **Lists**, which are sometimes called recursive vectors because lists can contain other lists.
|
2. **Lists**, which are sometimes called recursive vectors because lists can contain other lists.
|
||||||
|
|
||||||
The chief difference between atomic vectors and lists is that atomic vectors are **homogeneous**, while lists can be **heterogeneous**.
|
The chief difference between atomic vectors and lists is that atomic vectors are **homogeneous** (every element is the same type), while lists can be **heterogeneous** (every element can be a different type).
|
||||||
|
|
||||||
There's one other related object: `NULL`.
|
There's one other related object: `NULL`.
|
||||||
`NULL` is often used to represent the absence of a vector (as opposed to `NA` which is used to represent the absence of a value in a vector).
|
`NULL` is often used to represent the absence of a vector (as opposed to `NA` which is used to represent the absence of a value in a vector).
|
||||||
`NULL` typically behaves like a vector of length 0.
|
`NULL` typically behaves like a vector of length 0.
|
||||||
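The difference between `NA` (a missing value inside a vector) and `NULL` (no vector at all) can be checked with base R alone. This is only a sketch; the vector `x` is an invented example.

```r
# NA is a placeholder *inside* a vector: it still occupies a position
x <- c(1, NA, 3)
len_x <- length(x)

# NULL has no elements and its own type
len_null <- length(NULL)
type_null <- typeof(NULL)

# Combining with NULL adds nothing, which is why NULL
# behaves like a vector of length 0
combined <- c(NULL, 1:3)

len_x
len_null
type_null
combined
```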
@fig-datatypes summarises the interrelationships.
|
|
||||||
|
@fig-datatypes summarizes the interrelationships.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| label: fig-datatypes
|
#| label: fig-datatypes
|
||||||
#| echo: false
|
#| echo: false
|
||||||
#| out-width: "50%"
|
#| out-width: ~
|
||||||
#| fig-cap: >
|
#| fig-cap: >
|
||||||
#| The hierarchy of R's vector types.
|
#| The hierarchy of R's vector types.
|
||||||
|
#| fig-alt: >
|
||||||
|
#| A diagram that uses nested sets to show how R's vector types
|
||||||
|
#| are related. There are two types at the top level: vectors and
|
||||||
|
#| NULL. Inside vectors there are two types: atomic and list.
|
||||||
|
#| Inside atomic there are three types: logical, numeric, and
|
||||||
|
#| character. Inside numeric there are two types: integer, and
|
||||||
|
#| double.
|
||||||
|
|
||||||
knitr::include_graphics("diagrams/data-structures-overview.png")
|
knitr::include_graphics("diagrams/data-structures.png", dpi = 270)
|
||||||
```
|
```
|
||||||
|
|
||||||
Every vector has two key properties:
|
Every vector has two key properties:
|
||||||
|
|
||||||
1. Its **type**, which you can determine with `typeof()`.
|
1. Its **type**, which is one of logical, integer, double, character or list.
|
||||||
|
You can determine this with `typeof()`.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
typeof(letters)
|
typeof(letters)
|
||||||
typeof(1:10)
|
typeof(1:10)
|
||||||
|
typeof(2.5)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Sometimes you want to do different things based on the type of vector.
|
||||||
|
One option is to use `typeof()`.
|
||||||
|
Another is to use a test function which returns a `TRUE` or `FALSE`.
|
||||||
|
Base R provides many functions like `is.vector()` and `is.atomic()`, but they often return surprising results.
|
||||||
|
Instead, it's safer to use the `is_*` functions provided by purrr, which correspond exactly to @fig-datatypes.
|
||||||
|
|
||||||
2. Its **length**, which you can determine with `length()`.
|
2. Its **length**, which you can determine with `length()`.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -67,51 +84,46 @@ Every vector has two key properties:
|
||||||
```
|
```
|
||||||
|
|
||||||
Vectors can also contain arbitrary additional metadata in the form of attributes.
|
Vectors can also contain arbitrary additional metadata in the form of attributes.
|
||||||
These attributes are used to create **augmented vectors** which build on additional behaviour.
|
These attributes are used to create **S3 vectors** which build on additional behavior.
|
||||||
There are three important types of augmented vector:
|
You've seen three S3 vectors in this book:
|
||||||
|
|
||||||
- Factors are built on top of integer vectors.
|
- Factors (`factor`) are built on top of integer vectors.
|
||||||
- Dates and date-times are built on top of numeric vectors.
|
- Dates (`Date`) are built on top of double vectors.
|
||||||
- Data frames and tibbles are built on top of lists.
|
- Date-times (`POSIXct`) are built on top of double vectors.
|
||||||
|
|
||||||
This chapter will introduce you to these important vectors from simplest to most complicated.
|
You can use S3 to build on top of lists to make things that are fundamentally not vectors, like data frames or linear models.
|
||||||
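One way to see what these S3 vectors are built on is to compare `typeof()` (the underlying type) with `class()` (the S3 class stored as an attribute). The sketch below uses base R only, and the specific values are arbitrary examples.

```r
# A factor is an integer vector with "levels" and "class" attributes
f <- factor(c("a", "b", "a"))
typeof(f)
attributes(f)

# A Date is a double counting days since 1970-01-01
d <- as.Date("2020-01-01")
typeof(d)
class(d)
unclass(d)

# A POSIXct date-time is a double counting seconds since 1970-01-01 UTC
dt <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
typeof(dt)
class(dt)

# A data frame is built on a list
df <- data.frame(x = 1:2)
typeof(df)
```

`unclass()` strips the class attribute so you can see the underlying number.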
You'll start with atomic vectors, then build up to lists, and finish off with augmented vectors.
|
|
||||||
|
|
||||||
## Important types of atomic vector
|
### Exercises
|
||||||
|
|
||||||
|
1. Carefully read the documentation of `is.vector()`. What does it actually test for? Why does `is.atomic()` not agree with the definition of atomic vectors above?
|
||||||
|
|
||||||
|
## Atomic vectors
|
||||||
|
|
||||||
The four most important types of atomic vector are logical, integer, double, and character.
|
The four most important types of atomic vector are logical, integer, double, and character.
|
||||||
Raw and complex are rarely used during a data analysis, so we won't discuss them here.
|
Raw and complex are rarely used during a data analysis, so we won't discuss them here.
|
||||||
|
The difference between integer and double is rarely important for data science, so we lump them together into numeric.
|
||||||
|
|
||||||
### Logical
|
### Logical
|
||||||
|
|
||||||
Logical vectors are the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`.
|
Logical vectors are the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`.
|
||||||
Logical vectors are usually constructed with comparison operators, as described in \[comparisons\].
|
Logical vectors are usually constructed with comparison operators, as described in @sec-logicals.
|
||||||
You can also create them by hand with `c()`:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
1:10 %% 3 == 0
|
|
||||||
|
|
||||||
c(TRUE, TRUE, FALSE, NA)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Numeric
|
### Numeric
|
||||||
|
|
||||||
Integer and double vectors are known collectively as numeric vectors.
|
Integer and double vectors are known collectively as numeric vectors and were the topic of @sec-numbers.
|
||||||
In R, numbers are doubles by default.
|
In R, numbers are doubles by default.
|
||||||
To make an integer, place an `L` after the number:
|
To make an integer, place an `L` after the number:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
typeof(1)
|
typeof(1)
|
||||||
typeof(1L)
|
typeof(1L)
|
||||||
1.5L
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The distinction between integers and doubles is not usually important, but there are two important differences that you should be aware of:
|
The distinction between integers and doubles is not usually important in R, but there are two important differences that you should be aware of:
|
||||||
|
|
||||||
1. Doubles are approximations.
|
1. Doubles are approximations, as we discussed in @sec-fp-comparison.
|
||||||
Doubles represent floating point numbers that can not always be precisely represented with a fixed amount of memory.
|
Doubles represent floating point numbers that can not always be precisely represented with a fixed amount of memory.
|
||||||
This means that you should consider all doubles to be approximations.
|
For example, the square of the square root of two is not two:
|
||||||
For example, what is square of the square root of two?
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x <- sqrt(2) ^ 2
|
x <- sqrt(2) ^ 2
|
||||||
|
@ -119,9 +131,6 @@ The distinction between integers and doubles is not usually important, but there
|
||||||
x - 2
|
x - 2
|
||||||
```
|
```
|
||||||
|
|
||||||
This behaviour is common when working with floating point numbers: most calculations include some approximation error.
|
|
||||||
Instead of comparing floating point numbers using `==`, you should use `dplyr::near()` which allows for some numerical tolerance.
|
|
||||||
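A tolerance-based comparison can be sketched in base R as follows. The tolerance used here, `sqrt(.Machine$double.eps)`, is an assumption chosen for illustration; `dplyr::near()` works in the same spirit and lets you set the tolerance through its `tol` argument.

```r
x <- sqrt(2)^2

# Exact comparison fails because of floating point rounding error
exact <- x == 2

# Compare within a small tolerance instead (assumed tolerance)
tol <- sqrt(.Machine$double.eps)
close_enough <- abs(x - 2) < tol

exact
close_enough
```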
|
|
||||||
2. Integers have one special value: `NA`, while doubles have four: `NA`, `NaN`, `Inf` and `-Inf`.
|
2. Integers have one special value: `NA`, while doubles have four: `NA`, `NaN`, `Inf` and `-Inf`.
|
||||||
All three special values `NaN`, `Inf` and `-Inf` can arise during division:
|
All three special values `NaN`, `Inf` and `-Inf` can arise during division:
|
||||||
|
|
||||||
|
@ -130,24 +139,16 @@ The distinction between integers and doubles is not usually important, but there
|
||||||
```
|
```
|
||||||
|
|
||||||
Avoid using `==` to check for these other special values.
|
Avoid using `==` to check for these other special values.
|
||||||
Instead use the helper functions `is.finite()`, `is.infinite()`, and `is.nan()`:
|
Instead use the helper functions `is.finite()`, `is.infinite()`, and `is.nan()`.
|
||||||
|
|
||||||
| | 0 | Inf | NA | NaN |
|
|
||||||
|-----------------|-----|-----|-----|-----|
|
|
||||||
| `is.finite()` | x | | | |
|
|
||||||
| `is.infinite()` | | x | | |
|
|
||||||
| `is.na()` | | | x | x |
|
|
||||||
| `is.nan()` | | | | x |
|
|
||||||
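The behavior of these helpers can be checked directly on a vector that holds one value of each kind. This is a base R sketch; the vector `v` is an invented example.

```r
# One ordinary number and each special double value
v <- c(0, Inf, NA, NaN)

finite   <- is.finite(v)    # only ordinary numbers
infinite <- is.infinite(v)  # only Inf and -Inf
missing  <- is.na(v)        # NA and NaN both count as missing
not_num  <- is.nan(v)       # only NaN

rbind(finite, infinite, missing, not_num)
```

Note that `is.na()` is `TRUE` for `NaN` as well, while `is.nan()` is `FALSE` for `NA`.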
|
|
||||||
### Character
|
### Character
|
||||||
|
|
||||||
Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.
|
Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.
|
||||||
|
You already learned many practical tools for working with character vectors in @sec-strings.
|
||||||
You've already learned a lot about working with strings in \[strings\].
|
|
||||||
Here we wanted to mention one important feature of the underlying string implementation: R uses a global string pool.
|
Here we wanted to mention one important feature of the underlying string implementation: R uses a global string pool.
|
||||||
This means that each unique string is only stored in memory once, and every use of the string points to that representation.
|
This means that each unique string is only stored in memory once, and every use of the string points to that representation.
|
||||||
This reduces the amount of memory needed by duplicated strings.
|
This reduces the amount of memory needed by duplicated strings.
|
||||||
You can see this behaviour in practice with `lobstr::obj_size()`:
|
You can see this behavior in practice with `lobstr::obj_size()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x <- "This is a reasonably long string."
|
x <- "This is a reasonably long string."
|
||||||
|
@ -171,41 +172,7 @@ NA_real_ # double
|
||||||
NA_character_ # character
|
NA_character_ # character
|
||||||
```
|
```
|
||||||
|
|
||||||
Normally you don't need to know about these different types because you can always use `NA` and it will be converted to the correct type using the implicit coercion rules described next.
|
This is usually unimportant because `NA` will almost always be automatically converted to the correct type.
|
||||||
However, there are some functions that are strict about their inputs, so it's useful to have this knowledge sitting in your back pocket so you can be specific when needed.
|
|
||||||
|
|
||||||
### Exercises
|
|
||||||
|
|
||||||
1. Describe the difference between `is.finite(x)` and `!is.infinite(x)`.
|
|
||||||
|
|
||||||
2. Read the source code for `dplyr::near()` (Hint: to see the source code, drop the `()`).
|
|
||||||
How does it work?
|
|
||||||
|
|
||||||
3. A logical vector can take 3 possible values.
|
|
||||||
How many possible values can an integer vector take?
|
|
||||||
How many possible values can a double take?
|
|
||||||
Use Google to do some research.
|
|
||||||
|
|
||||||
4. Brainstorm at least four functions that allow you to convert a double to an integer.
|
|
||||||
How do they differ?
|
|
||||||
Be precise.
|
|
||||||
|
|
||||||
5. What functions from the readr package allow you to turn a string into logical, integer, and double vector?
|
|
||||||
|
|
||||||
## Using atomic vectors
|
|
||||||
|
|
||||||
Now that you understand the different types of atomic vector, it's useful to review some of the important tools for working with them.
|
|
||||||
These include:
|
|
||||||
|
|
||||||
1. How to convert from one type to another, and when that happens automatically.
|
|
||||||
|
|
||||||
2. How to tell if an object is a specific type of vector.
|
|
||||||
|
|
||||||
3. What happens when you work with vectors of different lengths.
|
|
||||||
|
|
||||||
4. How to name the elements of a vector.
|
|
||||||
|
|
||||||
5. How to pull out elements of interest.
|
|
||||||
|
|
||||||
### Coercion
|
### Coercion
|
||||||
|
|
||||||
|
@ -231,20 +198,6 @@ sum(y) # how many are greater than 10?
|
||||||
mean(y) # what proportion are greater than 10?
|
mean(y) # what proportion are greater than 10?
|
||||||
```
|
```
|
||||||
|
|
||||||
You may see some code (typically older) that relies on implicit coercion in the opposite direction, from integer to logical:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| eval: false
|
|
||||||
|
|
||||||
if (length(x)) {
|
|
||||||
# do something
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
In this case, 0 is converted to `FALSE` and everything else is converted to `TRUE`.
|
|
||||||
We think this makes it harder to understand your code, and we don't recommend it.
|
|
||||||
Instead be explicit: `length(x) > 0`.
|
|
||||||
|
|
||||||
It's also important to understand what happens when you try and create a vector containing multiple types with `c()`: the most complex type always wins.
|
It's also important to understand what happens when you try and create a vector containing multiple types with `c()`: the most complex type always wins.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -254,93 +207,122 @@ typeof(c(1.5, "a"))
|
||||||
```
|
```
|
||||||
|
|
||||||
An atomic vector can not have a mix of different types because the type is a property of the complete vector, not the individual elements.
|
An atomic vector can not have a mix of different types because the type is a property of the complete vector, not the individual elements.
|
||||||
If you need to mix multiple types in the same vector, you should use a list, which you'll learn about shortly.
|
If you need to mix multiple types in the same vector, you should use a list.
|
||||||
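To make the "most complex type wins" rule concrete, the sketch below combines progressively more complex types with `c()` and then stores the same values in a list, where each element keeps its own type. The values are arbitrary.

```r
# c() coerces everything to the most complex type present
t1 <- typeof(c(TRUE, 1L))   # logical + integer
t2 <- typeof(c(1L, 1.5))    # integer + double
t3 <- typeof(c(1.5, "a"))   # double + character

# A list keeps each element's own type
mixed <- list(TRUE, 1L, 1.5, "a")
types <- vapply(mixed, typeof, character(1))

c(t1, t2, t3)
types
```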
|
|
||||||
### Test functions
|
### Exercises
|
||||||
|
|
||||||
Sometimes you want to do different things based on the type of vector.
|
1. Describe the difference between `is.finite(x)` and `!is.infinite(x)`.
|
||||||
One option is to use `typeof()`.
|
|
||||||
Another is to use a test function which returns a `TRUE` or `FALSE`.
|
|
||||||
Base R provides many functions like `is.vector()` and `is.atomic()`, but they often return surprising results.
|
|
||||||
Instead, it's safer to use the `is_*` functions provided by purrr, which are summarised in the table below.
|
|
||||||
|
|
||||||
| | lgl | int | dbl | chr | list |
|
2. Read the source code for `dplyr::near()` (Hint: to see the source code, drop the `()`).
|
||||||
|------------------|-----|-----|-----|-----|------|
|
How does it work?
|
||||||
| `is_logical()` | x | | | | |
|
|
||||||
| `is_integer()` | | x | | | |
|
|
||||||
| `is_double()` | | | x | | |
|
|
||||||
| `is_numeric()` | | x | x | | |
|
|
||||||
| `is_character()` | | | | x | |
|
|
||||||
| `is_atomic()` | x | x | x | x | |
|
|
||||||
| `is_list()` | | | | | x |
|
|
||||||
| `is_vector()` | x | x | x | x | x |
|
|
||||||
|
|
||||||
### Scalars and recycling rules {#sec-scalars-and-recycling-rules}
|
3. A logical vector can take 3 possible values.
|
||||||
|
How many possible values can an integer vector take?
|
||||||
|
How many possible values can a double take?
|
||||||
|
Use Google to do some research.
|
||||||
|
|
||||||
As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors.
|
4. Brainstorm at least four functions that allow you to convert a double to an integer.
|
||||||
This is called vector **recycling**, because the shorter vector is repeated, or recycled, to the same length as the longer vector.
|
How do they differ?
|
||||||
|
Be precise.
|
||||||
|
|
||||||
This is generally most useful when you are mixing vectors and "scalars".
|
5. What functions from the readr package allow you to turn a string into logical, integer, and double vector?
|
||||||
We put scalars in quotes because R doesn't actually have scalars: instead, a single number is a vector of length 1.
|
|
||||||
Because there are no scalars, most built-in functions are **vectorised**, meaning that they will operate on a vector of numbers.
|
6. Compare and contrast `setNames()` with `purrr::set_names()`.
|
||||||
That's why, for example, this code works:
|
|
||||||
|
## Lists {#sec-lists}
|
||||||
|
|
||||||
|
Lists are a step up in complexity from atomic vectors, because lists can contain other lists.
|
||||||
|
This makes them suitable for representing hierarchical or tree-like structures, as you saw in @sec-rectangling.
|
||||||
|
You create a list with `list()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
sample(10) + 100
|
x <- list(1, 2, 3)
|
||||||
runif(10) > 0.5
|
x
|
||||||
```
|
```
|
||||||
|
|
||||||
In R, basic mathematical operations work with vectors.
|
A very useful tool for working with lists is `str()` because it focuses on the **str**ucture, not the contents.
|
||||||
That means that you should never need to perform explicit iteration when performing simple mathematical computations.
|
|
||||||
|
|
||||||
It's intuitive what should happen if you add two vectors of the same length, or a vector and a "scalar", but what happens if you add two vectors of different lengths?
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
1:10 + 1:2
|
str(x)
|
||||||
|
|
||||||
|
x_named <- list(a = 1, b = 2, c = 3)
|
||||||
|
str(x_named)
|
||||||
```
|
```
|
||||||
|
|
||||||
Here, R will expand the shortest vector to the same length as the longest, so called recycling.
|
Unlike atomic vectors, `list()` can contain a mix of objects:
|
||||||
This is silent except when the length of the longer is not an integer multiple of the length of the shorter:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
1:10 + 1:3
|
y <- list("a", 1L, 1.5, TRUE)
|
||||||
|
str(y)
|
||||||
```
|
```
|
||||||
|
|
||||||
While vector recycling can be used to create very succinct, clever code, it can also silently conceal problems.
|
Lists can even contain other lists!
|
||||||
For this reason, the vectorised functions in tidyverse will throw errors when you recycle anything other than a scalar.
|
|
||||||
If you do want to recycle, you'll need to do it yourself with `rep()`:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
#| error: true
|
z <- list(list(1, 2), list(3, 4))
|
||||||
|
str(z)
|
||||||
tibble(x = 1:4, y = 1:2)
|
|
||||||
|
|
||||||
tibble(x = 1:4, y = rep(1:2, 2))
|
|
||||||
|
|
||||||
tibble(x = 1:4, y = rep(1:2, each = 2))
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Naming vectors
|
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists.
|
||||||
|
For example, take these three lists:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
x1 <- list(c(1, 2), c(3, 4))
|
||||||
|
x2 <- list(list(1, 2), list(3, 4))
|
||||||
|
x3 <- list(1, list(2, list(3)))
|
||||||
|
```
|
||||||
|
|
||||||
|
We'll draw them as follows:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
#| echo: false
|
||||||
|
#| out-width: "75%"
|
||||||
|
|
||||||
|
knitr::include_graphics("diagrams/lists-structure.png")
|
||||||
|
```
|
||||||
|
|
||||||
|
There are three principles:
|
||||||
|
|
||||||
|
1. Lists have rounded corners.
|
||||||
|
Atomic vectors have square corners.
|
||||||
|
|
||||||
|
2. Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.
|
||||||
|
|
||||||
|
3. The orientation of the children (i.e. rows or columns) isn't important, so we'll pick a row or column orientation to either save space or illustrate an important property in the example.
|
||||||
|
|
||||||
|
### Names
|
||||||
|
|
||||||
All types of vectors can be named.
|
All types of vectors can be named.
|
||||||
You can name them during creation with `c()`:
|
But names are particularly useful for lists.
|
||||||
|
You can name them during creation with `list()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
c(x = 1, y = 2, z = 4)
|
list(x = 1, y = 2, z = 4)
|
||||||
```
|
```
|
||||||
|
|
||||||
Or after the fact with `purrr::set_names()`:
|
Or after the fact with `purrr::set_names()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
set_names(1:3, c("a", "b", "c"))
|
set_names(list(1, 2, 3), c("a", "b", "c"))
|
||||||
```
|
```
|
||||||
|
|
||||||
Named vectors are most useful for subsetting, described next.
|
Named vectors are most useful for subsetting, described next.
|
||||||
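As a brief sketch of why names help, the example below names a list and then pulls elements out by name. It uses base R's `setNames()` so that it runs without purrr; treat it as a stand-in for `purrr::set_names()` in this simple case rather than an exact equivalent.

```r
# Name during creation
x <- list(x = 1, y = 2, z = 4)
names(x)

# Name after the fact (base R counterpart used here as an assumption)
y <- setNames(list(1, 2, 3), c("a", "b", "c"))

# Names then let you pull out elements without knowing their position
one_element <- y[["b"]]
sub_list <- y[c("a", "c")]

one_element
str(sub_list)
```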
|
|
||||||
### Subsetting {#sec-vector-subsetting}
|
### Exercises
|
||||||
|
|
||||||
|
1. Draw the following lists as nested sets:
|
||||||
|
|
||||||
|
a. `list(a, b, list(c, d), list(e, f))`
|
||||||
|
b. `list(list(list(list(list(list(a))))))`
|
||||||
|
|
||||||
|
## Subsetting {#sec-vector-subsetting}
|
||||||
|
|
||||||
|
There are three subsetting tools in base R: `[`, `[[`, and `$`.
|
||||||
|
We'll see how they apply to atomic vectors and lists.
|
||||||
|
And then how they combine to provide an alternative to `filter()` and `select()` for working with data frames.
|
||||||
|
|
||||||
|
### Atomic vectors
|
||||||
|
|
||||||
So far we've used `dplyr::filter()` to filter the rows in a tibble.
|
|
||||||
`filter()` only works with tibble, so we'll need a new tool for vectors: `[`.
|
|
||||||
`[` is the subsetting function, and is called like `x[a]`.
|
`[` is the subsetting function, and is called like `x[a]`.
|
||||||
There are four types of things that you can subset a vector with:
|
There are four types of things that you can subset a vector with:
|
||||||
|
|
||||||
|
@ -415,93 +397,7 @@ There is an important variation of `[` called `[[`.
|
||||||
It's a good idea to use it whenever you want to make it clear that you're extracting a single item, as in a for loop.
|
It's a good idea to use it whenever you want to make it clear that you're extracting a single item, as in a for loop.
|
||||||
The distinction between `[` and `[[` is most important for lists, as we'll see shortly.
|
The distinction between `[` and `[[` is most important for lists, as we'll see shortly.
|
||||||
|
|
||||||
### Exercises
|
### Lists
|
||||||
|
|
||||||
1. What does `mean(is.na(x))` tell you about a vector `x`?
|
|
||||||
What about `sum(!is.finite(x))`?
|
|
||||||
|
|
||||||
2. Carefully read the documentation of `is.vector()`.
|
|
||||||
What does it actually test for?
|
|
||||||
Why does `is.atomic()` not agree with the definition of atomic vectors above?
|
|
||||||
|
|
||||||
3. Compare and contrast `setNames()` with `purrr::set_names()`.
|
|
||||||
|
|
||||||
4. Create functions that take a vector as input and return:
|
|
||||||
|
|
||||||
a. The last value. Should you use `[` or `[[`?
|
|
||||||
b. The elements at even numbered positions.
|
|
||||||
c. Every element except the last value.
|
|
||||||
d. Only even numbers (and no missing values).
|
|
||||||
|
|
||||||
5. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
|
|
||||||
|
|
||||||
6. What happens when you subset with a positive integer that's bigger than the length of the vector?
|
|
||||||
What happens when you subset with a name that doesn't exist?
|
|
||||||
|
|
||||||
## Recursive vectors (lists) {#sec-lists}
|
|
||||||
|
|
||||||
Lists are a step up in complexity from atomic vectors, because lists can contain other lists.
|
|
||||||
This makes them suitable for representing hierarchical or tree-like structures.
|
|
||||||
You create a list with `list()`:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
x <- list(1, 2, 3)
|
|
||||||
x
|
|
||||||
```
|
|
||||||
|
|
||||||
A very useful tool for working with lists is `str()` because it focusses on the **str**ucture, not the contents.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
str(x)
|
|
||||||
|
|
||||||
x_named <- list(a = 1, b = 2, c = 3)
|
|
||||||
str(x_named)
|
|
||||||
```
|
|
||||||
|
|
||||||
Unlike atomic vectors, `list()` can contain a mix of objects:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
y <- list("a", 1L, 1.5, TRUE)
|
|
||||||
str(y)
|
|
||||||
```
|
|
||||||
|
|
||||||
Lists can even contain other lists!
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
z <- list(list(1, 2), list(3, 4))
|
|
||||||
str(z)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Visualising lists
|
|
||||||
|
|
||||||
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists.
|
|
||||||
For example, take these three lists:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
x1 <- list(c(1, 2), c(3, 4))
|
|
||||||
x2 <- list(list(1, 2), list(3, 4))
|
|
||||||
x3 <- list(1, list(2, list(3)))
|
|
||||||
```
|
|
||||||
|
|
||||||
We'll draw them as follows:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
#| echo: false
|
|
||||||
#| out-width: "75%"
|
|
||||||
|
|
||||||
knitr::include_graphics("diagrams/lists-structure.png")
|
|
||||||
```
|
|
||||||
|
|
||||||
There are three principles:
|
|
||||||
|
|
||||||
1. Lists have rounded corners.
|
|
||||||
Atomic vectors have square corners.
|
|
||||||
|
|
||||||
2. Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.
|
|
||||||
|
|
||||||
3. The orientation of the children (i.e. rows or columns) isn't important, so we'll pick a row or column orientation to either save space or illustrate an important property in the example.
|
|
||||||
|
|
||||||
### Subsetting
|
|
||||||
|
|
||||||
There are three ways to subset a list, which we'll illustrate with a list named `a`:
|
There are three ways to subset a list, which we'll illustrate with a list named `a`:
|
||||||
|
|
||||||
|
@ -548,59 +444,70 @@ Compare the code and output above with the visual representation in @fig-lists-s
|
||||||
knitr::include_graphics("diagrams/lists-subsetting.png")
|
knitr::include_graphics("diagrams/lists-subsetting.png")
|
||||||
```
|
```
|
||||||
|
|
||||||
### Lists of condiments
|
|
||||||
|
|
||||||
The difference between `[` and `[[` is very important, but it's easy to get confused.
|
The difference between `[` and `[[` is very important, but it's easy to get confused.
|
||||||
To help you remember, let me show you an unusual pepper shaker.
|
To help you remember, let me show you an unusual pepper shaker in @fig-pepper-1. If this pepper shaker is your list `pepper`, then `pepper[1]` is a pepper shaker containing a single pepper packet, as in @fig-pepper-2. `pepper[2]` would look the same, but would contain the second packet.
|
||||||
|
`pepper[1:2]` would be a pepper shaker containing two pepper packets.
|
||||||
|
`pepper[[1]]` would extract the pepper packet itself, as in @fig-pepper-3.
|
||||||
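The same analogy can be written out in code. The list `pepper` below is a hypothetical stand-in for the shaker, with each packet represented by a string.

```r
# A hypothetical shaker holding three packets
pepper <- list("packet 1", "packet 2", "packet 3")

# `[` always returns a smaller list (a shaker)
one_in_shaker <- pepper[1]
two_in_shaker <- pepper[1:2]

# `[[` returns the element itself (a packet)
packet <- pepper[[1]]

str(one_in_shaker)
str(two_in_shaker)
str(packet)
```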
|
|
||||||
```{r}
|
```{r}
|
||||||
|
#| label: fig-pepper-1
|
||||||
#| echo: false
|
#| echo: false
|
||||||
#| out-width: "25%"
|
#| out-width: "25%"
|
||||||
|
#| fig-cap: A pepper shaker that Hadley once found in his hotel room.
|
||||||
|
#| fig-alt: >
|
||||||
|
#| A photo of a glass pepper shaker. Instead of the pepper shaker
|
||||||
|
#| containing pepper, it contains many packets of pepper.
|
||||||
|
|
||||||
knitr::include_graphics("images/pepper.jpg")
|
knitr::include_graphics("images/pepper.jpg")
|
||||||
```
|
```
|
||||||
|
|
||||||
If this pepper shaker is your list `x`, then, `x[1]` is a pepper shaker containing a single pepper packet:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
#| label: fig-pepper-2
|
||||||
#| echo: false
|
#| echo: false
|
||||||
#| out-width: "25%"
|
#| out-width: "25%"
|
||||||
|
#| fig-cap: >
|
||||||
|
#| `pepper[1]`
|
||||||
|
#| fig-alt: >
|
||||||
|
#| A photo of the glass pepper shaker containing just one packet of
|
||||||
|
#| pepper.
|
||||||
|
|
||||||
knitr::include_graphics("images/pepper-1.jpg")
|
knitr::include_graphics("images/pepper-1.jpg")
|
||||||
```
|
```
|
||||||
|
|
||||||
`x[2]` would look the same, but would contain the second packet.
|
|
||||||
`x[1:2]` would be a pepper shaker containing two pepper packets.
|
|
||||||
|
|
||||||
`x[[1]]` is:
|
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
#| label: fig-pepper-3
|
||||||
#| echo: false
|
#| echo: false
|
||||||
#| out-width: "25%"
|
#| out-width: "25%"
|
||||||
|
#| fig-cap: >
|
||||||
|
#| `pepper[[1]]`
|
||||||
|
#| fig-alt: A single packet of pepper.
|
||||||
|
|
||||||
knitr::include_graphics("images/pepper-2.jpg")
|
knitr::include_graphics("images/pepper-2.jpg")
|
||||||
```
|
```
|
||||||
|
|
||||||
If you wanted to get the content of the pepper packet, you'd need `x[[1]][[1]]`:
|
### Data frames
|
||||||
|
|
||||||
```{r}
|
1d subsetting of a data frame behaves like subsetting a list.
|
||||||
#| echo: false
|
2d subsetting behaves like a combination of subsetting rows and columns.
|
||||||
#| out-width: "25%"
|
|
||||||
|
|
||||||
knitr::include_graphics("images/pepper-3.jpg")
|
|
||||||
```
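
The two styles of data frame subsetting might look like this in practice. This is a sketch using a small hypothetical base R data frame named `df`; a tibble would behave the same way for the calls shown here.

```r
# A small hypothetical data frame used only for illustration.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# 1d subsetting treats the data frame as a list of columns.
col_df  <- df["x"]    # a data frame with the single column x
col_vec <- df[["x"]]  # the column itself: an integer vector

# 2d subsetting selects rows and columns together: df[rows, cols].
sub <- df[df$x > 1, c("x", "y")]
sub
```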
|
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. Draw the following lists as nested sets:
|
4. Create functions that take a vector as input and return:
|
||||||
|
|
||||||
a. `list(a, b, list(c, d), list(e, f))`
|
a. The last value. Should you use `[` or `[[`?
|
||||||
b. `list(list(list(list(list(list(a))))))`
|
b. The elements at even numbered positions.
|
||||||
|
c. Every element except the last value.
|
||||||
|
d. Only even numbers (and no missing values).
|
||||||
|
|
||||||
2. What happens if you subset a tibble as if you're subsetting a list?
|
5. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
|
||||||
|
|
||||||
|
6. What happens when you subset with a positive integer that's bigger than the length of the vector?
|
||||||
|
What happens when you subset with a name that doesn't exist?
|
||||||
|
|
||||||
|
7. What happens if you subset a tibble as if you're subsetting a list?
|
||||||
What are the key differences between a list and a tibble?
|
What are the key differences between a list and a tibble?
|
||||||
|
|
||||||
## Attributes
|
## Attributes and S3 vectors
|
||||||
|
|
||||||
Any vector can contain arbitrary additional metadata through its **attributes**.
|
Any vector can contain arbitrary additional metadata through its **attributes**.
|
||||||
You can think of attributes as a named list of vectors that can be attached to any object.
|
You can think of attributes as a named list of vectors that can be attached to any object.
|
||||||
|
@ -621,6 +528,9 @@ There are three very important attributes that are used to implement fundamental
|
||||||
3. **Class** is used to implement the S3 object oriented system.
|
3. **Class** is used to implement the S3 object oriented system.
|
||||||
|
|
||||||
You've seen names above, and we won't cover dimensions because we don't use matrices in this book.
|
You've seen names above, and we won't cover dimensions because we don't use matrices in this book.
|
||||||
|
|
||||||
|
### Class
|
||||||
|
|
||||||
It remains to describe the class, which controls how **generic functions** work.
|
It remains to describe the class, which controls how **generic functions** work.
|
||||||
Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input.
|
Generic functions are key to object oriented programming in R, because they make functions behave differently for different classes of input.
|
||||||
A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it in *Advanced R* at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
|
A detailed discussion of object oriented programming is beyond the scope of this book, but you can read more about it in *Advanced R* at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
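
To see how the class attribute changes what a generic does, one option is to compare an object with and without its class. This is a sketch using a base R date; the variable name `d` is just an illustration.

```r
# A date is a double vector with a class attribute.
d <- as.Date("1971-01-01")
typeof(d)      # "double"
attributes(d)  # the only attribute is class, set to "Date"

# Removing the class exposes the underlying number of days since 1970-01-01,
# so print() falls back to its default method for numbers.
unclass(d)     # 365
```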
|
||||||
|
@ -651,20 +561,6 @@ getS3method("as.Date", "numeric")
|
||||||
The most important S3 generic is `print()`: it controls how the object is printed when you type its name at the console.
|
The most important S3 generic is `print()`: it controls how the object is printed when you type its name at the console.
|
||||||
Other important generics are the subsetting functions `[`, `[[`, and `$`.
|
Other important generics are the subsetting functions `[`, `[[`, and `$`.
|
||||||
|
|
||||||
## Augmented vectors
|
|
||||||
|
|
||||||
Atomic vectors and lists are the building blocks for other important vector types like factors and dates.
|
|
||||||
We call these **augmented vectors**, because they are vectors with additional **attributes**, including class.
|
|
||||||
Because augmented vectors have a class, they behave differently to the atomic vector on which they are built.
|
|
||||||
In this book, we make use of four important augmented vectors:
|
|
||||||
|
|
||||||
- Factors
|
|
||||||
- Dates
|
|
||||||
- Date-times
|
|
||||||
- Tibbles
|
|
||||||
|
|
||||||
These are described below.
|
|
||||||
|
|
||||||
### Factors
|
### Factors
|
||||||
|
|
||||||
Factors are designed to represent categorical data that can take a fixed set of possible values.
|
Factors are designed to represent categorical data that can take a fixed set of possible values.
|
||||||
|
@ -724,6 +620,8 @@ They do crop up in base R, because they are needed to extract specific component
|
||||||
Since lubridate provides helpers for you to do this instead, you don't need them.
|
Since lubridate provides helpers for you to do this instead, you don't need them.
|
||||||
POSIXct values are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a regular date-time with `lubridate::as_datetime()`.
|
POSIXct values are always easier to work with, so if you find you have a POSIXlt, you should always convert it to a regular date-time with `lubridate::as_datetime()`.
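
The difference between the two representations is visible in their underlying types. The sketch below uses base R's `as.POSIXct()` rather than the `lubridate::as_datetime()` recommended above, purely so that it runs without extra packages; the variable names are hypothetical.

```r
# A POSIXlt is built on a list of date-time components.
lt <- as.POSIXlt("2020-01-01 12:00:00", tz = "UTC")
typeof(lt)  # "list"

# Converting gives a POSIXct, which is built on a double vector.
ct <- as.POSIXct(lt)
typeof(ct)  # "double"
class(ct)
```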
|
||||||
|
|
||||||
|
## Other types
|
||||||
|
|
||||||
### Tibbles
|
### Tibbles
|
||||||
|
|
||||||
Tibbles are augmented lists: they have class "tbl_df" + "tbl" + "data.frame", and `names` (column) and `row.names` attributes:
|
Tibbles are augmented lists: they have class "tbl_df" + "tbl" + "data.frame", and `names` (column) and `row.names` attributes:
|
||||||
|
|