r4ds/data-structures.Rmd

# Data structures

```{r, include = FALSE}
library(purrr)
```

As you start to write more functions, and as you want your functions to work with more types of inputs, it's useful to have some grounding in the underlying data structures that R is built on.  This chapter will dive deeper into the objects that you've already used, helping you better understand how things work.

The most important class of objects in R is the __vector__. Every vector has two key properties:

1. Its type, whether it's logical, numeric, character, and so on. You 
   can determine the type of any R object with `typeof()`.

2. Its length, which you can retrieve with `length()`.

Vectors are broken down into __atomic__ vectors, and __lists__. I call factors, dates, and date times __molecular vectors__ because they're built on top of atomic vectors. Data frames are similar, they're built on top of lists.

Note that R does not have "scalars". In R, a single number is a vector of length 1. The impacts of this are mostly on how functions work. Because there are no scalars, most built-in functions are vectorised, meaning that they will operate on a vector of numbers. That's why, for example, you can write `1:10 + 10:1`.

## Atomic vectors

There are four important types of atomic vector:

* logical
* integer
* double
* character

Collectively, integer and double vectors are known as numeric vectors. Most of the time the distinction between integers and doubles is not important in R, so we'll discuss them together.

(There are also two rarer atomic vectors: raw and complex. They're beyond the scope of this book because they are rarely needed to do data analysis)

### Logical

Logical vectors are simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`. Logical vectors are usually constructed with comparison operators, as described in [comparisons].

In numeric contexts, `TRUE` is converted to `1`, `FALSE` converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues.

```{r}
x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y)
mean(y)
```

### Numeric

Numeric vectors encompasses both integers and doubles (real numbers). For large data, there is some small advantage to using the integer data type if you really have integers, but in most cases the differences are immaterial. In R, numbers are doubles by default. To make an integer, use a `L` after the number:

```{r}
typeof(1)
typeof(1L)
```

There are two cases where you need to be aware of the differences between doubles and integers. Firstly, never test for equality on a double. There can be very small differences that don't print out by default. These differences arise because a double is represented using a fixed number of (binary) digits. For example, what should you get if you square the square-root of two?

```{r}
x <- sqrt(2) ^ 2
x
```

It certainly looks like we get what we expect: `2`. But things are not exactly as they seem:

```{r}
x == 2
x - 2
```

The number we've computed is actually slightly different to 2.  To avoid this sort of comparison difficulty, you can use the `near()` function from dplyr (available in 0.5).

```{r, eval = packageVersion("dplyr") >= "0.4.3.9000"}
dplyr::near(x, 2)
```

The other important thing to know about doubles is that they have three special values in addition to `NA`:

```{r}
c(-1, 0, 1) / 0
```

Like with missing values, you should avoid using `==` to check for these other special values. Instead use `is.finite()`, `is.infinite()`, and `is.nan()`: 

|                  |  0  | Inf | NA  | NaN |
|------------------|-----|-----|-----|-----|
| `is.finite()`    |  x  |     |     |     |
| `is.infinite()`  |     |  x  |     |     |
| `is.na()`        |     |     |  x  |  x  |
| `is.nan()`       |     |     |     |  x  |

### Character

Each element of a character vector is a string. 

```{r}
x <- c("abc", "def", "ghijklmnopqrs")
typeof(x)
```

You learned how to manipulate these vectors in [strings].

## Subsetting


## Augmented vectors

There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these augmented vectors, because they are atomic vectors with additional __attributes__. Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector.  You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.

```{r}
x <- 1:10
attr(x, "greeting")
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)
```

There are three very important attributes that are used to implement fundamental parts of R:

* "names" are used to name the elements of a vector.
* "dims" make a vector behave like a matrix or array.
* "class" is used to implemenet the S3 object oriented system.

Class is particularly important because it changes what __generic functions__ do with the object.  Generic functions are key to OO in R. Here's what a typical generic function looks like:

```{r}
as.Date
```

The call to "UseMethod" means that this is a generic function, and it will call a specific __method__, based on the class of the first argument.  You can list all the methods for a generic with `methods()`:

```{r}
methods("as.Date")
```

And you can see the specific implementation of a method with `getS3method()`:

```{r}
getS3method("as.Date", "default")
getS3method("as.Date", "numeric")
```

The most important S3 generic is print: it controls how the object is printed when you type its name on the console.

A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at <http://adv-r.had.co.nz/OO-essentials.html#s3>.

### Factors

Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:

```{r}
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
typeof(x)
attributes(x)
```

Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \<sigh\>](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley.  The motivation for factors is the modelling context. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow"

The packages in this book keep characters as is, but you will need to deal with them if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first. Often there will be `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn back into a factor.

```{r}
x <- factor(letters[1:5])
is.factor(x)
as.factor(letters[1:5])
```

### Dates

Dates in R are numeric vectors (sometimes integers, sometimes doubles) that represent the number of days since 1 January 1970.

```{r}
x <- as.Date("1971-01-01")
unclass(x)

typeof(x)
attributes(x)
```

### Date times

Date times are numeric vectors (sometimes integers, sometimes doubles) that represent the number of seconds since 1 January 1970:

```{r}
x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)

typeof(x)
attributes(x)
```

The `tzone` is optional, and only controls the display not the meaning.

There is another type of datetimes called POSIXlt. These are built on top of lists.

```{r}
y <- as.POSIXlt(x)
typeof(y)
attributes(y)
```

As far as I know there is no case in which you need POSIXlt. If you find you have a POSIXlt, convert it to a POSIXct with `as.POSIXct()`.

## Recursive vectors (lists)

Lists are the data structure R uses for hierarchical objects. Lists extend atomic vectors to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.

You create a list with `list()`:

```{r}
x <- list(1, 2, 3)
str(x)

x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
```

Unlike atomic vectors, `lists()` can contain a mix of objects:

```{r}
y <- list("a", 1L, 1.5, TRUE)
str(y)
```

Lists can even contain other lists!

```{r}
z <- list(list(1, 2), list(3, 4))
str(z)
```

`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.

### Visualising lists

To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists. For example, take these three lists:

```{r}
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```

I draw them as follows:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-structure.png")
```

* Lists are rounded rectangles that contain their children.
  
* I draw each child a little darker than its parent to make it easier to see 
  the hierarchy.
  
* The orientation of the children (i.e. rows or columns) isn't important, 
  so I'll pick a row or column orientation to either save space or illustrate 
  an important property in the example.

### Subsetting

There are three ways to subset a list, which I'll illustrate with `a`:

```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```

*   `[` extracts a sub-list. The result will always be a list.

    ```{r}
    str(a[1:2])
    str(a[4])
    ```
    
    Like subsetting vectors, you can use an integer vector to select by 
    position, or a character vector to select by name.
    
*   `[[` extracts a single component from a list. It removes a level of 
    hierarchy from the list.

    ```{r}
    str(y[[1]])
    str(y[[4]])
    ```

*   `$` is a shorthand for extracting named elements of a list. It works
    similarly to `[[` except that you don't need to use quotes.
    
    ```{r}
    a$a
    a[["b"]]
    ```

Or visually:

```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/lists-subsetting.png")
```

### Lists of condiments

It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:

```{r, echo = FALSE, out.width = "25%"} 
knitr::include_graphics("images/pepper.jpg")
```

If this pepper shaker is your list `x`, then, `x[1]` is a pepper shaker containing a single pepper packet:

```{r, echo = FALSE, out.width = "25%"} 
knitr::include_graphics("images/pepper-1.jpg")
```

`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets. 

`x[[1]]` is:

```{r, echo = FALSE, out.width = "25%"} 
knitr::include_graphics("images/pepper-2.jpg")
```

If you wanted to get the content of the pepper package, you'd need `x[[1]][[1]]`:

```{r, echo = FALSE, out.width = "25%"} 
knitr::include_graphics("images/pepper-3.jpg")
```

### Exercises

1.  Draw the following lists as nested sets.

1.  Generate the lists corresponding to these nested set diagrams.

1.  What happens if you subset a data frame as if you're subsetting a list?
    What are the key differences between a list and a data frame?

## Data frames

Data frames are augmented lists. 

```{r}
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
attributes(df)
```

Generally, I prefer using `dplyr::data_frame()` instead of `data.frame`. It creates an object that is verty similar:

```{r}
df <- dplyr::data_frame(x = 1:5, y = 5:1)
typeof(df)
attributes(df)
```

* Doesn't convert variable types or variable names. It never uses character
  row names.

* It adds additional classes `tbl_df` to give better printing and subsetting
  behaviour.

## Predicates

|                  | lgl | int | dbl | chr | list | null |
|------------------|-----|-----|-----|-----|------|------|
| `is_logical()`   |  x  |     |     |     |      |      |
| `is_integer()`   |     |  x  |     |     |      |      |
| `is_double()`    |     |     |  x  |     |      |      |
| `is_numeric()`   |     |  x  |  x  |     |      |      |
| `is_character()` |     |     |     |  x  |      |      |
| `is_atomic()`    |  x  |  x  |  x  |  x  |      |      |
| `is_list()`      |     |     |     |     |  x   |      |
| `is_vector()`    |  x  |  x  |  x  |  x  |  x   |      |
| `is_null()`      |     |     |     |     |      | x    |

Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less surprising: 

```{r}
is.atomic(NULL)
is_atomic(NULL)

is.vector(factor("a"))
is_vector(factor("a"))
```

I recommend using these instead of the base functions.

Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.

```{r}
y <- factor(c("a", "b", "c"))
is_integer(y)
is_scalar_integer(y)
is_bare_integer(y)
```

### Exercises

1.  Carefully read the documentation of `is.vector()`. What does it actually
    test for?
Make sure first element is heading 2015-12-12 02:34:20 +08:00			`# Data structures`
Sketch out complete book outline 2015-12-06 22:02:29 +08:00
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00			```{r, include = FALSE}
			`library(purrr)`
			```

Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`As you start to write more functions, and as you want your functions to work with more types of inputs, it's useful to have some grounding in the underlying data structures that R is built on. This chapter will dive deeper into the objects that you've already used, helping you better understand how things work.`
Sketch out complete book outline 2015-12-06 22:02:29 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`The most important class of objects in R is the __vector__. Every vector has two key properties:`
Sketch out complete book outline 2015-12-06 22:02:29 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`1. Its type, whether it's logical, numeric, character, and so on. You`
			can determine the type of any R object with `typeof()`.

			2. Its length, which you can retrieve with `length()`.

			`Vectors are broken down into __atomic__ vectors, and __lists__. I call factors, dates, and date times __molecular vectors__ because they're built on top of atomic vectors. Data frames are similar, they're built on top of lists.`

			Note that R does not have "scalars". In R, a single number is a vector of length 1. The impacts of this are mostly on how functions work. Because there are no scalars, most built-in functions are vectorised, meaning that they will operate on a vector of numbers. That's why, for example, you can write `1:10 + 10:1`.

			`## Atomic vectors`

			`There are four important types of atomic vector:`
Sketch out complete book outline 2015-12-06 22:02:29 +08:00
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00			`* logical`
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`* integer`
			`* double`
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00			`* character`

Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`Collectively, integer and double vectors are known as numeric vectors. Most of the time the distinction between integers and doubles is not important in R, so we'll discuss them together.`
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`(There are also two rarer atomic vectors: raw and complex. They're beyond the scope of this book because they are rarely needed to do data analysis)`
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`### Logical`
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			Logical vectors are simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`. Logical vectors are usually constructed with comparison operators, as described in [comparisons].
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			In numeric contexts, `TRUE` is converted to `1`, `FALSE` converted to 0. That means the sum of a logical vector is the number of trues, and the mean of a logical vector is the proportion of trues.
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			```{r}
			`x <- sample(20, 100, replace = TRUE)`
			`y <- x > 10`
			`sum(y)`
			`mean(y)`
			```
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`### Numeric`
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			Numeric vectors encompasses both integers and doubles (real numbers). For large data, there is some small advantage to using the integer data type if you really have integers, but in most cases the differences are immaterial. In R, numbers are doubles by default. To make an integer, use a `L` after the number:
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
			```{r}
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`typeof(1)`
			`typeof(1L)`
			```
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`There are two cases where you need to be aware of the differences between doubles and integers. Firstly, never test for equality on a double. There can be very small differences that don't print out by default. These differences arise because a double is represented using a fixed number of (binary) digits. For example, what should you get if you square the square-root of two?`

			```{r}
			`x <- sqrt(2) ^ 2`
			`x`
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00			```

Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			It certainly looks like we get what we expect: `2`. But things are not exactly as they seem:
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			```{r}
			`x == 2`
			`x - 2`
			```

			The number we've computed is actually slightly different to 2. To avoid this sort of comparison difficulty, you can use the `near()` function from dplyr (available in 0.5).

			```{r, eval = packageVersion("dplyr") >= "0.4.3.9000"}
			`dplyr::near(x, 2)`
			```

			The other important thing to know about doubles is that they have three special values in addition to `NA`:
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
			```{r}
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`c(-1, 0, 1) / 0`
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00			```

Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			Like with missing values, you should avoid using `==` to check for these other special values. Instead use `is.finite()`, `is.infinite()`, and `is.nan()`:
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`\| \| 0 \| Inf \| NA \| NaN \|`
			`\|------------------\|-----\|-----\|-----\|-----\|`
			\| `is.finite()` \| x \| \| \| \|
			\| `is.infinite()` \| \| x \| \| \|
			\| `is.na()` \| \| \| x \| x \|
			\| `is.nan()` \| \| \| \| x \|
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`### Character`
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`Each element of a character vector is a string.`
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
			```{r}
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`x <- c("abc", "def", "ghijklmnopqrs")`
			`typeof(x)`
			```
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`You learned how to manipulate these vectors in [strings].`
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
More on data structures 2016-03-14 22:11:24 +08:00			`## Subsetting`

Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
More on data structures 2016-03-14 22:11:24 +08:00			`## Augmented vectors`

			There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these augmented vectors, because they are atomic vectors with additional __attributes__. Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00
			```{r}
			`x <- 1:10`
			`attr(x, "greeting")`
			`attr(x, "greeting") <- "Hi!"`
			`attr(x, "farewell") <- "Bye!"`
			`attributes(x)`
			```
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
More on data structures 2016-03-14 22:11:24 +08:00			`There are three very important attributes that are used to implement fundamental parts of R:`

			`* "names" are used to name the elements of a vector.`
			`* "dims" make a vector behave like a matrix or array.`
			`* "class" is used to implemenet the S3 object oriented system.`

			`Class is particularly important because it changes what __generic functions__ do with the object. Generic functions are key to OO in R. Here's what a typical generic function looks like:`

			```{r}
			`as.Date`
			```

			The call to "UseMethod" means that this is a generic function, and it will call a specific __method__, based on the class of the first argument. You can list all the methods for a generic with `methods()`:

			```{r}
			`methods("as.Date")`
			```

			And you can see the specific implementation of a method with `getS3method()`:

			```{r}
			`getS3method("as.Date", "default")`
			`getS3method("as.Date", "numeric")`
			```

			`The most important S3 generic is print: it controls how the object is printed when you type its name on the console.`

			`A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at <http://adv-r.had.co.nz/OO-essentials.html#s3>.`
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00
			`### Factors`
Sketch out complete book outline 2015-12-06 22:02:29 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:`

			```{r}
			`x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))`
			`typeof(x)`
			`attributes(x)`
			```

			Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \<sigh\>](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is the modelling context. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow"

More on data structures 2016-03-14 22:11:24 +08:00			The packages in this book keep characters as is, but you will need to deal with them if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first. Often there will be `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn back into a factor.

			```{r}
			`x <- factor(letters[1:5])`
			`is.factor(x)`
			`as.factor(letters[1:5])`
			```
Sketch out complete book outline 2015-12-06 22:02:29 +08:00
Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00			`### Dates`

Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`Dates in R are numeric vectors (sometimes integers, sometimes doubles) that represent the number of days since 1 January 1970.`

			```{r}
			`x <- as.Date("1971-01-01")`
			`unclass(x)`

			`typeof(x)`
			`attributes(x)`
			```

Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00			`### Date times`

Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`Date times are numeric vectors (sometimes integers, sometimes doubles) that represent the number of seconds since 1 January 1970:`

			```{r}
			`x <- lubridate::ymd_hm("1970-01-01 01:00")`
			`unclass(x)`

			`typeof(x)`
			`attributes(x)`
			```

			The `tzone` is optional, and only controls the display not the meaning.

			`There is another type of datetimes called POSIXlt. These are built on top of lists.`

			```{r}
			`y <- as.POSIXlt(x)`
			`typeof(y)`
			`attributes(y)`
			```

			As far as I know there is no case in which you need POSIXlt. If you find you have a POSIXlt, convert it to a POSIXct with `as.POSIXct()`.

Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00			`## Recursive vectors (lists)`

More on data structures 2016-03-14 22:11:24 +08:00			`Lists are the data structure R uses for hierarchical objects. Lists extend atomic vectors to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.`
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
			You create a list with `list()`:

			```{r}
			`x <- list(1, 2, 3)`
			`str(x)`

			`x_named <- list(a = 1, b = 2, c = 3)`
			`str(x_named)`
			```

			Unlike atomic vectors, `lists()` can contain a mix of objects:

			```{r}
			`y <- list("a", 1L, 1.5, TRUE)`
			`str(y)`
			```

			`Lists can even contain other lists!`

			```{r}
			`z <- list(list(1, 2), list(3, 4))`
			`str(z)`
			```

			`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.

			`### Visualising lists`

			`To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists. For example, take these three lists:`

			```{r}
			`x1 <- list(c(1, 2), c(3, 4))`
			`x2 <- list(list(1, 2), list(3, 4))`
			`x3 <- list(1, list(2, list(3)))`
			```

			`I draw them as follows:`

			```{r, echo = FALSE, out.width = "75%"}
			`knitr::include_graphics("diagrams/lists-structure.png")`
			```

			`* Lists are rounded rectangles that contain their children.`

			`* I draw each child a little darker than its parent to make it easier to see`
			`the hierarchy.`

			`* The orientation of the children (i.e. rows or columns) isn't important,`
			`so I'll pick a row or column orientation to either save space or illustrate`
			`an important property in the example.`

			`### Subsetting`

			There are three ways to subset a list, which I'll illustrate with `a`:

			```{r}
			`a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))`
			```

			* `[` extracts a sub-list. The result will always be a list.

			```{r}
			`str(a[1:2])`
			`str(a[4])`
			```

			`Like subsetting vectors, you can use an integer vector to select by`
			`position, or a character vector to select by name.`

			* `[[` extracts a single component from a list. It removes a level of
			`hierarchy from the list.`

			```{r}
			`str(y[[1]])`
			`str(y[[4]])`
			```

			* `$` is a shorthand for extracting named elements of a list. It works
			similarly to `[[` except that you don't need to use quotes.

			```{r}
			`a$a`
			`a[["b"]]`
			```

			`Or visually:`

			```{r, echo = FALSE, out.width = "75%"}
			`knitr::include_graphics("diagrams/lists-subsetting.png")`
			```

			`### Lists of condiments`

			It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:

			```{r, echo = FALSE, out.width = "25%"}
			`knitr::include_graphics("images/pepper.jpg")`
			```

			If this pepper shaker is your list `x`, then, `x[1]` is a pepper shaker containing a single pepper packet:

			```{r, echo = FALSE, out.width = "25%"}
			`knitr::include_graphics("images/pepper-1.jpg")`
			```

			`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.

			`x[[1]]` is:

			```{r, echo = FALSE, out.width = "25%"}
			`knitr::include_graphics("images/pepper-2.jpg")`
			```

			If you wanted to get the content of the pepper package, you'd need `x[[1]][[1]]`:

			```{r, echo = FALSE, out.width = "25%"}
			`knitr::include_graphics("images/pepper-3.jpg")`
			```

			`### Exercises`

			`1. Draw the following lists as nested sets.`

			`1. Generate the lists corresponding to these nested set diagrams.`

			`1. What happens if you subset a data frame as if you're subsetting a list?`
			`What are the key differences between a list and a data frame?`

Starting to work on expressing yourself 2016-01-21 23:51:02 +08:00			`## Data frames`

More on data structures 2016-03-14 22:11:24 +08:00			`Data frames are augmented lists.`
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00
More on data structures 2016-03-14 22:11:24 +08:00			```{r}
			`df <- data.frame(x = 1:5, y = 5:1)`
			`typeof(df)`
			`attributes(df)`
			```

			Generally, I prefer using `dplyr::data_frame()` instead of `data.frame`. It creates an object that is verty similar:

			```{r}
			`df <- dplyr::data_frame(x = 1:5, y = 5:1)`
			`typeof(df)`
			`attributes(df)`
			```

			`* Doesn't convert variable types or variable names. It never uses character`
			`row names.`

			* It adds additional classes `tbl_df` to give better printing and subsetting
			`behaviour.`
Sketch out complete book outline 2015-12-06 22:02:29 +08:00
More on data structures 2016-03-14 22:11:24 +08:00			`## Predicates`
Shuffle programming structure. Also improve intro 2016-03-10 22:46:06 +08:00
Starting brain dump of data structures 2016-03-11 23:10:20 +08:00			`\| \| lgl \| int \| dbl \| chr \| list \| null \|`
			`\|------------------\|-----\|-----\|-----\|-----\|------\|------\|`
			\| `is_logical()` \| x \| \| \| \| \| \|
			\| `is_integer()` \| \| x \| \| \| \| \|
			\| `is_double()` \| \| \| x \| \| \| \|
			\| `is_numeric()` \| \| x \| x \| \| \| \|
			\| `is_character()` \| \| \| \| x \| \| \|
			\| `is_atomic()` \| x \| x \| x \| x \| \| \|
			\| `is_list()` \| \| \| \| \| x \| \|
			\| `is_vector()` \| x \| x \| x \| x \| x \| \|
			\| `is_null()` \| \| \| \| \| \| x \|

			`Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less surprising:`

			```{r}
			`is.atomic(NULL)`
			`is_atomic(NULL)`

			`is.vector(factor("a"))`
			`is_vector(factor("a"))`
			```

			`I recommend using these instead of the base functions.`

			`Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.`

			```{r}
			`y <- factor(c("a", "b", "c"))`
			`is_integer(y)`
			`is_scalar_integer(y)`
			`is_bare_integer(y)`
			```

			`### Exercises`

			1. Carefully read the documentation of `is.vector()`. What does it actually
			`test for?`