More on data structures

hadley 2016-03-14 09:11:24 -05:00
parent 6f2f9b858d
commit abcf1e38a4
1 changed files with 62 additions and 11 deletions


@ -98,11 +98,13 @@ typeof(x)
You learned how to manipulate these vectors in [strings].
## Molecular vectors
## Subsetting
There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these molecular vectors, to torture the chemistry metaphor a little further. The chief difference between atomic and molecular vectors is that molecular vectors also have __attributes__.
Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
## Augmented vectors
There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these augmented vectors, because they are atomic vectors with additional __attributes__. Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
```{r}
x <- 1:10
@ -112,7 +114,34 @@ attr(x, "farewell") <- "Bye!"
attributes(x)
```
The most important use of attributes in R is to implement the S3 object oriented system. S3 objects have a "class" attribute, which works with __generic functions__ to implement behaviour that differs based on the class of the object. A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
There are three very important attributes that are used to implement fundamental parts of R:
* "names" are used to name the elements of a vector.
* "dim" makes a vector behave like a matrix or array.
* "class" is used to implement the S3 object oriented system.
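As a minimal sketch of these three attributes in base R, the following sets each one through its usual helper and then inspects the result. The variable names `v`, `m`, and `f` are purely illustrative. R stores the matrix shape under the attribute name `dim`.

```r
# names: label the elements of a vector
v <- c(a = 1, b = 2, c = 3)
attributes(v)          # a list holding the names attribute

# dim: give a plain vector a matrix shape
m <- 1:6
dim(m) <- c(2, 3)      # 2 rows, 3 columns, filled column by column
is.matrix(m)

# class: mark an object for S3 dispatch
f <- factor(c("x", "y"))
class(f)
attr(f, "class")       # the same value, fetched with attr()
```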
Class is particularly important because it changes what __generic functions__ do with the object. Generic functions are key to OO in R. Here's what a typical generic function looks like:
```{r}
as.Date
```
The call to "UseMethod" means that this is a generic function, and it will call a specific __method__, based on the class of the first argument. You can list all the methods for a generic with `methods()`:
```{r}
methods("as.Date")
```
And you can see the specific implementation of a method with `getS3method()`:
```{r}
getS3method("as.Date", "default")
getS3method("as.Date", "numeric")
```
The most important S3 generic is print: it controls how the object is printed when you type its name on the console.
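To make the role of `print()` concrete, here is a small sketch that defines a print method for a made-up class. The class name `temperature` and the constructor `new_temperature()` are hypothetical and exist only for this illustration.

```r
# Constructor for a hypothetical "temperature" class (not part of any package)
new_temperature <- function(x) {
  structure(x, class = "temperature")
}

# print() dispatches to this method for objects of class "temperature"
print.temperature <- function(x, ...) {
  cat("<temperature>", paste(format(unclass(x)), "C"), "\n")
  invisible(x)  # print methods conventionally return their input invisibly
}

t1 <- new_temperature(c(21.5, 19))
print(t1)       # typing t1 at the console has the same effect
```

Because the method is found through the class attribute, removing the class with `unclass(t1)` falls back to the default printing of a numeric vector.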
A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at <http://adv-r.had.co.nz/OO-essentials.html#s3>.
### Factors
@ -126,7 +155,13 @@ attributes(x)
Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \<sigh\>](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is the modelling context. If you're going to fit a model to categorical data, you need to know in advance all the possible values. There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow".
The packages in this book keep characters as is, but you will need to deal with factors if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can eliminate it. Often there will be a `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can use `as.character()` to explicitly turn it back into a character vector.
The packages in this book keep characters as is, but you will need to deal with factors if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first place. Often there will be a `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn it back into a character vector.
```{r}
x <- factor(letters[1:5])
is.factor(x)
as.factor(letters[1:5])
```
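The two escape routes described above can be sketched in base R as follows. The object names `f` and `df` and the column name `colour` are illustrative only.

```r
f <- factor(c("red", "blue", "red"))

# as.character() recovers the original strings from the factor's levels
as.character(f)

# as.integer() instead exposes the underlying integer codes
as.integer(f)

# Avoid creating the factor in the first place when building a data frame
df <- data.frame(colour = c("red", "blue", "red"), stringsAsFactors = FALSE)
is.character(df$colour)
```

Calling `as.character()` rather than `as.integer()` matters because the integer codes refer to positions in the sorted levels, not to the original values.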
### Dates
@ -166,7 +201,7 @@ As far as I know there is no case in which you need POSIXlt. If you find you hav
## Recursive vectors (lists)
Lists are the data structure R uses for hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
Lists are the data structure R uses for hierarchical objects. Lists extend atomic vectors to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
You create a list with `list()`:
@ -296,16 +331,32 @@ knitr::include_graphics("images/pepper-3.jpg")
1. What happens if you subset a data frame as if you're subsetting a list?
What are the key differences between a list and a data frame?
## Data frames
## Matrices
Data frames are augmented lists.
## Subsetting
```{r}
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
attributes(df)
```
Not sure where else this should be covered.
Generally, I prefer using `dplyr::data_frame()` instead of `data.frame()`. It creates an object that is very similar:
```{r}
df <- dplyr::data_frame(x = 1:5, y = 5:1)
typeof(df)
attributes(df)
```
* It doesn't convert variable types or variable names, and it never uses
  character row names.
* It adds additional classes `tbl_df` to give better printing and subsetting
behaviour.
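For contrast, the sketch below shows the base R behaviour that the first bullet is reacting to, using only `data.frame()`. The column name `my var` and the objects `df1` and `df2` are made up for illustration.

```r
# By default data.frame() rewrites names into syntactically valid ones
df1 <- data.frame(`my var` = 1:3)
names(df1)

# check.names = FALSE keeps the name exactly as written
df2 <- data.frame(`my var` = 1:3, check.names = FALSE)
names(df2)

# Base data frames always carry a row.names attribute
attributes(df2)
```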
## Predicates
### Predicates
| | lgl | int | dbl | chr | list | null |
|------------------|-----|-----|-----|-----|------|------|
| `is_logical()` | x | | | | | |