From abcf1e38a4874728b8115bb88769242ed685b508 Mon Sep 17 00:00:00 2001
From: hadley
Date: Mon, 14 Mar 2016 09:11:24 -0500
Subject: [PATCH] More on data structures

---
 data-structures.Rmd | 73 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 62 insertions(+), 11 deletions(-)

diff --git a/data-structures.Rmd b/data-structures.Rmd
index 88934e4..c3e0313 100644
--- a/data-structures.Rmd
+++ b/data-structures.Rmd
@@ -98,11 +98,13 @@ typeof(x)
 
 You learned how to manipulate these vectors in [strings].
 
-## Molecular vectors
+## Subsetting
 
-There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these molecular vectors, to torture the chemistry metaphor a little further. The chief difference between atomic and molecular vectors is that molecular vectors also have __attributes__.
 
-Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
+
+## Augmented vectors
+
+There are three important types of vector that are built on top of atomic vectors: factors, dates, and date times. I call these augmented vectors, because they are atomic vectors with additional __attributes__. Attributes are a way of adding arbitrary additional metadata to a vector. Each attribute is a named vector. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
 
 ```{r}
 x <- 1:10
@@ -112,7 +114,34 @@ attr(x, "farewell") <- "Bye!"
 attributes(x)
 ```
 
-The most important use of attributes in R is implement the S3 object oriented system. S3 objects have a "class" attribute, and which work with __generic functions__ to implement behaviour that differs based on the class of the object. A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at .
+There are three very important attributes that are used to implement fundamental parts of R:
+
+* "names" is used to name the elements of a vector.
+* "dim" makes a vector behave like a matrix or array.
+* "class" is used to implement the S3 object oriented system.
+
+Class is particularly important because it changes what __generic functions__ do with the object. Generic functions are key to object oriented programming in R. Here's what a typical generic function looks like:
+
+```{r}
+as.Date
+```
+
+The call to `UseMethod()` means that this is a generic function, and it will call a specific __method__, based on the class of the first argument. You can list all the methods for a generic with `methods()`:
+
+```{r}
+methods("as.Date")
+```
+
+And you can see the specific implementation of a method with `getS3method()`:
+
+```{r}
+getS3method("as.Date", "default")
+getS3method("as.Date", "numeric")
+```
+
+The most important S3 generic is `print()`: it controls how the object is printed when you type its name on the console.
+
+A detailed discussion of S3 is beyond the scope of this book, but you can read more about it at .
 
 ### Factors
 
@@ -126,7 +155,13 @@ attributes(x)
 Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [stringsAsFactors: An unauthorized biography](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [stringsAsFactors = \](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley. The motivation for factors is the modelling context. If you're going to fit a model to categorical data, you need to know in advance all the possible values.
 There's no way to make a prediction for "green" if all you've ever seen is "red", "blue", and "yellow"
 
-The packages in this book keep characters as is, but you will need to deal with them if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can eliminate it. Often there will be `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can use `as.character()` to explicitly turn back into a factor.
+The packages in this book keep characters as is, but you will need to deal with factors if you are working with base R or many other packages. When you encounter a factor, you should first check to see if you can avoid creating it in the first place. Often there will be a `stringsAsFactors` argument that you can set to `FALSE`. Otherwise, you can apply `as.character()` to the column to explicitly turn it back into a character vector.
+
+```{r}
+x <- factor(letters[1:5])
+is.factor(x)
+as.factor(letters[1:5])
+```
 
 ### Dates
 
@@ -166,7 +201,7 @@ As far as I know there is no case in which you need POSIXlt. If you find you hav
 
 ## Recursive vectors (lists)
 
-Lists are the data structure R uses for hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
+Lists are the data structure R uses for hierarchical objects. Lists extend atomic vectors to model objects that are like trees. You can create a hierarchical structure with a list because, unlike atomic vectors, a list can contain other lists.
 
 You create a list with `list()`:
 
@@ -296,16 +331,32 @@ knitr::include_graphics("images/pepper-3.jpg")
 
 1. What happens if you subset a data frame as if you're subsetting a list? What are the key differences between a list and a data frame?
 
-
 ## Data frames
 
-## Matrices
+Data frames are augmented lists.
 
-## Subsetting
+```{r}
+df <- data.frame(x = 1:5, y = 5:1)
+typeof(df)
+attributes(df)
+```
 
-Not sure where else this should be covered.
+Generally, I prefer using `dplyr::data_frame()` instead of `data.frame()`. It creates an object that is very similar:
+
+```{r}
+df <- dplyr::data_frame(x = 1:5, y = 5:1)
+typeof(df)
+attributes(df)
+```
+
+* It doesn't convert variable types or variable names. It never uses character
+  row names.
+
+* It adds additional classes (including `tbl_df`) to give better printing and subsetting
+  behaviour.
+
+## Predicates
 
-### Predicates
 | | lgl | int | dbl | chr | list | null |
 |------------------|-----|-----|-----|-----|------|------|
 | `is_logical()` | x | | | | | |