Update EDA

Based on @behrman comments
hadley 2016-07-26 16:01:19 -05:00
parent a20833a72b
commit fba9278416
6 changed files with 44 additions and 43 deletions

EDA.Rmd

@ -83,7 +83,7 @@ Every variable has its own pattern of variation, which can reveal interesting in
### Visualising distributions
How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only have a finite (or countably infinite) set of unique values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart:
How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart:
```{r}
ggplot(data = diamonds) +
@ -96,7 +96,7 @@ The height of the bars displays how many observations occurred with each x value
diamonds %>% count(cut)
```
A variable is **continuous** if you can arrange its values in order _and_ an infinite number of unique values can exist between any two values of the variable. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
A variable is **continuous** if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
```{r}
ggplot(data = diamonds) +
@ -109,9 +109,9 @@ You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_wid
diamonds %>% count(cut_width(carat, 0.5))
```
A histogram divides the x axis into equally spaced bins and then uses the height of bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a $carat$ value between 0.25 and 0.75, which are the left and right edges of the bar.
A histogram divides the x axis into equally spaced bins and then uses the height of each bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the $x$ variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
```{r}
smaller <- diamonds %>% filter(carat < 3)
@ -214,7 +214,8 @@ When you discover an outlier, it's a good idea to trace it back as far as possib
or surprising? (Hint: carefully think about the `binwidth` and make sure
you)
1. How many diamonds have 0.99 carats? Why?
1. How many diamonds are 0.99 carat? How many are 1 carat? What
do you think is the cause of the difference?
1. Compare and contrast `coord_cartesian()` vs `xlim()`/`ylim()` when
zooming in on a histogram. What happens if you leave `binwidth` unset?
@ -316,7 +317,7 @@ There's something rather surprising about this plot - it appears that fair diamo
Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:
* A box that stretches from the 25th percentile of the distribution to the
75th percentile, a distance known as the Inter-Quartile Range (IQR). In the
75th percentile, a distance known as the interquartile range (IQR). In the
middle of the box is a line that displays the median, i.e. 50th percentile,
of the distribution. These three lines give you a sense of the spread of the
distribution and whether or not the distribution is symmetric about the
@ -349,7 +350,7 @@ ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
```
Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, reorder $x$ variable with `reorder()`. This code reorders the `class` based on the median value of `hwy` in each group.
Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, reorder the `x` variable with `reorder()`. This code reorders `class` based on the median value of `hwy` in each group.
```{r fig.height = 3}
ggplot(data = mpg) +
@ -497,8 +498,8 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
1. Two dimensional plots reveal outliers that are not visible in one
dimensional plots. For example, some points in the plot below have an
unusual combination of $x$ and $y$ values, which makes the points outliers
even though their $x$ and $y$ values appear normal when examined separately.
unusual combination of `x` and `y` values, which makes the points outliers
even though their `x` and `y` values appear normal when examined separately.
```{r, dev = "png"}
ggplot(data = diamonds) +
@ -553,7 +554,7 @@ ggplot(data = diamonds2, mapping = aes(x = cut, y = resid)) +
geom_boxplot()
```
You haven't learn more modelling yet because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
We're saving modelling for later because understanding what models are and how they work is easiest once you have the tools of data wrangling and programming in hand.
## ggplot2 calls

@ -2,11 +2,11 @@
## Introduction
This chapter will show you how to work with dates and times in R. Dates and times follow their own rules, which can make working with them difficult. For example dates and times are ordered, like numbers; but the timeline is not as orderly as the number line. The timeline repeats itself, and has noticeable gaps due to Daylight Savings Time, leap years, and leap seconds. Datetimes also rely on ambiguous units: How long is a month? How long is a year? Time zones give you another headache when you work with dates and times. The same instant of time will have different "names" in different time zones.
This chapter will show you how to work with dates and times in R. Dates and times follow their own rules, which can make working with them difficult. For example, dates and times are ordered, like numbers; but the timeline is not as orderly as the number line. The timeline repeats itself, and has noticeable gaps due to Daylight Savings Time, leap years, and leap seconds. Date-times also rely on ambiguous units: How long is a month? How long is a year? Time zones give you another headache when you work with dates and times. The same instant of time will have different "names" in different time zones.
### Prerequisites
This chapter will focus on R's __lubridate__ package, which makes it much easier to work with dates and times in R. You'll learn the basic date time structures in R and the lubridate functions that make working with them easy. We will use `nycflights13` for practice data, and use some packages for EDA.
This chapter will focus on R's __lubridate__ package, which makes it much easier to work with dates and times in R. You'll learn the basic date-time structures in R and the lubridate functions that make working with them easy. We will use `nycflights13` for practice data, and use some packages for EDA.
```{r message = FALSE}
library(lubridate)
@ -25,7 +25,7 @@ flights %>%
select(year, month, day, hour, minute)
```
Getting R to agree that your dataset contains the dates and times that you think it does can be tricky. Lubridate simplifies that. To combine separate numbers into datetimes, use `make_datetime()`.
Getting R to agree that your dataset contains the dates and times that you think it does can be tricky. Lubridate simplifies that. To combine separate numbers into date-times, use `make_datetime()`.
```{r}
datetimes <- flights %>%
@ -77,20 +77,20 @@ ymd_hms("2017-01-31 20:11:59", tz = "America/New_York")
#### The structure of dates and times
What have we accomplished by parsing our datetimes? R now recognizes that our departure and arrival variables contain datetime information, and it saves the variables in the POSIXct format, a common way of representing dates and times.
What have we accomplished by parsing our date-times? R now recognizes that our departure and arrival variables contain date-time information, and it saves the variables in the POSIXct format, a common way of representing dates and times.
```{r}
class(datetimes$departure[1])
```
In POSIXct form, each datetime is saved as the number of seconds that passed between the datetime and midnight January 1st, 1970 in the Coordinated Universal Time zone. Under this system, the very first moment of January 1st, 1970 gets the number zero. Earlier moments get a negative number.
In POSIXct form, each date-time is saved as the number of seconds that passed between the date-time and midnight January 1st, 1970 in the Coordinated Universal Time zone. Under this system, the very first moment of January 1st, 1970 gets the number zero. Earlier moments get a negative number.
```{r}
unclass(datetimes$departure[1])
unclass(ymd_hms("1970-01-01 00:00:00"))
```
The POSIXct format has many advantages. You can display the same date time in any time zone by changing its tzone attribute (more on that later), and R can recognize when two times displayed in two different time zones refer to the same moment.
The POSIXct format has many advantages. You can display the same date-time in any time zone by changing its tzone attribute (more on that later), and R can recognize when two times displayed in two different time zones refer to the same moment.
```{r warning = FALSE}
(zero_hour <- ymd_hms("1970-01-01 00:00:00"))
@ -99,7 +99,7 @@ zero_hour
ymd_hms("1970-01-01 00:00:00") == ymd_hms("1970-01-01 00:00:00", tz = "America/Denver")
```
Best of all, you can change a datetime by adding or subtracting seconds from it.
Best of all, you can change a date-time by adding or subtracting seconds from it.
```{r}
ymd_hms("1970-01-01 00:00:00") + 1
@ -123,7 +123,7 @@ class(zero_day)
zero_day - 1
```
R can also save datetimes in the POSIXlt form, a list based date structure. Working with POSIXlt dates can be much slower than working with POSIXct dates, and I don't recommend it. Lubridate's parse functions will always return a POSIXct date when you supply an hour, minutes, or seconds component.
R can also save date-times in the POSIXlt form, a list based date structure. Working with POSIXlt dates can be much slower than working with POSIXct dates, and I don't recommend it. Lubridate's parse functions will always return a POSIXct date when you supply an hour, minutes, or seconds component.
## Arithmetic with dates
@ -142,7 +142,7 @@ However, the conversion to seconds becomes tedious and introduces a chance for e
### Difftimes
A difftime class object records a span of time in one of seconds, minutes, hours, days, or weeks. R creates a difftime whenever you subtract two dates or two datetimes.
A difftime class object records a span of time in one of seconds, minutes, hours, days, or weeks. R creates a difftime whenever you subtract two dates or two date-times.
```{r}
(day1 <- ymd("2000-01-01") - ymd("1999-12-31"))
@ -181,7 +181,7 @@ To make a duration that lasts multiple units, pass the number of units as the ar
dminutes(3)
```
This syntax provides a very clean way to do arithmetic with datetimes. For example, we can recreate our scheduled departure and arrival times with
This syntax provides a very clean way to do arithmetic with date-times. For example, we can recreate our scheduled departure and arrival times with
```{r}
(datetimes <- datetimes %>%
@ -200,13 +200,13 @@ For example, Daylight Savings Time can result in this sort of surprise.
ymd_hms("2016-03-13 00:00:00", tz = "America/New_York") + ddays(1)
```
Luckily, the UTC time zone does not use Daylight Savings Time, so if you keep your datetimes in UTC you can avoid this type of complexity. But what if you do need to work with Daylight Savings Time (or leap years or months, two other places where the time line can misbehave [^1])?
Luckily, the UTC time zone does not use Daylight Savings Time, so if you keep your date-times in UTC you can avoid this type of complexity. But what if you do need to work with Daylight Savings Time (or leap years or months, two other places where the time line can misbehave [^1])?
[^1]: Technically, the timeline also misbehaves during __leap seconds__, extra seconds that are added to the timeline to account for changes in the Earth's movement. In practice, most operating systems ignore leap seconds, and R follows the behavior of the operating system. If you are curious about when leap seconds occur, R lists them under `.leap.seconds`.
### Periods
You can use lubridate's period class to handle irregularities in the timeline. Periods are time spans that are generalized to work with clock times, the "name" of a datetime that you would see on a clock, like "2016-03-13 00:00:00." Periods have no fixed length, which lets them work in an intuitive, human friendly way. When you add a one day period to "2000-03-13 00:00:00" the result will be "2000-03-14 00:00:00" whether there were 86400 seconds in March 13, 2000 or 82800 seconds (due to Daylight Savings Time).
You can use lubridate's period class to handle irregularities in the timeline. Periods are time spans that are generalized to work with clock times, the "name" of a date-time that you would see on a clock, like "2016-03-13 00:00:00." Periods have no fixed length, which lets them work in an intuitive, human friendly way. When you add a one day period to "2000-03-13 00:00:00" the result will be "2000-03-14 00:00:00" whether there were 86400 seconds in March 13, 2000 or 82800 seconds (due to Daylight Savings Time).
To make a period object, call the name of the unit you wish to use, make it plural, and pass it the number of units to use as an argument.
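The difference between a duration and a period across a Daylight Savings Time change can be sketched as follows. This is only an illustration, assuming lubridate is installed and that the system time zone database knows "America/New_York"; the variable name `start` is made up for the example.

```{r}
library(lubridate)

# Midnight at the start of the day on which New York clocks spring forward in 2016
start <- ymd_hms("2016-03-13 00:00:00", tz = "America/New_York")

# A duration always adds a fixed number of seconds (86400 for one day),
# so the clock time shifts when the day is only 23 hours long
start + ddays(1)

# A period adds one unit of clock time, so the time of day is preserved
start + days(1)
```

Note the naming pattern: duration helpers carry a `d` prefix (`ddays()`), while period helpers are the plain plural unit name (`days()`).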
@ -254,7 +254,7 @@ When should you use a period and when should you use a duration?
* Use periods whenever you need to model human events, such as the opening of the stock market, or the close of the business day.
Periods also let you model datetimes that reoccur on a monthly basis in a way that would be impossible with durations. Consider that some of the months below are 31 days, some have 30, and one has 29.
Periods also let you model date-times that reoccur on a monthly basis in a way that would be impossible with durations. Consider that some of the months below are 31 days, some have 30, and one has 29.
```{r}
mdy("January 1st, 2016") + months(0:11)
@ -328,7 +328,7 @@ datetimes %>%
Let's instead group flights by day of the week, to see which week days are the busiest, and by hour to see which times of the day are busiest. To do this we will need to extract the day of the week and hour that each flight was scheduled to depart.
You can extract the year, month, day of the year (yday), day of the month (mday), day of the week (wday), hour, minute, second, and time zone (tz) of any date or datetime with lubridate's accessor functions. Use the function that has the name of the unit you wish to extract. Accessor function names are singular, period function names are plural.
You can extract the year, month, day of the year (yday), day of the month (mday), day of the week (wday), hour, minute, second, and time zone (tz) of any date or date-time with lubridate's accessor functions. Use the function that has the name of the unit you wish to extract. Accessor function names are singular, period function names are plural.
```{r}
(datetime <- ymd_hms("2007-08-09 12:34:56", tz = "America/Los_Angeles"))
@ -409,7 +409,7 @@ datetimes %>%
### Setting dates
You can also use each accessor function to set the components of a date or datetime.
You can also use each accessor function to set the components of a date or date-time.
```{r}
datetime
@ -442,7 +442,7 @@ update(datetime, year = 2002, month = 2, mday = 2, hour = 2,
## Time zones
R records the time zone of each datetime as an attribute of the datetime object. This makes time zones tricky to work with. For example, a vector of datetimes can only contain one time zone attribute, so every datetime in the vector must share the same time zone.
R records the time zone of each date-time as an attribute of the date-time object. This makes time zones tricky to work with. For example, a vector of date-times can only contain one time zone attribute, so every date-time in the vector must share the same time zone.
```{r}
(firsts <- ymd_hms("2000-01-01 12:00:00") + months(0:11))
@ -453,7 +453,7 @@ unclass(firsts)
firsts
```
Operations that drop attributes, such as `c()` will drop the time zone attribute from your datetimes. In that case, the datetimes will display in your local time zone (mine is "America/New_York", i.e. Eastern Time).
Operations that drop attributes, such as `c()` will drop the time zone attribute from your date-times. In that case, the date-times will display in your local time zone (mine is "America/New_York", i.e. Eastern Time).
```{r}
(jan_day <- ymd_hms("2000-01-01 12:00:00"))
@ -470,9 +470,9 @@ You can set the time zone of a date with the tz argument when you parse the date
ymd_hms("2016-01-01 00:00:01", tz = "Pacific/Auckland")
```
If you do not set the time zone, lubridate will automatically assign the datetime to Coordinated Universal Time (UTC). Coordinated Universal Time is the standard time zone used by the scientific community and roughly equates to its predecessor, Greenwich Meridian Time. Since Coordinated Universal time does not follow Daylight Savings Time, it is straightforward to work with times saved in this time zone.
If you do not set the time zone, lubridate will automatically assign the date-time to Coordinated Universal Time (UTC). Coordinated Universal Time is the standard time zone used by the scientific community and roughly equates to its predecessor, Greenwich Mean Time. Since Coordinated Universal Time does not follow Daylight Savings Time, it is straightforward to work with times saved in this time zone.
You can change the time zone of a date time in two ways. First, you can display the same instant of time in a different time zone with lubridate's `with_tz()` function.
You can change the time zone of a date-time in two ways. First, you can display the same instant of time in a different time zone with lubridate's `with_tz()` function.
```{r}
jan_day
@ -585,7 +585,7 @@ datetimes2 %>%
## Intervals of time
An interval of time is a specific period of time, such as midnight April 13, 2013 to midnight April 23, 2013. You can make an interval of time with lubridate's `interval()` function. Pass it the start and end datetimes of the interval. Use the tzone argument to select a time zone to display the interval in (if you wish to display the interval in a different time zone than that of the start date).
An interval of time is a specific period of time, such as midnight April 13, 2013 to midnight April 23, 2013. You can make an interval of time with lubridate's `interval()` function. Pass it the start and end date-times of the interval. Use the tzone argument to select a time zone to display the interval in (if you wish to display the interval in a different time zone than that of the start date).
```{r}

@ -277,11 +277,11 @@ The first argument to `guess_encoding()` can either be a path to a file, or, as
Encodings are a rich and complex topic, and I've only scratched the surface here. If you'd like to learn more I'd recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Dates, date times, and times
### Dates, date-times, and times
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). When called without any additional arguments:
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). When called without any additional arguments:
* `parse_datetime()` expects an ISO8601 date time. ISO8601 is an
* `parse_datetime()` expects an ISO8601 date-time. ISO8601 is an
international standard in which the components of a date are
organised from biggest to smallest: year, month, day, hour, minute,
second.
@ -315,7 +315,7 @@ You pick between three parsers depending on whether you want a date (the number
Base R doesn't have a great built in class for time data, so we use
the one provided in the hms package.
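As a rough illustration of the three default parsers described above, the sketch below calls each one on an ISO8601-style string. It assumes readr (and the hms package it relies on) is installed; the variable names and input strings are made up for the example.

```{r}
library(readr)

# Date-time: year, month, day, hour, minute, second, biggest to smallest
dt <- parse_datetime("2010-10-01T20:10:00")

# Date: four-digit year, then month, then day
d <- parse_date("2010-10-01")

# Time: hours, minutes and seconds, stored as seconds since midnight
tm <- parse_time("20:10:01")

class(dt)
class(d)
class(tm)
```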
If these defaults don't work for your data you can supply your own datetime `format`, built up of the following pieces:
If these defaults don't work for your data you can supply your own date-time `format`, built up of the following pieces:
Year
: `%Y` (4 digits).
@ -418,7 +418,7 @@ The heuristic tries each of the following types, stopping when it finds a match:
* number: contains valid doubles with the grouping mark inside.
* time: matches the default `time_format`.
* date: matches the default `date_format`.
* date time: any ISO8601 date.
* date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
@ -552,7 +552,7 @@ readr also comes with two useful functions for writing data back to disk: `write
(a "byte order mark") at the start of the file which tells Excel that
you're using the UTF-8 encoding.
* Saving dates and datetimes in ISO8601 format so they are easily
* Saving dates and date-times in ISO8601 format so they are easily
parsed elsewhere.
The most important arguments are `x` (the data frame to save), and `path` (the location to save it). You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
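A minimal sketch of these arguments, assuming readr is installed. The data frame `df` and the temporary file `tmp` are made up for the example, and the location is passed by position because the name of that argument may differ between readr versions.

```{r}
library(readr)

# A small data frame with missing values in both columns
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

# Write to a temporary file, representing missing values as empty strings
tmp <- tempfile(fileext = ".csv")
write_csv(df, tmp, na = "")

# Append the same rows again; the header line is not repeated by default
write_csv(df, tmp, na = "", append = TRUE)

# Inspect the raw lines that ended up on disk
read_lines(tmp)
```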

@ -21,7 +21,7 @@ It's a challenge to know when to stop. You need to figure out when your model is
### Prerequisites
We'll start with modelling and EDA tools we needed in the last chapter. Then we'll add in some real datasets: `diamonds` from ggplot2, and `flights` from nycflights13. We'll also need lubridate to extract useful components of datetimes.
We'll start with modelling and EDA tools we needed in the last chapter. Then we'll add in some real datasets: `diamonds` from ggplot2, and `flights` from nycflights13. We'll also need lubridate to extract useful components of date-times.
```{r setup, message = FALSE}
# Modelling functions

@ -50,7 +50,7 @@ Every vector has two key properties:
Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create __augmented vectors__ which build on additional behaviour. There are four important types of augmented vector:
* Factors and dates are built on top of integer vectors.
* Date times (POSIXct) are built on of double vectors.
* Date-times (POSIXct) are built on top of double vectors.
* Data frames and tibbles are built on top of lists.
This chapter will introduce you to these important vectors from simplest to most complicated. You'll start with atomic vectors, then build up to lists, and finally learn about augmented vectors.
@ -523,7 +523,7 @@ knitr::include_graphics("images/pepper-3.jpg")
## Augmented vectors
Atomic vectors and lists are the building blocks for four other important vector types: factors, dates, date times, and data frames. I call these __augmented vectors__, because they are vectors with additional __attributes__.
Atomic vectors and lists are the building blocks for four other important vector types: factors, dates, date-times, and data frames. I call these __augmented vectors__, because they are vectors with additional __attributes__.
Attributes are a way of adding arbitrary additional metadata to a vector. You can think of attributes as a named list of vectors that can be attached to any object. You can get and set individual attribute values with `attr()` or see them all at once with `attributes()`.
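A short sketch of getting and setting attributes with base R. The attribute names `greeting` and `farewell` are arbitrary and chosen only for illustration.

```{r}
x <- 1:10

# No attribute called "greeting" has been set yet, so this returns NULL
attr(x, "greeting")

# Attach two arbitrary pieces of metadata to the vector
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"

# See every attribute at once, as a named list
attributes(x)
```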
@ -586,7 +586,7 @@ is.factor(x)
as.factor(letters[1:5])
```
### Dates and date times
### Dates and date-times
Dates in R are numeric vectors (sometimes integers, sometimes doubles) that represent the number of days since 1 January 1970.
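One way to see this representation, using only base R, is to strip the class from a date and look at the number underneath. The variable `d` is made up for the example.

```{r}
# One (non-leap) year after the origin date
d <- as.Date("1971-01-01")

# Removing the class reveals the underlying number of days since 1970-01-01
unclass(d)

# The storage type and the class that makes R treat the number as a date
typeof(d)
class(d)
```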
@ -598,7 +598,7 @@ typeof(x)
attributes(x)
```
Date times are numeric vectors (sometimes integers, sometimes doubles) that represent the number of seconds since 1 January 1970:
Date-times are numeric vectors (sometimes integers, sometimes doubles) that represent the number of seconds since 1 January 1970:
```{r}
x <- lubridate::ymd_hm("1970-01-01 01:00")
@ -619,7 +619,7 @@ log(-1)
1
```
There is another type of datetimes called POSIXlt. These are built on top of named lists:
There is another type of date-time called POSIXlt. These are built on top of named lists:
```{r}
y <- as.POSIXlt(x)

@ -25,7 +25,7 @@ Data wrangling is important because it allows you to work with your own data. You'l
Data wrangling also encompasses data transformation. You've already learned the basics, and now you'll learn new skills for specific types of data:
* [Dates and times] will give you the key tools for working with
dates, and date times.
dates, and date-times.
* [Strings] will introduce regular expressions, a powerful tool for
manipulating strings.