Date/time proofing

hadley 2016-08-12 12:25:51 -05:00
parent 686254068d
commit 8a43da7124
1 changed files with 30 additions and 38 deletions

@ -2,15 +2,15 @@
## Introduction
This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don't seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. For example:
This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don't seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. To warm up, try these three seemingly simple questions:
* Does every year have 365 days?
* Does every day have 24 hours?
* Does every minute have 60 seconds?
I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year? You might have remembered that many parts of the world use daylight savings time (DST), so that some days have 23 hours, and others have 25. You probably didn't know that some minutes have 61 seconds because every now and then leap seconds are added to keep because the Earth's rotation is gradually slowing down.
I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year? (It has three parts.) You might have remembered that many parts of the world use daylight saving time (DST), so that some days have 23 hours, and others have 25. You probably didn't know that some minutes have 61 seconds: every now and then a leap second is added because the Earth's rotation is gradually slowing down.
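The three-part leap year rule mentioned above can be written down directly. The following is a minimal sketch in base R, assuming the Gregorian calendar; the function name `is_leap_year()` is made up for illustration (lubridate ships its own `leap_year()`):

```{r}
# Hypothetical helper, not part of any package.
# A year is a leap year if it is divisible by 4,
# unless it is divisible by 100,
# in which case it must also be divisible by 400.
is_leap_year <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}

is_leap_year(c(1900, 2000, 2016, 2017))
```

Of these four years, only 2000 and 2016 are leap years: 1900 is divisible by 100 but not by 400, and 2017 is not divisible by 4.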
Dates and times are hard because they have to reconcile two physical phenonmen (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenonmeon including months, time zones, and DST. This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
### Prerequisites
@ -40,7 +40,7 @@ There are three types of date/time data that refer to an instant in time:
as `<dttm>`. Elsewhere in R these are called POSIXct, but I don't think
that's a very useful name.
In this chapter we are only going to focus on dates and date-times. R doesn't have a native class for storing times. If you need one, you can use the hms package.
In this chapter we are only going to focus on dates and date-times as R doesn't have a native class for storing times. If you need one, you can use the __hms__ package.
You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we'll come back to at the end of the chapter.
@ -57,9 +57,11 @@ Otherwise, there are three ways you're likely to create a date/time:
* From individual date-time components.
* From an existing date/time object.
They work as follows.
### From strings
Time data often comes as strings. You've seen one approach to parsing strings into date-times in [date-times](#readr-datetimes). Another approach is to use the helpers provided by lubridate. They automatically work out the format once you specify the order the date components. To use them, identify the order in which the year, month, and day appears in your dates, then arrange "y", "m", and "d" in the same order. That gives you the name of the lubridate function that will parse your date. For example:
Date/time data often comes as strings. You've seen one approach to parsing strings into date-times in [date-times](#readr-datetimes). Another approach is to use the helpers provided by lubridate. They automatically work out the format once you specify the order of the components. To use them, identify the order in which year, month, and day appear in your dates, then arrange "y", "m", and "d" in the same order. That gives you the name of the lubridate function that will parse your date. For example:
```{r}
ymd("2017-01-31")
@ -88,14 +90,14 @@ ymd(20170131, tz = "UTC")
### From individual components
Sometimes instead of a single string, you'll have the individual components of the date-time spread across multiple column. This is what we have in the flights data:
Instead of a single string, sometimes you'll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:
```{r}
flights %>%
select(year, month, day, hour, minute)
```
To create a date-time from this sort of input, use `make_date()` or `make_datetime()`:
To create a date/time from this sort of input, use `make_date()` for dates, or `make_datetime()` for date-times:
```{r}
flights %>%
@ -103,7 +105,7 @@ flights %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
```
Let's do the same thing for each of the four time columns. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once I've created the date-time variables, I focus in on the variables we'll explore in the rest of the chapter.
Let's do the same thing for each of the four time columns in `flights`. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once I've created the date-time variables, I focus in on the variables we'll explore in the rest of the chapter.
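To make the modulus arithmetic concrete, here is a small sketch in base R. It assumes the times are stored as integers in HHMM form (so 517 means 5:17am); the vector `time` holds made-up values for illustration only:

```{r}
time <- c(517, 1545, 2359)  # illustrative HHMM values, not real flight data

time %/% 100  # integer division gives the hour: 5, 15, 23
time %% 100   # the remainder gives the minute: 17, 45, 59
```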
```{r}
make_datetime_100 <- function(year, month, day, time) {
@ -123,7 +125,7 @@ flights_dt <- flights %>%
flights_dt
```
With this data, I can start to visualise the distribution of departure times across the year:
With this data, I can visualise the distribution of departure times across the year:
```{r}
flights_dt %>%
@ -168,7 +170,7 @@ as_date(now())
1. What does the `tzone` argument to `today()` do? Why is it important?
1. Use lubridate to parse each of the following dates:
1. Use the appropriate lubridate function to parse each of the following dates:
```{r}
d1 <- "January 1, 2010"
@ -180,11 +182,11 @@ as_date(now())
## Date-time components
Now that you know how to get date-time data into R's date-time data structures let's explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.
Now that you know how to get date-time data into R's date-time data structures, let's explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.
### Getting components
You can pull out individual parts of the date with the acccessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year)`, `wday()` (day of the week), `hour()`, `minute()`, `second()`.
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
```{r}
datetime <- ymd_hms("2016-07-08 12:34:56")
@ -197,14 +199,14 @@ yday(datetime)
wday(datetime)
```
For `month()` and `wday()` you can set `label = TRUE` to return the abbreviate name of the month or day of the week. Set `abbr = FALSE` to return the full name.
For `month()` and `wday()` you can set `label = TRUE` to return the abbreviated name of the month or day of the week. Set `abbr = FALSE` to return the full name.
```{r}
month(datetime, label = TRUE)
wday(datetime, label = TRUE, abbr = FALSE)
```
We can `wday()` to see that more flights depart during the week than on the weekend:
We can use `wday()` to see that more flights depart during the week than on the weekend:
```{r}
flights_dt %>%
@ -240,20 +242,16 @@ ggplot(sched_dep , aes(minute, avg_delay)) +
geom_line()
```
So we do we see that pattern with the actual departure times? Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times. Always be alert for this sort of pattern whenever you data involves human judgement.
So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times. Always be alert for this sort of pattern whenever you work with data that involves human judgement!
```{r}
ggplot(sched_dep, aes(minute, n)) +
geom_line()
```
What we're probably seeing is the impact of flights scheduled to leave on the hour or half past the hour leaving a few minutes early.
### Rounding
An alternative approach to plotting individual components is to round the date to a nearby unit of time, using `floor_date()`, `round_date()`, and `ceiling_date()`. Each function takes a vector of dates to adjust and then the name of the unit round down (floor), round up (ceiling), or round to
This allows us to, for example, plot the number of flights per week:
An alternative approach to plotting individual components is to round the date to a nearby unit of time, with `floor_date()`, `round_date()`, and `ceiling_date()`. Each function takes a vector of dates to adjust and then the name of the unit to round down (floor), round up (ceiling), or round to. This, for example, allows us to plot the number of flights per week:
```{r}
flights_dt %>%
@ -302,7 +300,7 @@ flights_dt %>%
geom_freqpoly(binwidth = 300)
```
Setting the larger component of a date to a constant is a powerful technique that allows you to explore patterns in the smaller components.
Setting larger components of a date to a constant is a powerful technique that allows you to explore patterns in the smaller components.
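As a small illustration of this technique, the sketch below uses lubridate's `update()` to set the day of the year to 1 while leaving the time of day untouched. It assumes lubridate is installed; the date-time is an arbitrary example value:

```{r}
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")

# Move the date-time to January 1 of its own year, keeping the time of day
update(datetime, yday = 1)
```

Applied to a whole column, this collapses every date onto the same day, so that only the variation in time of day remains.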
### Exercises
@ -353,7 +351,7 @@ A difftime class object records a time span of seconds, minutes, hours, days, or
as.duration(h_age)
```
Durations also come with a bunch of convenient constructors:
Durations come with a bunch of convenient constructors:
```{r}
dseconds(15)
@ -364,7 +362,7 @@ dweeks(3)
dyears(1)
```
Durations always record the time space in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds at the standard rate (60 seconds in a minute, 60 minutes in an hour, 24 hours in day, 7 days in a week, 365 days in a year).
Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds at the standard rate (60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, 365 days in a year).
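One way to check these conversion rates is to convert a few durations back to plain numbers. This is a minimal sketch, assuming lubridate is loaded and relying on `as.numeric()` returning the length of a duration in seconds:

```{r}
library(lubridate)

# as.numeric() on a duration gives its length in seconds
as.numeric(dminutes(1))  # 60
as.numeric(dhours(1))    # 60 * 60 = 3600
as.numeric(ddays(1))     # 24 * 3600 = 86400
as.numeric(dweeks(1))    # 7 * 86400 = 604800
```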
You can add and multiply durations:
@ -389,11 +387,11 @@ one_pm
one_pm + ddays(1)
```
Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if add a full days worth of seconds we end up with a different hour.
Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, March 12 only has 23 hours, so if we add a full day's worth of seconds we end up with a different time.
### Periods
To solve this problem, lubridate provides __periods__. Periods are time spans that are work with "human" times, like days, and months. Periods don't have a fixed length in seconds, which lets them work in a more intuitive way:
To solve this problem, lubridate provides __periods__. Periods are time spans but don't have a fixed length in seconds; instead they work with "human" times, like days and months. That allows them to work in a more intuitive way:
```{r}
one_pm
@ -483,9 +481,9 @@ To find out how many periods fall into an interval, you need to use integer divi
How do you pick between durations, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
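One way to see the difference between the three classes is to apply each of them to the same starting point. The sketch below is only an illustration under stated assumptions: it picks 2016 (a leap year) as the example and works with dates rather than date-times, so no time zone is involved. The variable name `start` is arbitrary.

```{r}
library(lubridate)

start <- ymd("2016-01-01")

# Duration: a fixed number of seconds, so 365 days lands on December 31
start + ddays(365)

# Period: "one year later" in human terms, regardless of the leap day
start + years(1)

# Interval: how many days does this particular year actually contain?
(start %--% (start + years(1))) / ddays(1)
```

Because 2016 contains February 29, the duration falls one day short of the new year, the period lands exactly on 2017-01-01, and the interval reports 366 days.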
The following diagram summarises the interelationships between the different data types:
Figure \@ref(fig:dt-algebra) summarises permitted arithmetic operations between the different data types.
```{r, echo = FALSE}
```{r dt-algebra, echo = FALSE, fig.cap = "The allowed arithmetic operations between pairs of date/time classes."}
knitr::include_graphics("diagrams/datetimes-arithmetic.png")
```
@ -503,15 +501,15 @@ knitr::include_graphics("diagrams/datetimes-arithmetic.png")
1. Write a function that given your birthday (as a date), returns
how old you are in years.
1. Why can't `(today() %--% next_year) / months(1)` work?
1. Why can't `(today() %--% (today() + years(1))) / months(1)` work?
## Time zones
Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.
The first challange is that everyday names of time zones tend to be ambiguous. For example, if you're American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme "<area>/<location>", typically in the form "<continent>/<city>" (there are a few exceptions because not every country lies on a continent). Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".
The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you're American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme "<area>/<location>", typically in the form "\<continent\>/\<city\>" (there are a few exceptions because not every country lies on a continent). Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".
You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades worth of data. In the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that name needs to reflect not only to the current behaviour, but also the complete history. For example, there are time zones for both "America/New_York" and "America/Detroit". These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located), did not follow DST, so it needs a different name. It's worth reading the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades' worth of time zone rules. In the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behaviour, but also the complete history. For example, there are time zones for both "America/New_York" and "America/Detroit". These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name. It's worth reading the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
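If you want to check whether a particular name exists in the database, base R's `OlsonNames()` returns the time zone names known to your system. This is a minimal sketch; the exact list depends on the time zone database installed on your machine, so no output is shown:

```{r}
# How many time zone names does this system know about?
length(OlsonNames())

# A quick look at the naming scheme
head(OlsonNames())

# Check that the names used in the text are valid on this system
c("America/New_York", "America/Detroit") %in% OlsonNames()
```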
You can find out what R thinks your current time zone is with `Sys.timezone()`:
@ -534,20 +532,14 @@ In R, the time zone is an attribute of the date-time that only controls printing
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
```
You can verify that they're the same time with subtraction:
You can verify that they're the same time using subtraction:
```{r}
x1 - x2
x1 - x3
```
Unless other specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Meridian Time). It does not have DST, which makes a convenient representation for computation.
```{r}
ymd_hms("2015-06-01 12:00:00")
```
Operations that combine date-times, like `c()`, will often drop the time zone. In that case, the date-times will display in your local time zone:
Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like `c()`, will often drop the time zone. In that case, the date-times will display in your local time zone:
```{r}
x4 <- c(x1, x2, x3)