This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in everyday life, and generally don't have too many problems. However, the more you learn about dates and times, the more complicated they get. For example:
* Does every year have 365 days?
* Does every day have 24 hours?
* Does every minute have 60 seconds?
I'm sure you remembered that there are leap years that have 366 days (but do you know the full rule for determining if a year is a leap year?). You might have remembered that many parts of the world use daylight savings time, so that some days have 23 hours, and others have 25. You probably didn't know that some minutes have 61 seconds because occasionally leap seconds are added to keep things in sync. Read <http://www.creativedeletion.com/2015/01/28/falsehoods-programmers-date-time-zones.html> for even more things that you probably believe that are not true.
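For the curious, the full Gregorian rule is: a year is a leap year if it is divisible by 4, except that century years must also be divisible by 400. Here is a minimal sketch in base R; `is_leap_year()` is a name I made up for illustration, not a function from any package.

```r
# Hypothetical helper: TRUE when `year` is a leap year in the Gregorian calendar.
is_leap_year <- function(year) {
  # Divisible by 4 but not by 100, or divisible by 400
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}

# 1900 is not a leap year (century not divisible by 400); 2000 is.
is_leap_year(c(1900, 2000, 2015, 2016))
```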
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of cultural phenomena including months and time zones. This chapter won't teach you everything about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
This chapter will focus on the __lubridate__ package, which makes it easier to work with dates and times in R. We will use nycflights13 for practice data, and some packages for EDA.
* A __time__, the number of seconds. A date + a time = a date-time. Not
  discussed further in this chapter. `<time>`
When I want to talk about them collectively I'll use date/times.
If you can use a date, you should. Doing so avoids all the time zone issues you'll learn about later on.
Note that historical dates (before ~1800) are tricky because the world hadn't yet agreed on a standard calendar. Time zones prior to 1970 are hard because the data is not available. If you're working with historical dates/times you'll need to think this through carefully.
There are three ways you are likely to create a date/time:
* From a character vector
* From numeric vectors of each component
* From an existing date/time object
There are two special dates/times that are often useful:
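In lubridate, the current date and the current instant are available from `today()` and `now()`. A small sketch, assuming lubridate is installed (the values printed depend on when you run it):

```r
library(lubridate)

today()  # the current date, as a Date
now()    # the current instant, as a POSIXct date-time
```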
Time data normally comes as character strings. You've seen one approach to parsing date-times with the readr package, in [date-times](#readr-datetimes). Another approach is to use the lubridate helpers. These automatically work out the format once you tell them the order of the day, month, and year components. To use them, identify the order in which the year, month, and day appear in your dates. Now arrange "y", "m", and "d" in the same order. This is the name of the function in lubridate that will parse your dates. For example:
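Here is a small sketch of three of these helpers. The input strings are made up for illustration; each call should parse to the same date, January 31, 2017 (month names are matched using your locale, so I'm assuming an English one).

```r
library(lubridate)

ymd("2017-01-31")          # year, month, day
mdy("January 31st, 2017")  # month, day, year
dmy("31-Jan-2017")         # day, month, year
```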
If you have a date-time that also contains hours, minutes, or seconds, add an underscore and then one or more of "h", "m", and "s" to the name of the parsing function.
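For instance, with made-up input strings:

```r
library(lubridate)

ymd_hms("2017-01-31 20:11:59")  # date plus hours, minutes, seconds
mdy_hm("01/31/2017 08:01")      # date plus hours and minutes only
```

Both calls return a date-time rather than a date.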
Let's do the same thing for every date-time column in `flights`. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once that's done, we can drop the old `year`, `month`, `day`, `hour`, and `minute` columns. I've rearranged the variables a bit so they print nicely.
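The modulus arithmetic works like this: a time stored as the number `517` means 5:17, so integer division by 100 gives the hour and the remainder gives the minute. Below is a minimal sketch on stand-alone values rather than the full `flights` table; the helper name `make_datetime_100()` is just an illustrative choice.

```r
library(lubridate)

# Illustrative helper: build a date-time from date components plus a time
# stored as a number in HHMM form (e.g. 517 means 05:17).
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

make_datetime_100(2013, 1, 1, c(517, 1545))
```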
Now that we have the scheduled arrival and departure times as date-times, let's look at the patterns. We could plot a histogram of flights throughout the year:
The underlying units are important to know whenever you use a date-time in a numeric context. For example, the `binwidth` of a histogram gives the number of seconds for a date-time, and the number of days for a date. Likewise, adding an integer to a date-time adds that many seconds, while adding an integer to a date adds that many days.
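You can check the units with a couple of lines of base R. This sketch uses an arbitrary date and sets the time zone to UTC so the result doesn't depend on your machine:

```r
d  <- as.Date("2016-01-01")
dt <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")

d + 1   # a Date moves forward by one day
dt + 1  # a date-time moves forward by one second

as.numeric(d)   # days since 1970-01-01
as.numeric(dt)  # seconds since 1970-01-01 00:00:00 UTC
```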
Let's instead group flights by day of the week, to see which week days are the busiest, and by hour to see which times of the day are busiest. To do this we will need to extract the day of the week and hour that each flight was scheduled to depart.
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, `second()`.
For both `month()` and `wday()` you can set `label = TRUE` to return the name of the month or day of the week. By default the name is abbreviated (`abbr = TRUE`), which can be helpful in plots; set `abbr = FALSE` to return the full name.
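Here are the accessors applied to a single made-up date-time. The numeric results are shown in comments; the labels depend on your locale, and `wday()` assumes lubridate's default of weeks starting on Sunday.

```r
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")

year(datetime)   # 2016
month(datetime)  # 7
mday(datetime)   # 8
yday(datetime)   # 190
wday(datetime)   # 6 (a Friday, counting from Sunday = 1)
hour(datetime)   # 12

month(datetime, label = TRUE)               # abbreviated month name
wday(datetime, label = TRUE, abbr = FALSE)  # full weekday name
```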
The `hour()` accessor reveals that scheduled departures follow a bimodal distribution throughout the day. There is a morning and evening peak in departures.
When should you depart if you want to minimize your chance of delay? The results are striking. On average, flights that left on a Saturday arrived ahead of schedule.
There's an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
So why do we see such a strong pattern in the delays of actual departure times? Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times:
An alternative approach to plotting individual components is to round the date, using `floor_date()`, `round_date()`, and `ceiling_date()` to round (or move) a date to a nearby unit of time. Each function takes a vector of dates to adjust and then the name of the time unit to floor, ceiling, or round them to.
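A short sketch of the three rounding functions on a made-up date-time:

```r
library(lubridate)

x <- ymd_hms("2016-07-08 12:34:56")

floor_date(x, "hour")     # 2016-07-08 12:00:00, rounds down
round_date(x, "hour")     # 2016-07-08 13:00:00, rounds to the nearest hour
ceiling_date(x, "month")  # 2016-08-01 00:00:00, rounds up
```

Flooring every departure time to the week, for example, would let you count flights per week instead of per day.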
A difftime class object records a span of time in one of seconds, minutes, hours, days, or weeks. R creates a difftime whenever you subtract two dates or two date-times.
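For example, subtracting two arbitrary dates in base R gives a difftime measured in days. Asking for an explicit unit when you convert it to a number avoids any ambiguity:

```r
x <- as.Date("2016-03-13") - as.Date("2016-03-01")

class(x)  # "difftime"
units(x)  # "days"

# Convert to a plain number in a unit of your choosing
as.numeric(x, units = "days")   # 12
as.numeric(x, units = "hours")  # 288
```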
Difftimes come with base R, but they have some rough edges. For example, the value of a difftime depends on the difftime's units attribute. If this attribute is dropped, as it is when you combine difftimes with `c()`, the value becomes uninterpretable. Consider what happens when I combine these two difftimes that have the same length.
Durations behave like difftimes, but are a little more user friendly. To make a duration, choose a unit of time, make it plural, and then place a "d" in front of it. This is the name of the function in lubridate that will make your duration. For example, `dminutes()` makes a duration measured in minutes.
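A few of these constructors in action, with arbitrary amounts. Each takes a numeric vector and returns a duration stored as a number of seconds:

```r
library(lubridate)

dseconds(15)
dminutes(10)       # 600 seconds
dhours(c(12, 24))  # vectorised: two durations
ddays(0:2)
dweeks(3)          # 1814400 seconds
```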
Durations always contain a time span measured in seconds. Larger units are estimated by converting minutes, hours, days, weeks, and years to seconds at the standard rate. This makes durations very precise, but it can lead to unexpected results when the timeline is non-contiguous, as it is during daylight savings transitions.
Technically, the timeline also misbehaves during __leap seconds__, extra seconds that are added to the timeline to account for changes in the Earth's movement. In practice, most operating systems ignore leap seconds, and R follows the behavior of the operating system. If you are curious about when leap seconds occur, R lists them under `.leap.seconds`.
You can use lubridate's period class to handle irregularities in the timeline. Periods are time spans that are generalized to work with clock times, the "name" of a date-time that you would see on a clock, like "2016-03-13 00:00:00." Periods have no fixed length, which lets them work in an intuitive, human friendly way. When you add a one day period to "2000-03-13 00:00:00" the result will be "2000-03-14 00:00:00" whether there were 86400 seconds in March 13, 2000 or 82800 seconds (due to Daylight Savings Time).
To make a period object, take the name of the unit you wish to use, make it plural, and call the function of that name with the number of units as its argument.
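For example, with arbitrary amounts:

```r
library(lubridate)

seconds(15)
minutes(10)
hours(c(12, 24))  # vectorised: two periods
days(7)
weeks(3)          # stored as 21 days
months(1:3)
```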
The period always returns the "expected" clock time, as if the irregularity had not happened. The duration always returns the time that is exactly 86,400 seconds (in the case of a day) or 31,536,000 seconds later (in the case of a year).
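To see the difference concretely, here is a sketch that adds one day, first as a duration and then as a period, to 1pm on the day before the 2016 spring daylight savings transition in New York (the date and time zone are chosen purely for illustration):

```r
library(lubridate)

one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

# Duration: exactly 86,400 seconds later, which lands at 2pm on the clock
one_pm + ddays(1)

# Period: the same clock time on the next day
one_pm + days(1)
```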
Periods also let you model date-times that recur on a monthly basis in a way that would be impossible with durations. Consider that some of the months below have 31 days, some have 30, and one has 29.
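One way to generate the first day of every month in 2016 (a leap year, so February has 29 days) is to add a vector of month periods to January 1. For contrast, the sketch also adds multiples of a 30-day duration, which drifts away from the first of the month:

```r
library(lubridate)

jan1 <- ymd("2016-01-01")

# Periods: always the 1st, whatever the length of each month
firsts <- jan1 + months(0:11)
mday(firsts)

# Durations of 30 days each: the day of the month drifts
drifting <- jan1 + ddays(30 * 0:11)
mday(drifting)
```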
Let's use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination _before_ they departed from New York City.
These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding `days(1)` to the arrival time of each overnight flight.
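Here is the idea on two made-up flights rather than the full `flights` table. The comparison `arr_time < dep_time` flags the overnight flight, and multiplying the logical by 1 turns it into 0 or 1 days to add:

```r
library(lubridate)

# Two illustrative flights; the second one lands after midnight
dep_time <- ymd_hms(c("2013-01-01 20:30:00", "2013-01-01 23:10:00"))
arr_time <- ymd_hms(c("2013-01-01 22:45:00", "2013-01-01 01:25:00"))

overnight <- arr_time < dep_time

# Add one day only where the flight is overnight (TRUE * 1 = 1, FALSE * 1 = 0)
arr_time <- arr_time + days(overnight * 1)
arr_time
```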
It's obvious what `dyears(1) / ddays(365)` should return. It should return one because durations are always represented by seconds, and a duration of a year is defined as 365 days worth of seconds.
What should `years(1) / days(1)` return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There's not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate, with a warning:
If you want a more accurate measurement, you'll have to use an __interval__ instead of a duration. An interval is a duration with a starting point - that makes it precise so you can determine exactly how long it is:
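As a sketch, here are two one-year intervals with explicit start dates, one built with `interval()` and one with the equivalent `%--%` operator. Dividing each by a one-day duration counts the days it actually contains:

```r
library(lubridate)

leap_year   <- interval(ymd("2016-01-01"), ymd("2017-01-01"))
common_year <- ymd("2015-01-01") %--% ymd("2016-01-01")

leap_year / ddays(1)    # 366
common_year / ddays(1)  # 365
```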
Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.
The first challenge is that the names of time zones that you're familiar with are not very general. For example, if you're an American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have Eastern standard times which mean different things! To avoid confusion R uses the international standard IANA time zones. These don't have a terribly consistent naming scheme, but tend to fall in one of three camps:
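A quick way to get a feel for the naming scheme is to look at the names themselves. Base R reports the time zone your session is using with `Sys.timezone()` and lists every name it knows with `OlsonNames()`; the exact list depends on the time zone database installed on your system.

```r
Sys.timezone()         # the time zone of the current session

length(OlsonNames())   # several hundred names
head(OlsonNames())     # the first few, alphabetically

# Check that a name is valid before you use it
"America/New_York" %in% OlsonNames()
```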
An additional complication of time zones is daylight savings time (DST): many time zones shift by an hour during summer time. For example, the same instants may be the same time or different times in Denver and Phoenix over the course of the year:
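A sketch of that comparison, using two arbitrary instants (noon UTC on a winter day and on a summer day) and lubridate's `with_tz()` to display them in each city. Denver observes DST and Phoenix does not, so the clocks agree in January but differ by an hour in July:

```r
library(lubridate)

x <- ymd_hms(c("2016-01-01 12:00:00", "2016-07-01 12:00:00"), tz = "UTC")

denver  <- with_tz(x, "America/Denver")
phoenix <- with_tz(x, "America/Phoenix")

hour(denver)   # 5 in January, 6 in July
hour(phoenix)  # 5 in both months
```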
This also creates a challenge for determining how much time has elapsed between two date-times. Lubridate offers a solution for this: the __interval__, which you can coerce into either a duration or a period:
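As an illustration, take the interval from noon to noon across the 2016 spring transition in New York, a day on which the clocks skip an hour. The duration measures the physical time elapsed, while the period describes the change in clock time:

```r
library(lubridate)

start <- ymd_hms("2016-03-12 12:00:00", tz = "America/New_York")
end   <- ymd_hms("2016-03-13 12:00:00", tz = "America/New_York")
span  <- interval(start, end)

as.duration(span)  # 82800 seconds, i.e. 23 hours
as.period(span)    # the span expressed in clock units
```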
Operations that drop attributes, such as `c()`, will drop the time zone attribute from your date-times. In that case, the date-times will display in your local time zone:
If you do not set the time zone, lubridate will automatically assign the date-time to Coordinated Universal Time (UTC). Coordinated Universal Time is the standard time zone used by the scientific community and roughly equates to its predecessor, Greenwich Mean Time. Since Coordinated Universal Time does not follow Daylight Savings Time, it is straightforward to work with times saved in this time zone.
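You can check which time zone a date-time carries with lubridate's `tz()`. In this sketch the first call relies on the UTC default and the second sets the zone explicitly through the parser's `tz` argument:

```r
library(lubridate)

x <- ymd_hms("2016-01-01 12:00:00")
tz(x)  # "UTC"

y <- ymd_hms("2016-01-01 12:00:00", tz = "America/New_York")
tz(y)  # "America/New_York"

# Same clock time, different instants: noon in New York is 5 hours after noon UTC
as.numeric(y) - as.numeric(x)  # 18000 seconds
```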