Polish date times

This commit is contained in:
Hadley Wickham 2022-08-11 17:32:49 -05:00
parent b15eecf8b3
commit 5ad1eca391
3 changed files with 128 additions and 82 deletions


@ -5,6 +5,9 @@
#| echo: false
source("_common.R")
status("polishing")
# https://github.com/tidyverse/lubridate/issues/1058
options(warnPartialMatchArgs = FALSE)
```
## Introduction
@ -13,15 +16,14 @@ This chapter will show you how to work with dates and times in R.
At first glance, dates and times seem simple.
You use them all the time in your regular life, and they don't seem to cause much confusion.
However, the more you learn about dates and times, the more complicated they seem to get.
To warm up, think about how many days there are in a year, and how many hours there are in a day.
We're sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year?
(It has three parts.) You might have remembered that many parts of the world use daylight saving time (DST), so that some days have 23 hours, and others have 25.
You might not have known that some minutes have 61 seconds, because every now and then leap seconds are added to account for the Earth's rotation gradually slowing down.
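If you want to check your answer, the three-part rule is simple enough to sketch in base R (our illustration, not part of the chapter's original code):

```{r}
# A year is a leap year if it's divisible by 4, unless it's also
# divisible by 100, except if it's also divisible by 400
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}
is_leap(c(2000, 1900, 2024, 2025))
```

In every set of 400 years there are 97 leap years.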
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST.
This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
@ -34,7 +36,6 @@ We will also need nycflights13 for practice data.
```{r}
#| message: false
library(tidyverse)
@ -53,9 +54,9 @@ There are three types of date/time data that refer to an instant in time:
- A **date-time** is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second).
Tibbles print this as `<dttm>`.
Base R calls these POSIXct, but that doesn't exactly trip off the tongue.
In this chapter we are going to focus on dates and date-times as R doesn't have a native class for storing times.
If you need one, you can use the **hms** package.
You should always use the simplest possible data type that works for your needs.
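If you're curious how these classes look without lubridate, base R's `Sys.Date()` and `Sys.time()` return the current date and date-time (a quick sketch; the chapter itself uses lubridate's friendlier helpers):

```{r}
# The current date is a Date; the current date-time is a POSIXct
class(Sys.Date())
class(Sys.time())
```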
@ -93,14 +94,6 @@ mdy("January 31st, 2017")
dmy("31-Jan-2017")
```
`ymd()` and friends create dates.
To create a date-time, add an underscore and one or more of "h", "m", and "s" to the name of the parsing function:
@ -112,7 +105,7 @@ mdy_hm("01/31/2017 08:01")
You can also force the creation of a date-time from a date by supplying a timezone:
```{r}
ymd("2017-01-31", tz = "UTC")
```
### From individual components
@ -155,9 +148,17 @@ flights_dt <- flights |>
flights_dt
```
With this data, we can visualize the distribution of departure times across the year:
```{r}
#| fig.alt: >
#| A frequency polygon with departure time (Jan-Dec 2013) on the x-axis
#| and number of flights on the y-axis (0-1000). The frequency polygon
#| is binned by day so you see a time series of flights by day. The
#| pattern is dominated by a weekly pattern; there are fewer flights
#| on weekends. There are a few days that stand out as having surprisingly
#| few flights in early February, early July, late November, and late
#| December.
flights_dt |>
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
@ -166,6 +167,12 @@ flights_dt |>
Or within a single day:
```{r}
#| fig.alt: >
#| A frequency polygon with departure time (6am - midnight Jan 1) on the
#| x-axis, number of flights on the y-axis (0-17), binned into 10 minute
#| increments. It's hard to see much pattern because of high variability,
#| but most bins have 8-12 flights, and there are markedly fewer flights
#| before 6am and after 8pm.
flights_dt |>
filter(dep_time < ymd(20130102)) |>
ggplot(aes(dep_time)) +
@ -227,7 +234,7 @@ The next section will look at how arithmetic works with date-times.
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
```{r}
datetime <- ymd_hms("2026-07-08 12:34:56")
year(datetime)
month(datetime)
@ -248,6 +255,12 @@ wday(datetime, label = TRUE, abbr = FALSE)
We can use `wday()` to see that more flights depart during the week than on the weekend:
```{r}
#| fig-alt: >
#| A bar chart with days of the week on the x-axis and number of
#| flights on the y-axis. Monday-Friday have roughly the same number of
#| flights, ~48,000, decreasing slightly over the course of the week.
#| Sunday is a little lower (~45,000), and Saturday is much lower
#| (~38,000).
flights_dt |>
mutate(wday = wday(dep_time, label = TRUE)) |>
ggplot(aes(x = wday)) +
@ -258,6 +271,13 @@ There's an interesting pattern if we look at the average departure delay by minu
It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
```{r}
#| fig-alt: >
#| A line chart with minute of actual departure (0-60) on the x-axis and
#| average delay (4-20) on the y-axis. Average delay starts at (0, 12),
#| steadily increases to (18, 20), then sharply drops, hitting a minimum
#| of ~9 minutes of delay at ~23 minutes past the hour. It then increases
#| again to (35, 17), and sharply decreases to (55, 4). It finishes off
#| with an increase to (60, 9).
flights_dt |>
mutate(minute = minute(dep_time)) |>
group_by(minute) |>
@ -271,6 +291,11 @@ flights_dt |>
Interestingly, if we look at the *scheduled* departure time we don't see such a strong pattern:
```{r}
#| fig-alt: >
#| A line chart with minute of scheduled departure (0-60) on the x-axis
#| and average delay (4-16). There is relatively little pattern, just a
#| small suggestion that the average delay decreases from maybe 10 minutes
#| to 8 minutes over the course of the hour.
sched_dep <- flights_dt |>
mutate(minute = minute(sched_dep_time)) |>
group_by(minute) |>
@ -287,6 +312,12 @@ Well, like much data collected by humans, there's a strong bias towards flights
Always be alert for this sort of pattern whenever you work with data that involves human judgement!
```{r}
#| fig-alt: >
#| A line plot with departure minute (0-60) on the x-axis and number of
#| flights (0-60000) on the y-axis. Most flights are scheduled to depart
#| on either the hour (~60,000) or the half hour (~35,000). Otherwise,
#| almost all flights are scheduled to depart on multiples of five,
#| with a few extra at 15, 45, and 55 minutes.
ggplot(sched_dep, aes(minute, n)) +
geom_line()
```
@ -298,22 +329,55 @@ Each function takes a vector of dates to adjust and then the name of the unit ro
This, for example, allows us to plot the number of flights per week:
```{r}
#| fig-alt: >
#| A line plot with week (Jan-Dec 2013) on the x-axis and number of
#| flights (2,000-7,000) on the y-axis. The pattern is fairly flat from
#| February to November with around 7,000 flights per week. There are
#| far fewer flights on the first (approximately 4,500 flights) and last
#| weeks of the year (approximately 2,500 flights).
flights_dt |>
count(week = floor_date(dep_time, "week")) |>
ggplot(aes(week, n)) +
geom_line() +
geom_point()
```
You can use rounding to show the distribution of flights across the course of a day by computing the difference between `dep_time` and the earliest instant of that day:
```{r}
#| fig-alt: >
#| A line plot with departure time on the x-axis, in units of seconds
#| since midnight, so it's hard to interpret.
flights_dt |>
mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |>
ggplot(aes(dep_hour)) +
geom_freqpoly(binwidth = 60 * 30)
```
Computing the difference between a pair of date-times yields a difftime (more on that in @sec-intervals). We can convert that to an `hms` object to get a more useful x-axis:
```{r}
#| fig-alt: >
#| A line plot with departure time (midnight to midnight) on the x-axis
#| and number of flights on the y-axis (0 to 15,000). There are very few
#| (<100) flights before 5am. The number of flights then rises rapidly
#| to 12,000 / hour, peaking at 15,000 at 9am, before falling to around
#| 8,000 / hour for 10am to 2pm. Number of flights then increases to
#| around 12,000 per hour until 8pm, when they rapidly drop again.
flights_dt |>
mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |>
ggplot(aes(dep_hour)) +
geom_freqpoly(binwidth = 60 * 30)
```
### Modifying components
You can also use each accessor function to modify the components of a date/time:
```{r}
(datetime <- ymd_hms("2026-07-08 12:34:56"))
year(datetime) <- 2030
datetime
month(datetime) <- 01
datetime
@ -321,33 +385,20 @@ hour(datetime) <- hour(datetime) + 1
datetime
```
Alternatively, rather than modifying an existing variable, you can create a new date-time with `update()`.
This also allows you to set multiple values in one step:
```{r}
update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
```
If values are too big, they will roll-over:
```{r}
update(ymd("2023-02-01"), mday = 30)
update(ymd("2023-02-01"), hour = 400)
```
### Exercises
1. How does the distribution of flight times within a day change over the course of the year?
@ -386,7 +437,7 @@ In R, when you subtract two dates, you get a difftime object:
```{r}
# How old is Hadley?
h_age <- today() - ymd("1979-10-14")
h_age
```
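The unit a difftime reports depends on the magnitude of the gap; base R's `difftime()` lets you pick one explicitly (a sketch with arbitrary dates):

```{r}
# Subtraction picks a convenient unit automatically...
as.Date("2026-01-01") - as.Date("2025-01-01")
# ...but difftime() lets you request a specific one
difftime(as.Date("2026-01-01"), as.Date("2025-01-01"), units = "hours")
```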
@ -431,15 +482,15 @@ last_year <- today() - dyears(1)
However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:
```{r}
one_pm <- ymd_hms("2026-03-07 13:00:00", tz = "America/New_York")
one_pm
one_pm + ddays(1)
```
Why is one day after 1pm March 7, 2pm March 8?!
If you look carefully at the date you might also notice that the time zones have changed.
March 8 only has 23 hours because it's when DST starts, so if we add a full day's worth of seconds we end up with a different time.
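You can confirm the 23-hour day directly in base R (assuming your time zone database has the current US rules; DST starts on 2026-03-08 in New York):

```{r}
# The spring-forward gap leaves only 23 hours between these two midnights
dst_start <- as.POSIXct("2026-03-08 00:00:00", tz = "America/New_York")
next_day <- as.POSIXct("2026-03-09 00:00:00", tz = "America/New_York")
difftime(next_day, dst_start, units = "hours")
```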
### Periods
@ -455,13 +506,9 @@ one_pm + days(1)
Like durations, periods can be created with a number of friendly constructor functions.
```{r}
seconds(15)
minutes(10)
hours(c(12, 24))
days(7)
months(1:6)
weeks(3)
years(1)
```
You can add and multiply periods:
@ -476,8 +523,8 @@ Compared to durations, periods are more likely to do what you expect:
```{r}
# A leap year
ymd("2024-01-01") + dyears(1)
ymd("2024-01-01") + years(1)
# Daylight saving time
one_pm + ddays(1)
@ -500,7 +547,7 @@ We can fix this by adding `days(1)` to the arrival time of each overnight flight
flights_dt <- flights_dt |>
mutate(
overnight = arr_time < dep_time,
arr_time = arr_time + days(if_else(overnight, 1, 0)),
sched_arr_time = sched_arr_time + days(overnight * 1)
)
```
@ -512,7 +559,7 @@ flights_dt |>
filter(overnight, arr_time < dep_time)
```
### Intervals
### Intervals {#sec-intervals}
It's obvious what `dyears(1) / ddays(365)` should return: one, because durations are always represented by a number of seconds, and a duration of a year is defined as 365 days worth of seconds.
@ -531,15 +578,18 @@ An interval is a pair of starting and ending date times, or you can think of it
You can create an interval by writing `start %--% end`:
```{r}
y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01")
y2023
y2024
```
You could then divide it by `days()` to find out how many days fit in the year:
```{r}
y2023 / days(1)
y2024 / days(1)
```
### Summary
@ -548,17 +598,6 @@ How do you pick between duration, periods, and intervals?
As always, pick the simplest data structure that solves your problem.
If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
### Exercises
1. Explain `days(overnight * 1)` to someone who has just started learning R.
@ -576,17 +615,19 @@ knitr::include_graphics("diagrams/datetimes-arithmetic.png")
Time zones are an enormously complicated topic because of their interaction with geopolitical entities.
Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.
<!--# https://www.ietf.org/timezones/tzdb-2018a/theory.html -->
The first challenge is that everyday names of time zones tend to be ambiguous.
For example, if you're American you're probably familiar with EST, or Eastern Standard Time.
However, both Australia and Canada also have EST!
To avoid confusion, R uses the international standard IANA time zones.
These use a consistent naming scheme `{area}/{location}`, typically in the form `{continent}/{city}` or `{ocean}/{city}`.
Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".
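Base R bundles a copy of this database, and `OlsonNames()` lists every zone name it knows, which is handy for looking up exact spellings:

```{r}
# All IANA time zone names known to this R installation
length(OlsonNames())
# Find the Pacific zones
head(grep("^Pacific/", OlsonNames(), value = TRUE))
```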
You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country.
This is because the IANA database has to record decades' worth of time zone rules.
Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same.
Another problem is that the name needs to reflect not only the current behavior, but also the complete history.
For example, there are time zones for both "America/New_York" and "America/Detroit".
These cities both currently use Eastern Standard Time, but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name.
It's worth browsing the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
@ -610,9 +651,14 @@ In R, the time zone is an attribute of the date-time that only controls printing
For example, these three objects represent the same instant in time:
```{r}
x1 <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
x1
x2 <- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")
x2
x3 <- ymd_hms("2024-06-02 04:00:00", tz = "Pacific/Auckland")
x3
```
You can verify that they're the same time using subtraction:
@ -623,7 +669,7 @@ x1 - x3
```
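Another way to see that the zone only affects printing: base `format()` can re-render a single instant in any zone (a sketch with an arbitrary instant):

```{r}
# One instant, printed two ways; the underlying time is unchanged
x <- as.POSIXct("2026-06-01 12:00:00", tz = "UTC")
format(x, tz = "America/New_York", usetz = TRUE)
```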
Unless otherwise specified, lubridate always uses UTC.
UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time).
It does not have DST, which makes it a convenient representation for computation.
Operations that combine date-times, like `c()`, will often drop the time zone.
In that case, the date-times will display in your local time zone:
