There are four important verbs that affect the columns without changing the rows: mutate(), select(), rename(), and relocate(). mutate() creates new columns that are functions of the existing columns; select(), rename(), and relocate() change which columns are present, their names, or their positions.
mutate()
The job of mutate() is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
#> # A tibble: 336,776 × 21
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 1 1 517 515 2 830 819 11 UA
#> 2 2013 1 1 533 529 4 850 830 20 UA
#> 3 2013 1 1 542 540 2 923 850 33 AA
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
#> 6 2013 1 1 554 558 -4 740 728 12 UA
#> # … with 336,770 more rows, 11 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>, and abbreviated
#> # variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#> # ⁵arr_delay
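To make the arithmetic easier to follow, here is a minimal sketch on a two-row toy table. The name toy_flights and all of its values are made up for illustration (they are not taken from the flights data), and the units in the comments are assumptions that mirror the description above.

```r
library(dplyr)

# Hypothetical stand-in for a few flights columns; values are illustrative only.
toy_flights <- tibble(
  dep_delay = c(2, -6),      # minutes late leaving (negative means early)
  arr_delay = c(11, -25),    # minutes late arriving
  distance  = c(1400, 762),  # miles
  air_time  = c(227, 116)    # minutes
)

toy_result <- toy_flights |>
  mutate(
    gain  = dep_delay - arr_delay,     # minutes made up in the air
    speed = distance / air_time * 60   # miles per minute scaled to miles per hour
  )

toy_result$gain   # 2 - 11 = -9 and -6 - (-25) = 19
```

A negative gain means the flight lost further time in the air; a positive gain means it made time up.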
By default, mutate() adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening in the flights output. We can use the .before argument to instead add the variables to the left hand side (remember that in RStudio, the easiest way to see a dataset with many columns is View()):
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1
  )
#> # A tibble: 336,776 × 21
#> gain speed year month day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
#> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 -9 370. 2013 1 1 517 515 2 830 819
#> 2 -16 374. 2013 1 1 533 529 4 850 830
#> 3 -31 408. 2013 1 1 542 540 2 923 850
#> 4 17 517. 2013 1 1 544 545 -1 1004 1022
#> 5 19 394. 2013 1 1 554 600 -6 812 837
#> 6 -16 288. 2013 1 1 554 558 -4 740 728
#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#> # ²dep_delay, ³arr_time, ⁴sched_arr_time
The . is a sign that .before is an argument to the function, not the name of a new variable. You can also use .after to add after a variable, and in both .before and .after you can use a variable name instead of a position. For example, we could add the new variables after day:
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day
  )
#> # A tibble: 336,776 × 21
#> year month day gain speed dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
#> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 -9 370. 517 515 2 830 819
#> 2 2013 1 1 -16 374. 533 529 4 850 830
#> 3 2013 1 1 -31 408. 542 540 2 923 850
#> 4 2013 1 1 17 517. 544 545 -1 1004 1022
#> 5 2013 1 1 19 394. 554 600 -6 812 837
#> 6 2013 1 1 -16 288. 554 558 -4 740 728
#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#> # ²dep_delay, ³arr_time, ⁴sched_arr_time
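The placement rules are easier to check on a table small enough to read at a glance. The following is a minimal sketch; df and its columns a, b, and c are hypothetical names used only for illustration.

```r
library(dplyr)

df <- tibble(a = 1:2, b = 3:4, c = 5:6)  # hypothetical toy data

# Place the new column by position: before the first column
by_position <- df |> mutate(total = a + b + c, .before = 1)

# Place the new column by name: directly after column a
by_name <- df |> mutate(total = a + b + c, .after = a)

names(by_position)  # columns: total, a, b, c
names(by_name)      # columns: a, total, b, c
```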
Alternatively, you can control which variables are kept with the .keep argument. A particularly useful value is "used", which keeps only the columns that were involved in or created by the mutate() step, so you can see the inputs and outputs of your calculations side by side:
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
#> # A tibble: 336,776 × 6
#> dep_delay arr_delay air_time gain hours gain_per_hour
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 11 227 -9 3.78 -2.38
#> 2 4 20 227 -16 3.78 -4.23
#> 3 2 33 160 -31 2.67 -11.6
#> 4 -1 -18 183 17 3.05 5.57
#> 5 -6 -25 116 19 1.93 9.83
#> 6 -4 12 150 -16 2.5 -6.4
#> # … with 336,770 more rows
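Besides "used", the .keep argument in dplyr also accepts "all" (the default), "unused", and "none". The sketch below compares the four on a hypothetical toy table (df, a, b, and c are illustrative names, not part of the flights data).

```r
library(dplyr)

df <- tibble(a = 1:2, b = 3:4, c = 5:6)  # hypothetical toy data

keep_all    <- df |> mutate(total = a + b, .keep = "all")     # default behaviour
keep_used   <- df |> mutate(total = a + b, .keep = "used")    # inputs plus outputs
keep_unused <- df |> mutate(total = a + b, .keep = "unused")  # drops the inputs
keep_none   <- df |> mutate(total = a + b, .keep = "none")    # outputs only

names(keep_all)     # columns: a, b, c, total
names(keep_used)    # columns: a, b, total
names(keep_unused)  # columns: c, total
names(keep_none)    # columns: total
```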
select()
It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. select() is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:
# Select columns by name
flights |>
  select(year, month, day)
#> # A tibble: 336,776 × 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
flights |>
  select(year:day)
#> # A tibble: 336,776 × 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
flights |>
  select(!year:day)
#> # A tibble: 336,776 × 16
#> dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum
#> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#> 1 517 515 2 830 819 11 UA 1545 N14228
#> 2 533 529 4 850 830 20 UA 1714 N24211
#> 3 542 540 2 923 850 33 AA 1141 N619AA
#> 4 544 545 -1 1004 1022 -18 B6 725 N804JB
#> 5 554 600 -6 812 837 -25 DL 461 N668DN
#> 6 554 558 -4 740 728 12 UA 1696 N39463
#> # … with 336,770 more rows, 7 more variables: origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#> # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Select all columns that are characters
flights |>
  select(where(is.character))
#> # A tibble: 336,776 × 4
#> carrier tailnum origin dest
#> <chr> <chr> <chr> <chr>
#> 1 UA N14228 EWR IAH
#> 2 UA N24211 LGA IAH
#> 3 AA N619AA JFK MIA
#> 4 B6 N804JB JFK BQN
#> 5 DL N668DN LGA ATL
#> 6 UA N39463 EWR ORD
#> # … with 336,770 more rows
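The same four selection styles can be checked quickly on a small table where the full result fits on one line. This is a minimal sketch; df and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with a mix of column types
df <- tibble(
  id    = 1:2,
  name  = c("x", "y"),
  score = c(0.5, 0.9),
  city  = c("A", "B")
)

by_name  <- df |> select(id, score)            # explicit names
by_range <- df |> select(id:score)             # contiguous range, inclusive
negated  <- df |> select(!id:score)            # everything outside the range
by_type  <- df |> select(where(is.character))  # columns passing a predicate

names(by_name)   # columns: id, score
names(by_range)  # columns: id, name, score
names(negated)   # columns: city
names(by_type)   # columns: name, city
```

Note that `:` binds more tightly than `!` in R, so `!id:score` negates the whole range rather than just `id`.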
There are a number of helper functions you can use within select():
- starts_with("abc"): matches names that begin with “abc”.
- ends_with("xyz"): matches names that end with “xyz”.
- contains("ijk"): matches names that contain “ijk”.
- num_range("x", 1:3): matches x1, x2 and x3.
See ?select for more details. Once you know regular expressions (covered in their own chapter), you’ll also be able to use matches() to select variables that match a pattern.
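The helpers can be tried out on a table whose column names were chosen to hit each case. The sketch below uses a hypothetical one-row table; the column names are invented for illustration, and the pattern passed to matches() is just one example of a regular expression.

```r
library(dplyr)

# Hypothetical toy data; the names are chosen to exercise each helper
df <- tibble(
  abc_1 = 1, total_xyz = 2, my_ijk_col = 3,
  x1 = 4, x2 = 5, x3 = 6, x10 = 7
)

begins  <- df |> select(starts_with("abc"))
ends    <- df |> select(ends_with("xyz"))
has     <- df |> select(contains("ijk"))
ranged  <- df |> select(num_range("x", 1:3))
pattern <- df |> select(matches("^x[0-9]+$"))  # "x" followed only by digits

names(begins)   # columns: abc_1
names(ends)     # columns: total_xyz
names(has)      # columns: my_ijk_col
names(ranged)   # columns: x1, x2, x3
names(pattern)  # columns: x1, x2, x3, x10
```

num_range() picks only the numbered columns you ask for, while the regular expression also picks up x10.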
You can rename variables as you select() them by using =. The new name appears on the left hand side of the =, and the old variable appears on the right hand side:
flights |>
  select(tail_num = tailnum)
#> # A tibble: 336,776 × 1
#> tail_num
#> <chr>
#> 1 N14228
#> 2 N24211
#> 3 N619AA
#> 4 N804JB
#> 5 N668DN
#> 6 N39463
#> # … with 336,770 more rows
rename()
If you want to keep all the existing variables and rename just a few, you can use rename() instead of select():
flights |>
  rename(tail_num = tailnum)
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 2013 1 1 517 515 2 830 819 11 UA
#> 2 2013 1 1 533 529 4 850 830 20 UA
#> 3 2013 1 1 542 540 2 923 850 33 AA
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
#> 6 2013 1 1 554 558 -4 740 728 12 UA
#> # … with 336,770 more rows, 9 more variables: flight <int>, tail_num <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
It works exactly the same way as select(), but keeps all the variables that aren’t explicitly selected.
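The difference between the two is easiest to see side by side. The sketch below applies the same new = old pair with both verbs to a hypothetical three-column table (df and its values are illustrative only).

```r
library(dplyr)

# Hypothetical toy data
df <- tibble(
  id      = 1:2,
  tailnum = c("N14228", "N24211"),
  dest    = c("IAH", "MIA")
)

selected <- df |> select(tail_num = tailnum)  # renames and drops everything else
renamed  <- df |> rename(tail_num = tailnum)  # renames and keeps everything else

names(selected)  # columns: tail_num
names(renamed)   # columns: id, tail_num, dest
```

In both cases the renamed column keeps its original position relative to the columns that survive.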
If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out janitor::clean_names(), which provides some useful automated cleaning.
relocate()
Use relocate() to move variables around. You might want to collect related variables together or move important variables to the front. By default relocate() moves variables to the front:
flights |>
  relocate(time_hour, air_time)
#> # A tibble: 336,776 × 19
#> time_hour air_time year month day dep_time sched_dep…¹ dep_d…²
#> <dttm> <dbl> <int> <int> <int> <int> <int> <dbl>
#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2
#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4
#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2
#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1
#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6
#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4
#> # … with 336,770 more rows, 11 more variables: arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, and abbreviated variable names ¹sched_dep_time, ²dep_delay
But you can use the same .before and .after arguments as mutate() to choose where to put them:
flights |>
  relocate(year:dep_time, .after = time_hour)
#> # A tibble: 336,776 × 19
#> sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest
#> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
#> 1 515 2 830 819 11 UA 1545 N14228 EWR IAH
#> 2 529 4 850 830 20 UA 1714 N24211 LGA IAH
#> 3 540 2 923 850 33 AA 1141 N619AA JFK MIA
#> 4 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
#> 5 600 -6 812 837 -25 DL 461 N668DN LGA ATL
#> 6 558 -4 740 728 12 UA 1696 N39463 EWR ORD
#> # … with 336,770 more rows, 9 more variables: air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>,
#> # month <int>, day <int>, dep_time <int>, and abbreviated variable names
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights |>
  relocate(starts_with("arr"), .before = dep_time)
#> # A tibble: 336,776 × 19
#> year month day arr_time arr_de…¹ dep_t…² sched…³ dep_d…⁴ sched…⁵ carrier
#> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <int> <chr>
#> 1 2013 1 1 830 11 517 515 2 819 UA
#> 2 2013 1 1 850 20 533 529 4 830 UA
#> 3 2013 1 1 923 33 542 540 2 850 AA
#> 4 2013 1 1 1004 -18 544 545 -1 1022 B6
#> 5 2013 1 1 812 -25 554 600 -6 837 DL
#> 6 2013 1 1 740 12 554 558 -4 728 UA
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> # ¹arr_delay, ²dep_time, ³sched_dep_time, ⁴dep_delay, ⁵sched_arr_time
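Because the flights output truncates most column names, it can help to confirm the three behaviours of relocate() on a table with only four columns. This is a minimal sketch; df and the columns a to d are hypothetical.

```r
library(dplyr)

df <- tibble(a = 1, b = 2, c = 3, d = 4)  # hypothetical toy data

to_front  <- df |> relocate(d)                 # default: move to the front
after_c   <- df |> relocate(a, .after = c)     # move a single column after c
range_mov <- df |> relocate(c:d, .before = a)  # move a range before a

names(to_front)   # columns: d, a, b, c
names(after_c)    # columns: b, c, a, d
names(range_mov)  # columns: c, d, a, b
```

Only the order of the columns changes; no columns are added, dropped, or renamed.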