diff --git a/README.md b/README.md
index 5147df2..f0bfeed 100644
--- a/README.md
+++ b/README.md
@@ -47,6 +47,17 @@ devtools::install_github("hadley/r4ds")
 knitr::include_graphics("screenshots/rstudio-wg.png")
 ```
 
+### O'Reilly
+
+To generate book for O'Reilly, build the book then:
+
+```{r}
+devtools::load_all("../minibook/"); process_book()
+
+html <- list.files("oreilly", pattern = "[.]html$", full.names = TRUE)
+file.copy(html, "../r-for-data-science-2e/", overwrite = TRUE)
+```
+
 ## Code of Conduct
 
 Please note that r4ds uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).
diff --git a/_common.R b/_common.R
index dcdb77a..8165433 100644
--- a/_common.R
+++ b/_common.R
@@ -17,7 +17,7 @@ options(
   # Activate crayon output - temporarily disabled for quarto
   # crayon.enabled = TRUE,
   pillar.bold = TRUE,
-  width = 80
+  width = 77 # 80 - 3 for #> comment
 )
 
 ggplot2::theme_set(ggplot2::theme_gray(12))
@@ -39,7 +39,7 @@ status <- function(type) {
   )
 
   cat(paste0(
-    "::: callout-", class, "\n",
+    "::: status callout-", class, "\n",
     "You are reading the work-in-progress second edition of R for Data Science. ",
     "This chapter ", status, ". ",
     "You can find the complete first edition at .\n",
diff --git a/intro.qmd b/intro.qmd
index 2510ae9..6df74e4 100644
--- a/intro.qmd
+++ b/intro.qmd
@@ -340,6 +340,22 @@ The book is powered by [Quarto](https://quarto.org) which makes it easy to write
 This book was built with:
 
 ```{r}
-sessioninfo::session_info(c("tidyverse"))
+#| echo: false
+#| results: asis
+
+pkgs <- sessioninfo::package_info(
+  tidyverse:::tidyverse_packages(),
+  dependencies = FALSE
+)
+df <- tibble(
+  package = pkgs$package,
+  version = pkgs$ondiskversion,
+  source = gsub("@", "\\\\@", pkgs$source)
+)
+knitr::kable(df, format = "markdown")
+```
+
+```{r}
 cli:::ruler()
 ```
+
diff --git a/oreilly/EDA.html b/oreilly/EDA.html
index b2b854d..af195b5 100644
--- a/oreilly/EDA.html
+++ b/oreilly/EDA.html
@@ -1,13 +1,5 @@
-

Exploratory data analysis

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Exploratory data analysis

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

diff --git a/oreilly/base-R.html b/oreilly/base-R.html
index eea74a3..7ff6a62 100644
--- a/oreilly/base-R.html
+++ b/oreilly/base-R.html
@@ -1,13 +1,5 @@
-

A field guide to base R

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

-

To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.

This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a lot of base R functions: from library() to load packages, to sum() and mean() for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like +, -, /, *, |, &, and !. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.

After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!

In this chapter, we’ll focus on four big topics: subsetting with [, subsetting with [[ and $, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.

+

A field guide to base R

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.

This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a lot of base R functions: from library() to load packages, to sum() and mean() for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like +, -, /, *, |, &, and !. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.

After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!

In this chapter, we’ll focus on four big topics: subsetting with [, subsetting with [[ and $, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.

Prerequisites

diff --git a/oreilly/communicate-plots.html b/oreilly/communicate-plots.html
index c239512..5e0e045 100644
--- a/oreilly/communicate-plots.html
+++ b/oreilly/communicate-plots.html
@@ -1,13 +1,5 @@
-

Graphics for communication

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Graphics for communication

::: status callout-important You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

diff --git a/oreilly/data-import.html b/oreilly/data-import.html
index 4e70432..11cef6a 100644
--- a/oreilly/data-import.html
+++ b/oreilly/data-import.html
@@ -1,13 +1,5 @@
-

Data import

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Data import

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

@@ -83,7 +75,7 @@ Reading data from a file
students <- read_csv("data/students.csv")
 #> Rows: 6 Columns: 5
-#> ── Column specification ────────────────────────────────────────────────────────
+#> ── Column specification ─────────────────────────────────────────────────────
 #> Delimiter: ","
 #> chr (4): Full Name, favourite.food, mealPlan, AGE
 #> dbl (1): Student ID
@@ -324,7 +316,7 @@ Guessing types
   T,Inf,2021-02-16,ghi"
 )
 #> Rows: 3 Columns: 4
-#> ── Column specification ────────────────────────────────────────────────────────
+#> ── Column specification ─────────────────────────────────────────────────────
 #> Delimiter: ","
 #> chr  (1): string
 #> dbl  (1): numeric
@@ -360,7 +352,7 @@ Missing values, column types, and problems
 
df <- read_csv(csv)
 #> Rows: 4 Columns: 1
-#> ── Column specification ────────────────────────────────────────────────────────
+#> ── Column specification ─────────────────────────────────────────────────────
 #> Delimiter: ","
 #> chr (1): x
 #> 
@@ -370,8 +362,8 @@ Missing values, column types, and problems
 

In this very small case, you can easily see the missing value .. But what happens if you have thousands of rows with only a few missing values represented by .s speckled amongst them? One approach is to tell readr that x is a numeric column, and then see where it fails. You can do that with the col_types argument, which takes a named list:

df <- read_csv(csv, col_types = list(x = col_double()))
-#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
-#> e.g.:
+#> Warning: One or more parsing issues, call `problems()` on your data frame for
+#> details, e.g.:
 #>   dat <- vroom(...)
 #>   problems(dat)
@@ -381,13 +373,13 @@ Missing values, column types, and problems
 #> # A tibble: 1 × 5
 #>     row   col expected actual file
 #>   <int> <int> <chr>    <chr>  <chr>
-#> 1     3     1 a double .      /private/tmp/Rtmp43JYhG/file7cf337a06034
+#> 1     3     1 a double .      /private/tmp/Rtmpc2nAIe/file8f2f488fc2f4

This tells us that there was a problem in row 3, col 1 where readr expected a double but got a .. That suggests this dataset uses . for missing values. So when we set na = ".", the automatic guessing succeeds, giving us the numeric column that we want:

df <- read_csv(csv, na = ".")
 #> Rows: 4 Columns: 1
-#> ── Column specification ────────────────────────────────────────────────────────
+#> ── Column specification ─────────────────────────────────────────────────────
 #> Delimiter: ","
 #> dbl (1): x
 #> 
@@ -447,7 +439,7 @@ Reading data from multiple files
 
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
 read_csv(sales_files, id = "file")
 #> Rows: 19 Columns: 6
-#> ── Column specification ────────────────────────────────────────────────────────
+#> ── Column specification ─────────────────────────────────────────────────────
 #> Delimiter: ","
 #> chr (1): month
 #> dbl (4): year, brand, item, n
diff --git a/oreilly/data-tidy.html b/oreilly/data-tidy.html
index e48477c..e6a4262 100644
--- a/oreilly/data-tidy.html
+++ b/oreilly/data-tidy.html
@@ -1,13 +1,5 @@
 
-

Data tidying

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Data tidying

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

@@ -174,21 +166,21 @@ Data in column names
billboard
 #> # A tibble: 317 × 79
-#>   artist  track date.ent…¹   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8   wk9
-#>   <chr>   <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
-#> 1 2 Pac   Baby… 2000-02-26    87    82    72    77    87    94    99    NA    NA
-#> 2 2Ge+her The … 2000-09-02    91    87    92    NA    NA    NA    NA    NA    NA
-#> 3 3 Door… Kryp… 2000-04-08    81    70    68    67    66    57    54    53    51
-#> 4 3 Door… Loser 2000-10-21    76    76    72    69    67    65    55    59    62
-#> 5 504 Bo… Wobb… 2000-04-15    57    34    25    17    17    31    36    49    53
-#> 6 98^0    Give… 2000-08-19    51    39    34    26    26    19     2     2     3
-#> # … with 311 more rows, 67 more variables: wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
-#> #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
-#> #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
-#> #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
-#> #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
-#> #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
-#> #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
+#>   artist     track date.ent…¹   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
+#>   <chr>      <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+#> 1 2 Pac      Baby… 2000-02-26    87    82    72    77    87    94    99    NA
+#> 2 2Ge+her    The … 2000-09-02    91    87    92    NA    NA    NA    NA    NA
+#> 3 3 Doors D… Kryp… 2000-04-08    81    70    68    67    66    57    54    53
+#> 4 3 Doors D… Loser 2000-10-21    76    76    72    69    67    65    55    59
+#> 5 504 Boyz   Wobb… 2000-04-15    57    34    25    17    17    31    36    49
+#> 6 98^0       Give… 2000-08-19    51    39    34    26    26    19     2     2
+#> # … with 311 more rows, 68 more variables: wk9 <dbl>, wk10 <dbl>,
+#> #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
+#> #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
+#> #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
+#> #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
+#> #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
+#> #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, …

In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week. Here, the column names are one variable (the week) and the cell values are another (the rank).

To tidy this data, we’ll use pivot_longer(). After the data, there are three key arguments:

@@ -347,21 +339,21 @@ Many variables in column names
who2
 #> # A tibble: 7,240 × 58
-#>   country   year sp_m_…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_…⁶ sp_m_65 sp_f_…⁷
-#>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
-#> 1 Afghani…  1980      NA      NA      NA      NA      NA      NA      NA      NA
-#> 2 Afghani…  1981      NA      NA      NA      NA      NA      NA      NA      NA
-#> 3 Afghani…  1982      NA      NA      NA      NA      NA      NA      NA      NA
-#> 4 Afghani…  1983      NA      NA      NA      NA      NA      NA      NA      NA
-#> 5 Afghani…  1984      NA      NA      NA      NA      NA      NA      NA      NA
-#> 6 Afghani…  1985      NA      NA      NA      NA      NA      NA      NA      NA
-#> # … with 7,234 more rows, 48 more variables: sp_f_1524 <dbl>, sp_f_2534 <dbl>,
-#> #   sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>, sp_f_65 <dbl>,
-#> #   sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>, sn_m_3544 <dbl>,
-#> #   sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>, sn_f_014 <dbl>,
-#> #   sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>, sn_f_4554 <dbl>,
-#> #   sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, ep_m_1524 <dbl>,
-#> #   ep_m_2534 <dbl>, ep_m_3544 <dbl>, ep_m_4554 <dbl>, ep_m_5564 <dbl>, …
+#>   country      year sp_m_014 sp_m_1…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_65
+#>   <chr>       <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
+#> 1 Afghanistan  1980       NA       NA      NA      NA      NA      NA      NA
+#> 2 Afghanistan  1981       NA       NA      NA      NA      NA      NA      NA
+#> 3 Afghanistan  1982       NA       NA      NA      NA      NA      NA      NA
+#> 4 Afghanistan  1983       NA       NA      NA      NA      NA      NA      NA
+#> 5 Afghanistan  1984       NA       NA      NA      NA      NA      NA      NA
+#> 6 Afghanistan  1985       NA       NA      NA      NA      NA      NA      NA
+#> # … with 7,234 more rows, 49 more variables: sp_f_014 <dbl>,
+#> #   sp_f_1524 <dbl>, sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>,
+#> #   sp_f_5564 <dbl>, sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>,
+#> #   sn_m_2534 <dbl>, sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>,
+#> #   sn_m_65 <dbl>, sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>,
+#> #   sn_f_3544 <dbl>, sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>,
+#> #   ep_m_014 <dbl>, ep_m_1524 <dbl>, ep_m_2534 <dbl>, ep_m_3544 <dbl>, …

This dataset records tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: country and year. They are followed by 56 columns like sp_m_014, ep_m_4554, and rel_m_3544. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by _. The first piece, sp/sn/rel/ep, describes the method used for the diagnosis, the second piece, m/f, is the gender, and the third piece, 014/1524/2534/3544/4554/5564/65, is the age range.

So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to pivot_longer(): names_to gets a vector of column names and names_sep describes how to split the variable name up into pieces:
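
As a rough sketch of the call described above, assuming a small made-up table (`mini`, with invented counts) standing in for who2, and using `diagnosis`, `gender`, `age`, and `count` purely as illustrative output names:

```r
library(tidyr)
library(tibble)

# Hypothetical miniature of who2: the counts are invented for illustration.
mini <- tribble(
  ~country, ~year, ~sp_m_014, ~sp_f_014, ~ep_m_1524,
  "A",       2000,         1,         2,          3,
  "B",       2001,         4,         5,          6
)

long <- mini |>
  pivot_longer(
    cols = !c(country, year),                   # every column except the two existing variables
    names_to = c("diagnosis", "gender", "age"), # one new column per piece of the column name
    names_sep = "_",                            # split the column name at each underscore
    values_to = "count"                         # cell values go into a single column
  )

long
```

Each original row becomes three rows here, one per pivoted column, and the three pieces of each column name end up in their own character columns.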

@@ -454,14 +446,14 @@ Widening data
cms_patient_experience
 #> # A tibble: 500 × 5
-#>   org_pac_id org_nm                     measure_cd   measure_title       prf_r…¹
-#>   <chr>      <chr>                      <chr>        <chr>                 <dbl>
-#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1  CAHPS for MIPS SSM…      63
-#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2  CAHPS for MIPS SSM…      87
-#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3  CAHPS for MIPS SSM…      86
-#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5  CAHPS for MIPS SSM…      57
-#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8  CAHPS for MIPS SSM…      85
-#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS SSM…      24
+#>   org_pac_id org_nm                     measure_cd   measure_title    prf_r…¹
+#>   <chr>      <chr>                      <chr>        <chr>              <dbl>
+#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1  CAHPS for MIPS …      63
+#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2  CAHPS for MIPS …      87
+#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3  CAHPS for MIPS …      86
+#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5  CAHPS for MIPS …      57
+#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8  CAHPS for MIPS …      85
+#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS …      24
 #> # … with 494 more rows, and abbreviated variable name ¹​prf_rate

An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for measure_cd and measure_title by using distinct():

@@ -469,13 +461,13 @@ Widening data
cms_patient_experience |> 
   distinct(measure_cd, measure_title)
 #> # A tibble: 6 × 2
-#>   measure_cd   measure_title                                                    
-#>   <chr>        <chr>                                                            
-#> 1 CAHPS_GRP_1  CAHPS for MIPS SSM: Getting Timely Care, Appointments, and Infor…
-#> 2 CAHPS_GRP_2  CAHPS for MIPS SSM: How Well Providers Communicate               
-#> 3 CAHPS_GRP_3  CAHPS for MIPS SSM: Patient's Rating of Provider                 
-#> 4 CAHPS_GRP_5  CAHPS for MIPS SSM: Health Promotion and Education               
-#> 5 CAHPS_GRP_8  CAHPS for MIPS SSM: Courteous and Helpful Office Staff           
+#>   measure_cd   measure_title                                                 
+#>   <chr>        <chr>                                                         
+#> 1 CAHPS_GRP_1  CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…
+#> 2 CAHPS_GRP_2  CAHPS for MIPS SSM: How Well Providers Communicate            
+#> 3 CAHPS_GRP_3  CAHPS for MIPS SSM: Patient's Rating of Provider              
+#> 4 CAHPS_GRP_5  CAHPS for MIPS SSM: Health Promotion and Education            
+#> 5 CAHPS_GRP_8  CAHPS for MIPS SSM: Courteous and Helpful Office Staff        
 #> 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources

Neither of these columns will make particularly great variable names: measure_cd doesn’t hint at the meaning of the variable and measure_title is a long sentence containing spaces. We’ll use measure_cd for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.

@@ -487,14 +479,14 @@ Widening data
     values_from = prf_rate
   )
 #> # A tibble: 500 × 9
-#>   org_pac_id org_nm      measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
-#>   <chr>      <chr>       <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
-#> 1 0446157747 USC CARE M… CAHPS …      63      NA      NA      NA      NA      NA
-#> 2 0446157747 USC CARE M… CAHPS …      NA      87      NA      NA      NA      NA
-#> 3 0446157747 USC CARE M… CAHPS …      NA      NA      86      NA      NA      NA
-#> 4 0446157747 USC CARE M… CAHPS …      NA      NA      NA      57      NA      NA
-#> 5 0446157747 USC CARE M… CAHPS …      NA      NA      NA      NA      85      NA
-#> 6 0446157747 USC CARE M… CAHPS …      NA      NA      NA      NA      NA      24
+#>   org_pac_id org_nm   measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
+#>   <chr>      <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
+#> 1 0446157747 USC CAR… CAHPS …      63      NA      NA      NA      NA      NA
+#> 2 0446157747 USC CAR… CAHPS …      NA      87      NA      NA      NA      NA
+#> 3 0446157747 USC CAR… CAHPS …      NA      NA      86      NA      NA      NA
+#> 4 0446157747 USC CAR… CAHPS …      NA      NA      NA      57      NA      NA
+#> 5 0446157747 USC CAR… CAHPS …      NA      NA      NA      NA      85      NA
+#> 6 0446157747 USC CAR… CAHPS …      NA      NA      NA      NA      NA      24
 #> # … with 494 more rows, and abbreviated variable names ¹​measure_title,
 #> #   ²​CAHPS_GRP_1, ³​CAHPS_GRP_2, ⁴​CAHPS_GRP_3, ⁵​CAHPS_GRP_5, ⁶​CAHPS_GRP_8,
 #> #   ⁷​CAHPS_GRP_12
@@ -508,14 +500,14 @@ Widening data
     values_from = prf_rate
   )
 #> # A tibble: 95 × 8
-#>   org_pac_id org_nm              CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
-#>   <chr>      <chr>                 <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
-#> 1 0446157747 USC CARE MEDICAL G…      63      87      86      57      85      24
-#> 2 0446162697 ASSOCIATION OF UNI…      59      85      83      63      88      22
-#> 3 0547164295 BEAVER MEDICAL GRO…      49      NA      75      44      73      12
-#> 4 0749333730 CAPE PHYSICIANS AS…      67      84      85      65      82      24
-#> 5 0840104360 ALLIANCE PHYSICIAN…      66      87      87      64      87      28
-#> 6 0840109864 REX HOSPITAL INC         73      87      84      67      91      30
+#>   org_pac_id org_nm           CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
+#>   <chr>      <chr>              <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
+#> 1 0446157747 USC CARE MEDICA…      63      87      86      57      85      24
+#> 2 0446162697 ASSOCIATION OF …      59      85      83      63      88      22
+#> 3 0547164295 BEAVER MEDICAL …      49      NA      75      44      73      12
+#> 4 0749333730 CAPE PHYSICIANS…      67      84      85      65      82      24
+#> 5 0840104360 ALLIANCE PHYSIC…      66      87      87      64      87      28
+#> 6 0840109864 REX HOSPITAL INC      73      87      84      67      91      30
 #> # … with 89 more rows, and abbreviated variable names ¹​CAHPS_GRP_1,
 #> #   ²​CAHPS_GRP_2, ³​CAHPS_GRP_3, ⁴​CAHPS_GRP_5, ⁵​CAHPS_GRP_8, ⁶​CAHPS_GRP_12
@@ -602,7 +594,8 @@ How does pivot_wider() work?
     names_from = name,
     values_from = value
   )
-#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
+#> Warning: Values from `value` are not uniquely identified; output will contain
+#> list-cols.
 #> • Use `values_fn = list` to suppress this warning.
 #> • Use `values_fn = {summary_fun}` to summarise duplicates.
 #> • Use the following dplyr code to identify duplicates.
@@ -695,15 +688,16 @@ col_year <- gapminder |>
   )
 col_year
 #> # A tibble: 142 × 13
-#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
-#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
-#> 1 Afghani…   2.89   2.91   2.93   2.92   2.87   2.90   2.99   2.93   2.81   2.80
-#> 2 Albania    3.20   3.29   3.36   3.44   3.52   3.55   3.56   3.57   3.40   3.50
-#> 3 Algeria    3.39   3.48   3.41   3.51   3.62   3.69   3.76   3.75   3.70   3.68
-#> 4 Angola     3.55   3.58   3.63   3.74   3.74   3.48   3.44   3.39   3.42   3.36
-#> 5 Argenti…   3.77   3.84   3.85   3.91   3.98   4.00   3.95   3.96   3.97   4.04
-#> 6 Austral…   4.00   4.04   4.09   4.16   4.23   4.26   4.29   4.34   4.37   4.43
-#> # … with 136 more rows, and 2 more variables: `2002` <dbl>, `2007` <dbl>
+#>   country     `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
+#>   <fct>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
+#> 1 Afghanistan   2.89   2.91   2.93   2.92   2.87   2.90   2.99   2.93   2.81
+#> 2 Albania       3.20   3.29   3.36   3.44   3.52   3.55   3.56   3.57   3.40
+#> 3 Algeria       3.39   3.48   3.41   3.51   3.62   3.69   3.76   3.75   3.70
+#> 4 Angola        3.55   3.58   3.63   3.74   3.74   3.48   3.44   3.39   3.42
+#> 5 Argentina     3.77   3.84   3.85   3.91   3.98   4.00   3.95   3.96   3.97
+#> 6 Australia     4.00   4.04   4.09   4.16   4.23   4.26   4.29   4.34   4.37
+#> # … with 136 more rows, and 3 more variables: `1997` <dbl>, `2002` <dbl>,
+#> #   `2007` <dbl>

pivot_wider() produces a tibble where each row is labelled by the country variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a row name. We can turn the country variable into row names with column_to_rownames():
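
A minimal sketch of that conversion, assuming a two-row stand-in (`small`, a hypothetical name) built from the first two rows and year columns of the output above, and using tibble's `column_to_rownames()`:

```r
library(tibble)

# Hypothetical two-row stand-in for col_year, values copied from the printed output.
small <- tibble(
  country = c("Afghanistan", "Albania"),
  `1952` = c(2.89, 3.20),
  `1957` = c(2.91, 3.29)
)

# Move the identifier out of the columns and into the row names.
# The result is a plain data frame, since tibbles do not keep row names.
with_rownames <- small |>
  column_to_rownames("country")

rownames(with_rownames)
```

After the call, `country` is no longer a column, so only the numeric year columns remain for functions that expect identifiers as row names.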

diff --git a/oreilly/data-transform.html b/oreilly/data-transform.html
index faf6be0..f18fc45 100644
--- a/oreilly/data-transform.html
+++ b/oreilly/data-transform.html
@@ -1,13 +1,5 @@
-

Data transformation

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Data transformation

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

@@ -21,12 +13,12 @@ Prerequisites
library(nycflights13)
 library(tidyverse)
-#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
+#> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
 #> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000
 #> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000  
 #> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000   
 #> ✔ readr   2.1.3             ✔ forcats 0.5.2        
-#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
 #> ✖ dplyr::filter() masks stats::filter()
 #> ✖ dplyr::lag()    masks stats::lag()
@@ -40,14 +32,14 @@ nycflights13
flights
 #> # A tibble: 336,776 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      533         529       4     850     830      20 UA     
-#> 3  2013     1     1      542         540       2     923     850      33 AA     
-#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
-#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      533      529       4     850     830      20 UA     
+#> 3  2013     1     1      542      540       2     923     850      33 AA     
+#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554      558      -4     740     728      12 UA     
 #> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -92,14 +84,14 @@ Rows
 
flights |> 
   filter(arr_delay > 120)
 #> # A tibble: 10,034 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      811         630     101    1047     830     137 MQ     
-#> 2  2013     1     1      848        1835     853    1001    1950     851 MQ     
-#> 3  2013     1     1      957         733     144    1056     853     123 UA     
-#> 4  2013     1     1     1114         900     134    1447    1222     145 UA     
-#> 5  2013     1     1     1505        1310     115    1638    1431     127 EV     
-#> 6  2013     1     1     1525        1340     105    1831    1626     125 B6     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      811      630     101    1047     830     137 MQ     
+#> 2  2013     1     1      848     1835     853    1001    1950     851 MQ     
+#> 3  2013     1     1      957      733     144    1056     853     123 UA     
+#> 4  2013     1     1     1114      900     134    1447    1222     145 UA     
+#> 5  2013     1     1     1505     1310     115    1638    1431     127 EV     
+#> 6  2013     1     1     1525     1340     105    1831    1626     125 B6     
 #> # … with 10,028 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -111,14 +103,14 @@ Rows
 flights |> 
   filter(month == 1 & day == 1)
 #> # A tibble: 842 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      533         529       4     850     830      20 UA     
-#> 3  2013     1     1      542         540       2     923     850      33 AA     
-#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
-#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      533      529       4     850     830      20 UA     
+#> 3  2013     1     1      542      540       2     923     850      33 AA     
+#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554      558      -4     740     728      12 UA     
 #> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -128,14 +120,14 @@ flights |>
 flights |> 
   filter(month == 1 | month == 2)
 #> # A tibble: 51,955 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      533         529       4     850     830      20 UA     
-#> 3  2013     1     1      542         540       2     923     850      33 AA     
-#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
-#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      533      529       4     850     830      20 UA     
+#> 3  2013     1     1      542      540       2     923     850      33 AA     
+#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554      558      -4     740     728      12 UA     
 #> # … with 51,949 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -147,14 +139,14 @@ flights |>
 flights |> 
   filter(month %in% c(1, 2))
 #> # A tibble: 51,955 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      533         529       4     850     830      20 UA     
-#> 3  2013     1     1      542         540       2     923     850      33 AA     
-#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
-#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      533      529       4     850     830      20 UA     
+#> 3  2013     1     1      542      540       2     923     850      33 AA     
+#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554      558      -4     740     728      12 UA     
 #> # … with 51,949 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -197,14 +189,14 @@ Common mistakes
 
flights |> 
   arrange(year, month, day, dep_time)
 #> # A tibble: 336,776 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      533         529       4     850     830      20 UA     
-#> 3  2013     1     1      542         540       2     923     850      33 AA     
-#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
-#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      533      529       4     850     830      20 UA     
+#> 3  2013     1     1      542      540       2     923     850      33 AA     
+#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554      558      -4     740     728      12 UA     
 #> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -215,14 +207,14 @@ Common mistakes
 
flights |> 
   arrange(desc(dep_delay))
 #> # A tibble: 336,776 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     9      641         900    1301    1242    1530    1272 HA     
-#> 2  2013     6    15     1432        1935    1137    1607    2120    1127 MQ     
-#> 3  2013     1    10     1121        1635    1126    1239    1810    1109 MQ     
-#> 4  2013     9    20     1139        1845    1014    1457    2210    1007 AA     
-#> 5  2013     7    22      845        1600    1005    1044    1815     989 MQ     
-#> 6  2013     4    10     1100        1900     960    1342    2211     931 DL     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     9      641      900    1301    1242    1530    1272 HA     
+#> 2  2013     6    15     1432     1935    1137    1607    2120    1127 MQ     
+#> 3  2013     1    10     1121     1635    1126    1239    1810    1109 MQ     
+#> 4  2013     9    20     1139     1845    1014    1457    2210    1007 AA     
+#> 5  2013     7    22      845     1600    1005    1044    1815     989 MQ     
+#> 6  2013     4    10     1100     1900     960    1342    2211     931 DL     
 #> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -234,14 +226,14 @@ Common mistakes
   filter(dep_delay <= 10 & dep_delay >= -10) |> 
   arrange(desc(arr_delay))
 #> # A tibble: 239,109 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013    11     1      658         700      -2    1329    1015     194 VX     
-#> 2  2013     4    18      558         600      -2    1149     850     179 AA     
-#> 3  2013     7     7     1659        1700      -1    2050    1823     147 US     
-#> 4  2013     7    22     1606        1615      -9    2056    1831     145 DL     
-#> 5  2013     9    19      648         641       7    1035     810     145 UA     
-#> 6  2013     4    18      655         700      -5    1213     950     143 AA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013    11     1      658      700      -2    1329    1015     194 VX     
+#> 2  2013     4    18      558      600      -2    1149     850     179 AA     
+#> 3  2013     7     7     1659     1700      -1    2050    1823     147 US     
+#> 4  2013     7    22     1606     1615      -9    2056    1831     145 DL     
+#> 5  2013     9    19      648      641       7    1035     810     145 UA     
+#> 6  2013     4    18      655      700      -5    1213     950     143 AA     
 #> # … with 239,103 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -285,14 +277,14 @@ Columns
     speed = distance / air_time * 60
   )
 #> # A tibble: 336,776 × 21
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      533         529       4     850     830      20 UA     
-#> 3  2013     1     1      542         540       2     923     850      33 AA     
-#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
-#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      533      529       4     850     830      20 UA     
+#> 3  2013     1     1      542      540       2     923     850      33 AA     
+#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554      558      -4     740     728      12 UA     
 #> # … with 336,770 more rows, 11 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>, and abbreviated
@@ -308,18 +300,19 @@ Columns
     .before = 1
   )
 #> # A tibble: 336,776 × 21
-#>    gain speed  year month   day dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
-#>   <dbl> <dbl> <int> <int> <int>    <int>   <int>   <dbl>   <int>   <int>   <dbl>
-#> 1    -9  370.  2013     1     1      517     515       2     830     819      11
-#> 2   -16  374.  2013     1     1      533     529       4     850     830      20
-#> 3   -31  408.  2013     1     1      542     540       2     923     850      33
-#> 4    17  517.  2013     1     1      544     545      -1    1004    1022     -18
-#> 5    19  394.  2013     1     1      554     600      -6     812     837     -25
-#> 6   -16  288.  2013     1     1      554     558      -4     740     728      12
-#> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>,
-#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
-#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
-#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+#>    gain speed  year month   day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
+#>   <dbl> <dbl> <int> <int> <int>    <int>        <int>   <dbl>   <int>   <int>
+#> 1    -9  370.  2013     1     1      517          515       2     830     819
+#> 2   -16  374.  2013     1     1      533          529       4     850     830
+#> 3   -31  408.  2013     1     1      542          540       2     923     850
+#> 4    17  517.  2013     1     1      544          545      -1    1004    1022
+#> 5    19  394.  2013     1     1      554          600      -6     812     837
+#> 6   -16  288.  2013     1     1      554          558      -4     740     728
+#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
+#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
+#> #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
+#> #   ²dep_delay, ³arr_time, ⁴sched_arr_time

The . is a sign that .before is an argument to the function, not the name of a new variable. You can also use .after to add after a variable, and in both .before and .after you can use a variable name instead of a position. For example, we could add the new variables after day:

@@ -330,18 +323,19 @@ Columns .after = day ) #> # A tibble: 336,776 × 21 -#> year month day gain speed dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ -#> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <int> <dbl> -#> 1 2013 1 1 -9 370. 517 515 2 830 819 11 -#> 2 2013 1 1 -16 374. 533 529 4 850 830 20 -#> 3 2013 1 1 -31 408. 542 540 2 923 850 33 -#> 4 2013 1 1 17 517. 544 545 -1 1004 1022 -18 -#> 5 2013 1 1 19 394. 554 600 -6 812 837 -25 -#> 6 2013 1 1 -16 288. 554 558 -4 740 728 12 -#> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>, -#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, -#> # hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names -#> # ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay +#> year month day gain speed dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴ +#> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <int> +#> 1 2013 1 1 -9 370. 517 515 2 830 819 +#> 2 2013 1 1 -16 374. 533 529 4 850 830 +#> 3 2013 1 1 -31 408. 542 540 2 923 850 +#> 4 2013 1 1 17 517. 544 545 -1 1004 1022 +#> 5 2013 1 1 19 394. 554 600 -6 812 837 +#> 6 2013 1 1 -16 288. 554 558 -4 740 728 +#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>, +#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, +#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, +#> # time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time, +#> # ²​dep_delay, ³​arr_time, ⁴​sched_arr_time

Alternatively, you can control which variables are kept with the .keep argument. A particularly useful argument is "used" which allows you to see the inputs and outputs from your calculations:
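
As a minimal sketch of what `.keep = "used"` does, the block below runs the same `gain` and `speed` calculations on a small made-up tibble. The `toy` data frame and its values are assumptions for illustration, not rows taken from `flights`. Only the columns referenced in the calculations, plus the new columns, should remain.

```r
library(dplyr)

# Hypothetical stand-in for `flights`: a few invented rows, including one
# column (`carrier`) that the calculations below never touch.
toy <- tibble(
  dep_delay = c(2, 4, -1),
  arr_delay = c(11, 20, -18),
  distance  = c(1400, 1416, 1089),
  air_time  = c(227, 227, 160),
  carrier   = c("UA", "UA", "AA")
)

result <- toy |>
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .keep = "used"  # keep only the inputs used above and the new outputs
  )

# `carrier` is dropped because neither calculation refers to it.
names(result)
```

Seeing the inputs next to the outputs this way makes it easier to check a calculation by eye before using it on the full data.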

@@ -403,18 +397,18 @@ flights |> flights |> select(!year:day) #> # A tibble: 336,776 × 16 -#> dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin -#> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> -#> 1 517 515 2 830 819 11 UA 1545 N14228 EWR -#> 2 533 529 4 850 830 20 UA 1714 N24211 LGA -#> 3 542 540 2 923 850 33 AA 1141 N619AA JFK -#> 4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK -#> 5 554 600 -6 812 837 -25 DL 461 N668DN LGA -#> 6 554 558 -4 740 728 12 UA 1696 N39463 EWR -#> # … with 336,770 more rows, 6 more variables: dest <chr>, air_time <dbl>, -#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated -#> # variable names ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, -#> # ⁵​arr_delay +#> dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum +#> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> +#> 1 517 515 2 830 819 11 UA 1545 N14228 +#> 2 533 529 4 850 830 20 UA 1714 N24211 +#> 3 542 540 2 923 850 33 AA 1141 N619AA +#> 4 544 545 -1 1004 1022 -18 B6 725 N804JB +#> 5 554 600 -6 812 837 -25 DL 461 N668DN +#> 6 554 558 -4 740 728 12 UA 1696 N39463 +#> # … with 336,770 more rows, 7 more variables: origin <chr>, dest <chr>, +#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, +#> # time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time, +#> # ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay # Select all columns that are characters flights |> @@ -466,14 +460,14 @@ flights |>
flights |> 
   rename(tail_num = tailnum)
 #> # A tibble: 336,776 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      533         529       4     850     830      20 UA     
-#> 3  2013     1     1      542         540       2     923     850      33 AA     
-#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
-#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      533      529       4     850     830      20 UA     
+#> 3  2013     1     1      542      540       2     923     850      33 AA     
+#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554      558      -4     740     728      12 UA     
 #> # … with 336,770 more rows, 9 more variables: flight <int>, tail_num <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -492,51 +486,51 @@ flights |>
 
flights |> 
   relocate(time_hour, air_time)
 #> # A tibble: 336,776 × 19
-#>   time_hour           air_time  year month   day dep_t…¹ sched…² dep_d…³ arr_t…⁴
-#>   <dttm>                 <dbl> <int> <int> <int>   <int>   <int>   <dbl>   <int>
-#> 1 2013-01-01 05:00:00      227  2013     1     1     517     515       2     830
-#> 2 2013-01-01 05:00:00      227  2013     1     1     533     529       4     850
-#> 3 2013-01-01 05:00:00      160  2013     1     1     542     540       2     923
-#> 4 2013-01-01 05:00:00      183  2013     1     1     544     545      -1    1004
-#> 5 2013-01-01 06:00:00      116  2013     1     1     554     600      -6     812
-#> 6 2013-01-01 05:00:00      150  2013     1     1     554     558      -4     740
-#> # … with 336,770 more rows, 10 more variables: sched_arr_time <int>,
-#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
-#> #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
-#> #   variable names ¹​dep_time, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time
+#>   time_hour           air_time  year month   day dep_time sched_dep…¹ dep_d…²
+#>   <dttm>                 <dbl> <int> <int> <int>    <int>       <int>   <dbl>
+#> 1 2013-01-01 05:00:00      227  2013     1     1      517         515       2
+#> 2 2013-01-01 05:00:00      227  2013     1     1      533         529       4
+#> 3 2013-01-01 05:00:00      160  2013     1     1      542         540       2
+#> 4 2013-01-01 05:00:00      183  2013     1     1      544         545      -1
+#> 5 2013-01-01 06:00:00      116  2013     1     1      554         600      -6
+#> 6 2013-01-01 05:00:00      150  2013     1     1      554         558      -4
+#> # … with 336,770 more rows, 11 more variables: arr_time <int>,
+#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
+#> #   tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, and abbreviated variable names ¹sched_dep_time, ²dep_delay

But you can use the same .before and .after arguments as mutate() to choose where to put them:

flights |> 
   relocate(year:dep_time, .after = time_hour)
 #> # A tibble: 336,776 × 19
-#>   sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest 
-#>        <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr>  <chr>
-#> 1        515       2     830     819      11 UA        1545 N14228  EWR    IAH  
-#> 2        529       4     850     830      20 UA        1714 N24211  LGA    IAH  
-#> 3        540       2     923     850      33 AA        1141 N619AA  JFK    MIA  
-#> 4        545      -1    1004    1022     -18 B6         725 N804JB  JFK    BQN  
-#> 5        600      -6     812     837     -25 DL         461 N668DN  LGA    ATL  
-#> 6        558      -4     740     728      12 UA        1696 N39463  EWR    ORD  
-#> # … with 336,770 more rows, 9 more variables: air_time <dbl>, distance <dbl>,
-#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>, month <int>,
-#> #   day <int>, dep_time <int>, and abbreviated variable names ¹​sched_dep_time,
-#> #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+#>   sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest 
+#>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr>  <chr>
+#> 1     515       2     830     819      11 UA        1545 N14228  EWR    IAH  
+#> 2     529       4     850     830      20 UA        1714 N24211  LGA    IAH  
+#> 3     540       2     923     850      33 AA        1141 N619AA  JFK    MIA  
+#> 4     545      -1    1004    1022     -18 B6         725 N804JB  JFK    BQN  
+#> 5     600      -6     812     837     -25 DL         461 N668DN  LGA    ATL  
+#> 6     558      -4     740     728      12 UA        1696 N39463  EWR    ORD  
+#> # … with 336,770 more rows, 9 more variables: air_time <dbl>,
+#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>,
+#> #   month <int>, day <int>, dep_time <int>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
 flights |> 
   relocate(starts_with("arr"), .before = dep_time)
 #> # A tibble: 336,776 × 19
-#>    year month   day arr_time arr_delay dep_time sched_…¹ dep_d…² sched…³ carrier
-#>   <int> <int> <int>    <int>     <dbl>    <int>    <int>   <dbl>   <int> <chr>  
-#> 1  2013     1     1      830        11      517      515       2     819 UA     
-#> 2  2013     1     1      850        20      533      529       4     830 UA     
-#> 3  2013     1     1      923        33      542      540       2     850 AA     
-#> 4  2013     1     1     1004       -18      544      545      -1    1022 B6     
-#> 5  2013     1     1      812       -25      554      600      -6     837 DL     
-#> 6  2013     1     1      740        12      554      558      -4     728 UA     
+#>    year month   day arr_time arr_de…¹ dep_t…² sched…³ dep_d…⁴ sched…⁵ carrier
+#>   <int> <int> <int>    <int>    <dbl>   <int>   <int>   <dbl>   <int> <chr>  
+#> 1  2013     1     1      830       11     517     515       2     819 UA     
+#> 2  2013     1     1      850       20     533     529       4     830 UA     
+#> 3  2013     1     1      923       33     542     540       2     850 AA     
+#> 4  2013     1     1     1004      -18     544     545      -1    1022 B6     
+#> 5  2013     1     1      812      -25     554     600      -6     837 DL     
+#> 6  2013     1     1      740       12     554     558      -4     728 UA     
 #> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
-#> #   ¹​sched_dep_time, ²​dep_delay, ³​sched_arr_time
+#> #   ¹arr_delay, ²dep_time, ³sched_dep_time, ⁴dep_delay, ⁵sched_arr_time
@@ -580,14 +574,14 @@ Groups group_by(month) #> # A tibble: 336,776 × 19 #> # Groups: month [12] -#> year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier -#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> -#> 1 2013 1 1 517 515 2 830 819 11 UA -#> 2 2013 1 1 533 529 4 850 830 20 UA -#> 3 2013 1 1 542 540 2 923 850 33 AA -#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6 -#> 5 2013 1 1 554 600 -6 812 837 -25 DL -#> 6 2013 1 1 554 558 -4 740 728 12 UA +#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier +#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> +#> 1 2013 1 1 517 515 2 830 819 11 UA +#> 2 2013 1 1 533 529 4 850 830 20 UA +#> 3 2013 1 1 542 540 2 923 850 33 AA +#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6 +#> 5 2013 1 1 554 600 -6 812 837 -25 DL +#> 6 2013 1 1 554 558 -4 740 728 12 UA #> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, #> # minute <dbl>, time_hour <dttm>, and abbreviated variable names @@ -679,14 +673,14 @@ Theslice_ functions slice_max(arr_delay, n = 1) #> # A tibble: 108 × 19 #> # Groups: dest [105] -#> year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier -#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> -#> 1 2013 7 22 2145 2007 98 132 2259 153 B6 -#> 2 2013 7 23 1139 800 219 1250 909 221 B6 -#> 3 2013 1 25 123 2000 323 229 2101 328 EV -#> 4 2013 8 17 1740 1625 75 2042 2003 39 UA -#> 5 2013 7 22 2257 759 898 121 1026 895 DL -#> 6 2013 7 10 2056 1505 351 2347 1758 349 UA +#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier +#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> +#> 1 2013 7 22 2145 2007 98 132 2259 153 B6 +#> 2 2013 7 23 1139 800 219 1250 909 221 B6 +#> 3 2013 1 25 123 2000 323 229 2101 328 EV +#> 4 2013 8 17 1740 1625 75 2042 2003 39 UA +#> 5 2013 7 22 2257 759 898 121 1026 895 DL +#> 6 2013 7 
10 2056 1505 351 2347 1758 349 UA #> # … with 102 more rows, 9 more variables: flight <int>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, #> # minute <dbl>, time_hour <dttm>, and abbreviated variable names @@ -725,14 +719,14 @@ Grouping by multiple variables daily #> # A tibble: 336,776 × 19 #> # Groups: year, month, day [365] -#> year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier -#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> -#> 1 2013 1 1 517 515 2 830 819 11 UA -#> 2 2013 1 1 533 529 4 850 830 20 UA -#> 3 2013 1 1 542 540 2 923 850 33 AA -#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6 -#> 5 2013 1 1 554 600 -6 812 837 -25 DL -#> 6 2013 1 1 554 558 -4 740 728 12 UA +#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier +#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> +#> 1 2013 1 1 517 515 2 830 819 11 UA +#> 2 2013 1 1 533 529 4 850 830 20 UA +#> 3 2013 1 1 542 540 2 923 850 33 AA +#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6 +#> 5 2013 1 1 554 600 -6 812 837 -25 DL +#> 6 2013 1 1 554 558 -4 740 728 12 UA #> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>, #> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, #> # minute <dbl>, time_hour <dttm>, and abbreviated variable names @@ -744,8 +738,8 @@ daily summarize( n = n() ) -#> `summarise()` has grouped output by 'year', 'month'. You can override using the -#> `.groups` argument. +#> `summarise()` has grouped output by 'year', 'month'. You can override using +#> the `.groups` argument.

If you’re happy with this behavior, you can explicitly request it in order to suppress the message:
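
A minimal sketch of that explicit request, assuming a small made-up tibble in place of `flights` (the `toy` data and the `daily_counts` name are illustrative only). Passing `.groups = "drop_last"` asks for the same behavior the message describes, so the last grouping variable is peeled off and no message is printed.

```r
library(dplyr)

# Hypothetical data: four invented flights over three distinct days.
toy <- tibble(
  year  = c(2013, 2013, 2013, 2013),
  month = c(1, 1, 1, 2),
  day   = c(1, 1, 2, 1)
)

daily_counts <- toy |>
  group_by(year, month, day) |>
  summarize(
    n = n(),
    .groups = "drop_last"  # drop only the last grouping variable (day)
  )

# The result is still grouped, now by year and month.
group_vars(daily_counts)
```

Other values of `.groups`, such as `"drop"` or `"keep"`, change what grouping the result carries; which one fits depends on what you plan to do with the summary next.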

diff --git a/oreilly/data-visualize.html b/oreilly/data-visualize.html index cacaf0a..5b9b932 100644 --- a/oreilly/data-visualize.html +++ b/oreilly/data-visualize.html @@ -14,12 +14,12 @@ Prerequisites

This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:

library(tidyverse)
-#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
+#> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
 #> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000
 #> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000  
 #> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000   
 #> ✔ readr   2.1.3             ✔ forcats 0.5.2        
-#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
 #> ✖ dplyr::filter() masks stats::filter()
 #> ✖ dplyr::lag()    masks stats::lag()
@@ -45,14 +45,14 @@ Thempg data frame
mpg
 #> # A tibble: 234 × 11
-#>   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
-#>   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
-#> 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
-#> 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
-#> 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
-#> 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
-#> 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
-#> 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
+#>   manufacturer model displ  year   cyl trans    drv     cty   hwy fl    class
+#>   <chr>        <chr> <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>
+#> 1 audi         a4      1.8  1999     4 auto(l5) f        18    29 p     comp…
+#> 2 audi         a4      1.8  1999     4 manual(… f        21    29 p     comp…
+#> 3 audi         a4      2    2008     4 manual(… f        20    31 p     comp…
+#> 4 audi         a4      2    2008     4 auto(av) f        21    30 p     comp…
+#> 5 audi         a4      2.8  1999     6 auto(l5) f        16    26 p     comp…
+#> 6 audi         a4      2.8  1999     6 manual(… f        18    26 p     comp…
 #> # … with 228 more rows

Among the variables in mpg are:

diff --git a/oreilly/databases.html b/oreilly/databases.html index 213716b..6c9be8b 100644 --- a/oreilly/databases.html +++ b/oreilly/databases.html @@ -1,26 +1,5 @@
-

Databases

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- -

There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:

-
diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
-diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
-

Other times you might want to use your own SQL query as a starting point:

-
diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))
-
- -

Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue on GitHub to help us do better.

- -

In the examples above note that "year" and "type" are wrapped in double quotes. That’s because these are reserved words in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.

When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.

SELECT "tailnum", "type", "manufacturer", "model", "year"
-FROM "planes"

Some other database systems use backticks instead of quotes:

SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
-FROM `planes`
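
To see this kind of quoting in isolation, here is a small sketch that uses DBI's generic ANSI dialect rather than a real database connection. It is not how dbplyr produced the queries above, and the `quoted` name is just for illustration. `dbQuoteIdentifier()` wraps each name in the quote character of the dialect it is given, which for ANSI SQL is the double quote.

```r
library(DBI)

# ANSI() is a dummy connection object describing standard SQL quoting rules;
# no database is opened here.
quoted <- dbQuoteIdentifier(ANSI(), c("tailnum", "type", "year"))

# Each identifier is wrapped in double quotes, whether or not it is reserved.
as.character(quoted)
```

A backend that uses backticks would supply its own method for the same generic, which is why the generated SQL differs between database systems.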
- +

Databases

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

@@ -203,8 +182,6 @@ diamonds_db
-

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

-

There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:

diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
 diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
@@ -334,8 +311,6 @@ planes |> show_query()
@@ -388,8 +363,6 @@ planes |>
@@ -665,8 +638,8 @@ mutate_query <- function(df, ...) { mean = mean(arr_delay, na.rm = TRUE), median = median(arr_delay, na.rm = TRUE) ) -#> `summarise()` has grouped output by "year" and "month". You can override using -#> the `.groups` argument. +#> `summarise()` has grouped output by "year" and "month". You can override +#> using the `.groups` argument. #> <SQL> #> SELECT #> "year", diff --git a/oreilly/datetimes.html b/oreilly/datetimes.html index 3357b5d..cb912a4 100644 --- a/oreilly/datetimes.html +++ b/oreilly/datetimes.html @@ -1,13 +1,5 @@
-

Dates and times

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Dates and times

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

@@ -43,7 +35,7 @@ Creating date/times
today()
 #> [1] "2022-11-18"
 now()
-#> [1] "2022-11-18 10:21:36 CST"
+#> [1] "2022-11-18 10:59:07 CST"

Otherwise, the following sections describe the four ways you’re likely to create a date/time:

  • While reading a file with readr.
  • diff --git a/oreilly/factors.html b/oreilly/factors.html index 8801312..db71b1e 100644 --- a/oreilly/factors.html +++ b/oreilly/factors.html @@ -1,13 +1,5 @@
    -

    Factors

    -
    - -
    - -
    - -

    You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at https://r4ds.had.co.nz.

    - +

    Factors

    ::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at https://r4ds.had.co.nz. :::

    Introduction

    @@ -122,14 +114,14 @@ General Social Survey
    gss_cat
     #> # A tibble: 21,483 × 9
    -#>    year marital         age race  rincome        partyid     relig denom tvhours
    -#>   <int> <fct>         <int> <fct> <fct>          <fct>       <fct> <fct>   <int>
    -#> 1  2000 Never married    26 White $8000 to 9999  Ind,near r… Prot… Sout…      12
    -#> 2  2000 Divorced         48 White $8000 to 9999  Not str re… Prot… Bapt…      NA
    -#> 3  2000 Widowed          67 White Not applicable Independent Prot… No d…       2
    -#> 4  2000 Never married    39 White Not applicable Ind,near r… Orth… Not …       4
    -#> 5  2000 Divorced         25 White Not applicable Not str de… None  Not …       1
    -#> 6  2000 Married          25 White $20000 - 24999 Strong dem… Prot… Sout…      NA
    +#>    year marital         age race  rincome        partyid  relig denom tvhours
    +#>   <int> <fct>         <int> <fct> <fct>          <fct>    <fct> <fct>   <int>
    +#> 1  2000 Never married    26 White $8000 to 9999  Ind,nea… Prot… Sout…      12
    +#> 2  2000 Divorced         48 White $8000 to 9999  Not str… Prot… Bapt…      NA
    +#> 3  2000 Widowed          67 White Not applicable Indepen… Prot… No d…       2
    +#> 4  2000 Never married    39 White Not applicable Ind,nea… Orth… Not …       4
    +#> 5  2000 Divorced         25 White Not applicable Not str… None  Not …       1
    +#> 6  2000 Married          25 White $20000 - 24999 Strong … Prot… Sout…      NA
     #> # … with 21,477 more rows

    (Remember, since this dataset is provided by a package, you can get more information about the variables with ?gss_cat.)

    diff --git a/oreilly/functions.html b/oreilly/functions.html index 33db1b8..207dbcf 100644 --- a/oreilly/functions.html +++ b/oreilly/functions.html @@ -1,17 +1,5 @@
    -

    Functions

    -
    - -
    - -

    -RStudio -

    You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

    - -

    Once you start writing functions, there are two RStudio shortcuts that are super useful:

    • To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2.

    • -
    • To quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.

    • -
    - +

    Functions

    ::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

    Introduction

    @@ -278,9 +266,7 @@ mape <- function(actual, predicted) {

    RStudio -

    You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

    - -

    Once you start writing functions, there are two RStudio shortcuts that are super useful:

    • To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2.

    • +

      Once you start writing functions, there are two RStudio shortcuts that are super useful:

      • To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2.

      • To quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.

    @@ -490,14 +476,14 @@ flights |> unique_where(tailnum == "N14228", month) flights_sub(dest == "IAH", contains("time")) #> # A tibble: 7,198 × 8 -#> time_hour carrier flight dep_time sched_de…¹ arr_t…² sched…³ air_t…⁴ -#> <dttm> <chr> <int> <int> <int> <int> <int> <dbl> -#> 1 2013-01-01 05:00:00 UA 1545 517 515 830 819 227 -#> 2 2013-01-01 05:00:00 UA 1714 533 529 850 830 227 -#> 3 2013-01-01 06:00:00 UA 496 623 627 933 932 229 -#> 4 2013-01-01 07:00:00 UA 473 728 732 1041 1038 238 -#> 5 2013-01-01 07:00:00 UA 1479 739 739 1104 1038 249 -#> 6 2013-01-01 09:00:00 UA 1220 908 908 1228 1219 233 +#> time_hour carrier flight dep_time sched…¹ arr_t…² sched…³ air_t…⁴ +#> <dttm> <chr> <int> <int> <int> <int> <int> <dbl> +#> 1 2013-01-01 05:00:00 UA 1545 517 515 830 819 227 +#> 2 2013-01-01 05:00:00 UA 1714 533 529 850 830 227 +#> 3 2013-01-01 06:00:00 UA 496 623 627 933 932 229 +#> 4 2013-01-01 07:00:00 UA 473 728 732 1041 1038 238 +#> 5 2013-01-01 07:00:00 UA 1479 739 739 1104 1038 249 +#> 6 2013-01-01 09:00:00 UA 1220 908 908 1228 1219 233 #> # … with 7,192 more rows, and abbreviated variable names ¹​sched_dep_time, #> # ²​arr_time, ³​sched_arr_time, ⁴​air_time @@ -529,8 +515,8 @@ flights |> } flights |> count_missing(c(year, month, day), dep_time) -#> `summarise()` has grouped output by 'year', 'month'. You can override using the -#> `.groups` argument. +#> `summarise()` has grouped output by 'year', 'month'. You can override using +#> the `.groups` argument. #> # A tibble: 365 × 4 #> # Groups: year, month [12] #> year month day n_miss diff --git a/oreilly/intro.html b/oreilly/intro.html index 3d940dd..a5af969 100644 --- a/oreilly/intro.html +++ b/oreilly/intro.html @@ -98,12 +98,12 @@ The tidyverse

    You will not be able to use the functions, objects, or help files in a package until you load it with library(). Once you have installed a package, you can load it using the library() function:

    library(tidyverse)
    -#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
    +#> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
     #> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000
     #> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000  
     #> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000   
     #> ✔ readr   2.1.3             ✔ forcats 0.5.2        
    -#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    +#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
     #> ✖ dplyr::filter() masks stats::filter()
     #> ✖ dplyr::lag()    masks stats::lag()
    @@ -162,134 +162,105 @@ Acknowledgements Colophon

    An online version of this book is available at https://r4ds.hadley.nz. It will continue to evolve in between reprints of the physical book. The source of the book is available at https://github.com/hadley/r4ds. The book is powered by Quarto which makes it easy to write books that combine text and executable code.

    This book was built with:

    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    packageversionsource
    broom1.0.1CRAN (R 4.2.0)
    cli3.4.1CRAN (R 4.2.1)
    crayon1.5.2CRAN (R 4.2.0)
    dbplyr2.2.1.9000Github (tidyverse/dbplyr@f7b5596f6125011ab0dcd4eccbfe56c5294214da)
    dplyr1.0.99.9000local
    dtplyr1.2.2CRAN (R 4.2.0)
    forcats0.5.2CRAN (R 4.2.0)
    ggplot23.4.0.9000Github (tidyverse/ggplot2@4fea51b1eb2cdacebeacf425627dcbc1d61a5d3e)
    googledrive2.0.0CRAN (R 4.2.0)
    googlesheets41.0.1CRAN (R 4.2.0)
    haven2.5.1CRAN (R 4.2.0)
    hms1.1.2CRAN (R 4.2.0)
    httr1.4.4CRAN (R 4.2.0)
    jsonlite1.8.3CRAN (R 4.2.1)
    lubridate1.9.0CRAN (R 4.2.1)
    magrittr2.0.3CRAN (R 4.2.0)
    modelr0.1.9CRAN (R 4.2.0)
    pillar1.8.1CRAN (R 4.2.0)
    purrr0.9000.0.9000Github (tidyverse/purrr@aaaa58a571cc449dbcc4348e77e589b373e1e059)
    readr2.1.3CRAN (R 4.2.1)
    readxl1.4.1CRAN (R 4.2.0)
    reprex2.0.2CRAN (R 4.2.0)
    rlang1.0.6CRAN (R 4.2.0)
    rstudioapi0.14CRAN (R 4.2.0)
    rvest1.0.3CRAN (R 4.2.0)
    stringr1.4.1.9000Github (tidyverse/stringr@ebf38238cbb80bf0e852d5d8d056c04e36d7c20c)
    tibble3.1.8CRAN (R 4.2.0)
    tidyr1.2.1.9001Github (tidyverse/tidyr@91747952f10c961be747c0de1026d109c920e4fc)
    tidyverse1.3.2CRAN (R 4.2.0)
    xml21.3.3CRAN (R 4.2.0)
    -
    sessioninfo::session_info(c("tidyverse"))
    -#> ─ Session info ───────────────────────────────────────────────────────────────
    -#>  setting  value
    -#>  version  R version 4.2.1 (2022-06-23)
    -#>  os       macOS Ventura 13.0.1
    -#>  system   aarch64, darwin20
    -#>  ui       X11
    -#>  language (EN)
    -#>  collate  en_US.UTF-8
    -#>  ctype    en_US.UTF-8
    -#>  tz       America/Chicago
    -#>  date     2022-11-18
    -#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
    -#> 
    -#> ─ Packages ───────────────────────────────────────────────────────────────────
    -#>  package       * version       date (UTC) lib source
    -#>  askpass         1.1           2019-01-13 [1] CRAN (R 4.2.0)
    -#>  assertthat      0.2.1         2019-03-21 [1] CRAN (R 4.2.0)
    -#>  backports       1.4.1         2021-12-13 [1] CRAN (R 4.2.0)
    -#>  base64enc       0.1-3         2015-07-28 [1] CRAN (R 4.2.0)
    -#>  bit             4.0.4         2020-08-04 [1] CRAN (R 4.2.0)
    -#>  bit64           4.0.5         2020-08-30 [1] CRAN (R 4.2.0)
    -#>  blob            1.2.3         2022-04-10 [1] CRAN (R 4.2.0)
    -#>  broom           1.0.1         2022-08-29 [1] CRAN (R 4.2.0)
    -#>  bslib           0.4.1         2022-11-02 [1] CRAN (R 4.2.0)
    -#>  cachem          1.0.6         2021-08-19 [1] CRAN (R 4.2.0)
    -#>  callr           3.7.3         2022-11-02 [1] CRAN (R 4.2.1)
    -#>  cellranger      1.1.0         2016-07-27 [1] CRAN (R 4.2.0)
    -#>  cli             3.4.1         2022-09-23 [1] CRAN (R 4.2.1)
    -#>  clipr           0.8.0         2022-02-22 [1] CRAN (R 4.2.0)
    -#>  colorspace      2.0-3         2022-02-21 [1] CRAN (R 4.2.0)
    -#>  cpp11           0.4.3         2022-10-12 [1] CRAN (R 4.2.0)
    -#>  crayon          1.5.2         2022-09-29 [1] CRAN (R 4.2.0)
    -#>  curl            4.3.3         2022-10-06 [1] CRAN (R 4.2.0)
    -#>  data.table      1.14.4        2022-10-17 [1] CRAN (R 4.2.1)
    -#>  DBI             1.1.3         2022-06-18 [1] CRAN (R 4.2.0)
    -#>  dbplyr          2.2.1.9000    2022-11-03 [1] Github (tidyverse/dbplyr@f7b5596)
    -#>  digest          0.6.30        2022-10-18 [1] CRAN (R 4.2.0)
    -#>  dplyr         * 1.0.99.9000   2022-11-17 [1] local
    -#>  dtplyr          1.2.2         2022-08-20 [1] CRAN (R 4.2.0)
    -#>  ellipsis        0.3.2         2021-04-29 [1] CRAN (R 4.2.0)
    -#>  evaluate        0.18          2022-11-07 [1] CRAN (R 4.2.1)
    -#>  fansi           1.0.3         2022-03-24 [1] CRAN (R 4.2.0)
    -#>  farver          2.1.1         2022-07-06 [1] CRAN (R 4.2.0)
    -#>  fastmap         1.1.0         2021-01-25 [1] CRAN (R 4.2.0)
    -#>  forcats       * 0.5.2         2022-08-19 [1] CRAN (R 4.2.0)
    -#>  fs              1.5.2         2021-12-08 [1] CRAN (R 4.2.0)
    -#>  gargle          1.2.1.9000    2022-10-27 [1] Github (r-lib/gargle@69d3f28)
    -#>  generics        0.1.3         2022-07-05 [1] CRAN (R 4.2.0)
    -#>  ggplot2       * 3.4.0.9000    2022-11-10 [1] Github (tidyverse/ggplot2@4fea51b)
    -#>  glue            1.6.2         2022-02-24 [1] CRAN (R 4.2.0)
    -#>  googledrive     2.0.0         2021-07-08 [1] CRAN (R 4.2.0)
    -#>  googlesheets4   1.0.1         2022-08-13 [1] CRAN (R 4.2.0)
    -#>  gtable          0.3.1.9000    2022-09-25 [1] local
    -#>  haven           2.5.1         2022-08-22 [1] CRAN (R 4.2.0)
    -#>  highr           0.9           2021-04-16 [1] CRAN (R 4.2.0)
    -#>  hms             1.1.2         2022-08-19 [1] CRAN (R 4.2.0)
    -#>  htmltools       0.5.3         2022-07-18 [1] CRAN (R 4.2.0)
    -#>  httr            1.4.4         2022-08-17 [1] CRAN (R 4.2.0)
    -#>  ids             1.0.1         2017-05-31 [1] CRAN (R 4.2.0)
    -#>  isoband         0.2.6         2022-10-06 [1] CRAN (R 4.2.0)
    -#>  jquerylib       0.1.4         2021-04-26 [1] CRAN (R 4.2.0)
    -#>  jsonlite        1.8.3         2022-10-21 [1] CRAN (R 4.2.1)
    -#>  knitr           1.40          2022-08-24 [1] CRAN (R 4.2.0)
    -#>  labeling        0.4.2         2020-10-20 [1] CRAN (R 4.2.0)
    -#>  lattice         0.20-45       2021-09-22 [2] CRAN (R 4.2.1)
    -#>  lifecycle       1.0.3.9000    2022-10-10 [1] Github (r-lib/lifecycle@80a1e52)
    -#>  lubridate       1.9.0         2022-11-06 [1] CRAN (R 4.2.1)
    -#>  magrittr        2.0.3         2022-03-30 [1] CRAN (R 4.2.0)
    -#>  MASS            7.3-58.1      2022-08-03 [1] CRAN (R 4.2.0)
    -#>  Matrix          1.5-1         2022-09-13 [1] CRAN (R 4.2.0)
    -#>  memoise         2.0.1         2021-11-26 [1] CRAN (R 4.2.0)
    -#>  mgcv            1.8-41        2022-10-21 [1] CRAN (R 4.2.0)
    -#>  mime            0.12          2021-09-28 [1] CRAN (R 4.2.0)
    -#>  modelr          0.1.9         2022-08-19 [1] CRAN (R 4.2.0)
    -#>  munsell         0.5.0         2018-06-12 [1] CRAN (R 4.2.0)
    -#>  nlme            3.1-160       2022-10-10 [1] CRAN (R 4.2.0)
    -#>  openssl         2.0.4         2022-10-17 [1] CRAN (R 4.2.1)
    -#>  pillar          1.8.1         2022-08-19 [1] CRAN (R 4.2.0)
    -#>  pkgconfig       2.0.3         2019-09-22 [1] CRAN (R 4.2.0)
    -#>  prettyunits     1.1.1         2020-01-24 [1] CRAN (R 4.2.0)
    -#>  processx        3.8.0         2022-10-26 [1] CRAN (R 4.2.1)
    -#>  progress        1.2.2         2019-05-16 [1] CRAN (R 4.2.0)
    -#>  ps              1.7.2         2022-10-26 [1] CRAN (R 4.2.1)
    -#>  purrr         * 0.9000.0.9000 2022-11-10 [1] Github (tidyverse/purrr@aaaa58a)
    -#>  R6              2.5.1         2021-08-19 [1] CRAN (R 4.2.0)
    -#>  rappdirs        0.3.3         2021-01-31 [1] CRAN (R 4.2.0)
    -#>  RColorBrewer    1.1-3         2022-04-03 [1] CRAN (R 4.2.0)
    -#>  readr         * 2.1.3         2022-10-01 [1] CRAN (R 4.2.1)
    -#>  readxl          1.4.1         2022-08-17 [1] CRAN (R 4.2.0)
    -#>  rematch         1.0.1         2016-04-21 [1] CRAN (R 4.2.0)
    -#>  rematch2        2.1.2         2020-05-01 [1] CRAN (R 4.2.0)
    -#>  reprex          2.0.2         2022-08-17 [1] CRAN (R 4.2.0)
    -#>  rlang           1.0.6         2022-09-24 [1] CRAN (R 4.2.0)
    -#>  rmarkdown       2.18          2022-11-09 [1] CRAN (R 4.2.1)
    -#>  rstudioapi      0.14          2022-08-22 [1] CRAN (R 4.2.0)
    -#>  rvest           1.0.3         2022-08-19 [1] CRAN (R 4.2.0)
    -#>  sass            0.4.2         2022-07-16 [1] CRAN (R 4.2.0)
    -#>  scales          1.2.1         2022-08-20 [1] CRAN (R 4.2.0)
    -#>  selectr         0.4-2         2019-11-20 [1] CRAN (R 4.2.0)
    -#>  stringi         1.7.8         2022-07-11 [1] CRAN (R 4.2.0)
    -#>  stringr       * 1.4.1.9000    2022-11-10 [1] Github (tidyverse/stringr@ebf3823)
    -#>  sys             3.4.1         2022-10-18 [1] CRAN (R 4.2.0)
    -#>  tibble        * 3.1.8         2022-07-22 [1] CRAN (R 4.2.0)
    -#>  tidyr         * 1.2.1.9001    2022-11-05 [1] Github (tidyverse/tidyr@9174795)
    -#>  tidyselect      1.2.0         2022-10-10 [1] CRAN (R 4.2.1)
    -#>  tidyverse     * 1.3.2         2022-07-18 [1] CRAN (R 4.2.0)
    -#>  timechange      0.1.1         2022-11-04 [1] CRAN (R 4.2.1)
    -#>  tinytex         0.42          2022-09-27 [1] CRAN (R 4.2.1)
    -#>  tzdb            0.3.0         2022-03-28 [1] CRAN (R 4.2.0)
    -#>  utf8            1.2.2         2021-07-24 [1] CRAN (R 4.2.0)
    -#>  uuid            1.1-0         2022-04-19 [1] CRAN (R 4.2.0)
    -#>  vctrs           0.5.0         2022-10-22 [1] CRAN (R 4.2.0)
    -#>  viridisLite     0.4.1         2022-08-22 [1] CRAN (R 4.2.0)
    -#>  vroom           1.6.0         2022-09-30 [1] CRAN (R 4.2.0)
    -#>  withr           2.5.0         2022-03-03 [1] CRAN (R 4.2.0)
    -#>  xfun            0.34          2022-10-18 [1] CRAN (R 4.2.1)
    -#>  xml2            1.3.3         2021-11-30 [1] CRAN (R 4.2.0)
    -#>  yaml            2.3.6         2022-10-18 [1] CRAN (R 4.2.0)
    -#> 
    -#>  [1] /Users/hadleywickham/Library/R/arm64/4.2/library
    -#>  [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
    -#> 
    -#> ──────────────────────────────────────────────────────────────────────────────
    -cli:::ruler()
    -#> ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
    -#> 12345678901234567890123456789012345678901234567890123456789012345678901234567890
    +
    cli:::ruler()
    +#> ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+--
    +#> 12345678901234567890123456789012345678901234567890123456789012345678901234567
    diff --git a/oreilly/iteration.html b/oreilly/iteration.html index 9039d3d..a6068d2 100644 --- a/oreilly/iteration.html +++ b/oreilly/iteration.html @@ -1,13 +1,5 @@
    -

    Iteration

    -
    - -
    - -
    - -

    You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

    - +

    Iteration

    ::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

    Introduction

    @@ -226,9 +218,10 @@ df_miss |> n = n() ) #> # A tibble: 1 × 9 -#> a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss n -#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <int> -#> 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5 +#> a_median a_n_miss b_median b_n_miss c_median c_n_miss d_med…¹ d_n_m…² n +#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <int> +#> 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5 +#> # … with abbreviated variable names ¹​d_median, ²​d_n_miss

    If you look carefully, you might intuit that the columns are named using a glue specification (#sec-glue) like {.col}_{.fn} where .col is the name of the original column and .fn is the name of the function. That’s not a coincidence! As you’ll learn in the next section, you can use the .names argument to supply your own glue spec.
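
    The naming rule can be checked on a small example. This is a sketch using a made-up tibble (`df_demo`) and a hypothetical list of summary functions (`summary_fns`), not the chapter's `df_miss` data:

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up data with a few missing values.
df_demo <- tibble(a = c(1, NA, 3), b = c(4, 5, NA))

# Named list of functions; the names become the {.fn} part of each column name.
summary_fns <- list(
  median = \(x) median(x, na.rm = TRUE),
  n_miss = \(x) sum(is.na(x))
)

# Default spec: {.col}_{.fn}
default_names <- df_demo |>
  summarize(across(a:b, summary_fns)) |>
  names()

# Custom spec: function name first, then column name.
custom_names <- df_demo |>
  summarize(across(a:b, summary_fns, .names = "{.fn}_{.col}")) |>
  names()

print(default_names)
print(custom_names)
```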

    @@ -251,9 +244,10 @@ Column names n = n(), ) #> # A tibble: 1 × 9 -#> median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d n -#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <int> -#> 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5 +#> median_a n_miss_a median_b n_miss_b median_c n_miss_c media…¹ n_mis…² n +#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <int> +#> 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5 +#> # … with abbreviated variable names ¹​median_d, ²​n_miss_d

    The .names argument is particularly important when you use across() with mutate(). By default the output of across() is given the same names as the inputs. This means that across() inside of mutate() will replace existing columns. For example, here we use coalesce() to replace NAs with 0:
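
    A minimal sketch of both behaviours, assuming a small made-up tibble (`df_demo`) and an illustrative suffix (`_na_zero`) chosen only for this example:

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up data with one missing value per column.
df_demo <- tibble(a = c(1, NA, 3), b = c(NA, 5, 6))

# Default names: outputs reuse the input names, so a and b are overwritten.
replaced <- df_demo |>
  mutate(across(a:b, \(x) coalesce(x, 0)))

# Custom names: the original columns are kept and new ones are added.
added <- df_demo |>
  mutate(across(a:b, \(x) coalesce(x, 0), .names = "{.col}_na_zero"))

print(replaced)
print(added)
```

    Supplying `.names` is the safer choice when the original values still need to be inspected afterwards.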

    @@ -930,8 +924,8 @@ DBI::dbCreateTable(con, "gapminder", template)
    con |> tbl("gapminder")
     #> # Source:   table<gapminder> [0 x 6]
     #> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
    -#> # … with 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, pop <dbl>,
    -#> #   gdpPercap <dbl>, year <dbl>
    +#> # … with 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, +#> # pop <dbl>, gdpPercap <dbl>, year <dbl>

    Next, we need a function that takes a single file path, reads it into R, and adds the result to the gapminder table. We can do that by combining read_excel() with DBI::dbAppendTable():
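
    The shape of such a function can be sketched as follows. To stay self-contained, this version swaps in `read.csv()` and two made-up CSV files in place of `read_excel()` and the real spreadsheets; the table name `gapminder_demo`, the file names, and the `append_file()` helper are all assumptions made for illustration.

```r
library(DBI)

# In-memory DuckDB database (assumes the duckdb package is installed).
con <- dbConnect(duckdb::duckdb())

# Two made-up CSV files standing in for the per-year Excel files.
paths <- file.path(tempdir(), c("1952.csv", "1957.csv"))
write.csv(data.frame(country = c("A", "B"), pop = c(1, 2)), paths[1], row.names = FALSE)
write.csv(data.frame(country = c("A", "B"), pop = c(3, 4)), paths[2], row.names = FALSE)

# Create an empty table whose columns match a template data frame.
template <- read.csv(paths[1])
template$year <- 1952
dbCreateTable(con, "gapminder_demo", template)

# Hypothetical helper: read one file, add the year taken from the file name,
# and append the rows to the existing table.
append_file <- function(path) {
  df <- read.csv(path)
  df$year <- as.numeric(tools::file_path_sans_ext(basename(path)))
  dbAppendTable(con, "gapminder_demo", df)
}

for (path in paths) append_file(path)

n_rows <- dbGetQuery(con, "SELECT count(*) AS n FROM gapminder_demo")$n
print(n_rows)

dbDisconnect(con, shutdown = TRUE)
```

    Appending one file at a time means only a single file's worth of data needs to be held in memory at once.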

    diff --git a/oreilly/joins.html b/oreilly/joins.html index 0b17840..fdda01c 100644 --- a/oreilly/joins.html +++ b/oreilly/joins.html @@ -1,13 +1,5 @@
    -

    Joins

    -
    - -
    - -
    - -

    You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

    - +

    Joins

    ::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

    Introduction

    @@ -57,14 +49,14 @@ Primary and foreign keys
    airports
     #> # A tibble: 1,458 × 8
    -#>   faa   name                             lat   lon   alt    tz dst   tzone      
    -#>   <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>      
    -#> 1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America/Ne…
    -#> 2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America/Ch…
    -#> 3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America/Ch…
    -#> 4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America/Ne…
    -#> 5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America/Ne…
    -#> 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America/Ne…
    +#>   faa   name                             lat   lon   alt    tz dst   tzone   
    +#>   <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>   
    +#> 1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America…
    +#> 2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America…
    +#> 3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America…
    +#> 4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America…
    +#> 5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America…
    +#> 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America…
     #> # … with 1,452 more rows
    @@ -73,14 +65,14 @@ Primary and foreign keys
    planes
     #> # A tibble: 3,322 × 9
    -#>   tailnum  year type                    manuf…¹ model engines seats speed engine
    -#>   <chr>   <int> <chr>                   <chr>   <chr>   <int> <int> <int> <chr> 
    -#> 1 N10156   2004 Fixed wing multi engine EMBRAER EMB-…       2    55    NA Turbo…
    -#> 2 N102UW   1998 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
    -#> 3 N103US   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
    -#> 4 N104UW   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
    -#> 5 N10575   2002 Fixed wing multi engine EMBRAER EMB-…       2    55    NA Turbo…
    -#> 6 N105UW   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
    +#>   tailnum  year type                 manuf…¹ model engines seats speed engine
    +#>   <chr>   <int> <chr>                <chr>   <chr>   <int> <int> <int> <chr> 
    +#> 1 N10156   2004 Fixed wing multi en… EMBRAER EMB-…       2    55    NA Turbo…
    +#> 2 N102UW   1998 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo…
    +#> 3 N103US   1999 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo…
    +#> 4 N104UW   1999 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo…
    +#> 5 N10575   2002 Fixed wing multi en… EMBRAER EMB-…       2    55    NA Turbo…
    +#> 6 N105UW   1999 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo…
     #> # … with 3,316 more rows, and abbreviated variable name ¹​manufacturer
    @@ -89,16 +81,17 @@ Primary and foreign keys
    weather
     #> # A tibble: 26,115 × 15
    -#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
    -#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
    -#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
    -#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
    -#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
    -#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
    -#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
    -#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA
    -#> # … with 26,109 more rows, and 4 more variables: precip <dbl>, pressure <dbl>,
    -#> #   visib <dbl>, time_hour <dttm>
    +#> origin year month day hour temp dewp humid wind_dir wind_sp…¹ wind_…² +#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> +#> 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA +#> 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA +#> 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA +#> 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA +#> 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA +#> 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA +#> # … with 26,109 more rows, 4 more variables: precip <dbl>, pressure <dbl>, +#> # visib <dbl>, time_hour <dttm>, and abbreviated variable names +#> # ¹​wind_speed, ²​wind_gust

A foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example:
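
Key relationships like this can also be checked in code. The sketch below uses two small made-up tables (`planes_demo` and `flights_demo`) rather than the nycflights13 data: it checks that a candidate primary key is unique, then looks for foreign key values that have no match.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up tables: tailnum is the primary key of planes_demo and a
# foreign key in flights_demo.
planes_demo <- tibble(tailnum = c("N1", "N2", "N3"), seats = c(55, 182, 182))
flights_demo <- tibble(flight = c(10, 11, 12), tailnum = c("N1", "N3", "N9"))

# Primary key check: no key value should appear more than once.
duplicate_keys <- planes_demo |>
  count(tailnum) |>
  filter(n > 1)

# Foreign key check: rows whose key has no match in the other table.
unmatched <- flights_demo |>
  anti_join(planes_demo, join_by(tailnum))

print(duplicate_keys)
print(unmatched)
```

An empty `duplicate_keys` suggests the key is unique; any rows in `unmatched` point to foreign key values that are missing from the primary key table.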

@@ -147,8 +140,8 @@ weather |> filter(is.na(tailnum)) #> # A tibble: 0 × 9 #> # … with 9 variables: tailnum <chr>, year <int>, type <chr>, -#> # manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>, -#> # engine <chr> +#> # manufacturer <chr>, model <chr>, engines <int>, seats <int>, +#> # speed <int>, engine <chr> weather |> filter(is.na(time_hour) | is.na(origin)) @@ -189,18 +182,19 @@ Surrogate keys mutate(id = row_number(), .before = 1) flights2 #> # A tibble: 336,776 × 20 -#> id year month day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ -#> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> -#> 1 1 2013 1 1 517 515 2 830 819 11 -#> 2 2 2013 1 1 533 529 4 850 830 20 -#> 3 3 2013 1 1 542 540 2 923 850 33 -#> 4 4 2013 1 1 544 545 -1 1004 1022 -18 -#> 5 5 2013 1 1 554 600 -6 812 837 -25 -#> 6 6 2013 1 1 554 558 -4 740 728 12 +#> id year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ +#> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> +#> 1 1 2013 1 1 517 515 2 830 819 11 +#> 2 2 2013 1 1 533 529 4 850 830 20 +#> 3 3 2013 1 1 542 540 2 923 850 33 +#> 4 4 2013 1 1 544 545 -1 1004 1022 -18 +#> 5 5 2013 1 1 554 600 -6 812 837 -25 +#> 6 6 2013 1 1 554 558 -4 740 728 12 #> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>, #> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, -#> # hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names -#> # ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay +#> # hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable +#> # names ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, +#> # ⁵​arr_delay

Surrogate keys can be particularly useful when communicating with other humans: it’s much easier to tell someone to take a look at flight 2001 than to say look at UA430, which departed at 9am on 2013-01-03.

@@ -247,14 +241,14 @@ flights2 left_join(airlines) #> Joining with `by = join_by(carrier)` #> # A tibble: 336,776 × 7 -#> year time_hour origin dest tailnum carrier name -#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> -#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines Inc. -#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines Inc. -#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines Inc. -#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways -#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc. -#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines Inc. +#> year time_hour origin dest tailnum carrier name +#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> +#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines In… +#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines In… +#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines I… +#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways +#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc. +#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines In… #> # … with 336,770 more rows

Or we could find out the temperature and wind speed when each plane departed:

@@ -279,14 +273,14 @@ flights2 left_join(planes |> select(tailnum, type, engines, seats)) #> Joining with `by = join_by(tailnum)` #> # A tibble: 336,776 × 9 -#> year time_hour origin dest tailnum carrier type engines seats -#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <int> <int> -#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed wi… 2 149 -#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed wi… 2 149 -#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed wi… 2 178 -#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed wi… 2 200 -#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed wi… 2 178 -#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wi… 2 191 +#> year time_hour origin dest tailnum carrier type engines seats +#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <int> <int> +#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed… 2 149 +#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed… 2 149 +#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed… 2 178 +#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed… 2 200 +#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed… 2 178 +#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed… 2 191 #> # … with 336,770 more rows

When left_join() fails to find a match for a row in x, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number N3ALAA so the type, engines, and seats will be missing:

@@ -318,14 +312,14 @@ Specifying join keys left_join(planes) #> Joining with `by = join_by(year, tailnum)` #> # A tibble: 336,776 × 13 -#> year time_hour origin dest tailnum carrier type manufactu…¹ model -#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr> -#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA <NA> <NA> <NA> -#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA <NA> <NA> <NA> -#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA <NA> <NA> <NA> -#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> <NA> <NA> -#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL <NA> <NA> <NA> -#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA <NA> <NA> <NA> +#> year time_hour origin dest tailnum carrier type manufa…¹ model +#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr> +#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA <NA> <NA> <NA> +#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA <NA> <NA> <NA> +#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA <NA> <NA> <NA> +#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> <NA> <NA> +#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL <NA> <NA> <NA> +#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA <NA> <NA> <NA> #> # … with 336,770 more rows, 4 more variables: engines <int>, seats <int>, #> # speed <int>, engine <chr>, and abbreviated variable name ¹​manufacturer @@ -334,17 +328,16 @@ Specifying join keys
flights2 |> 
   left_join(planes, join_by(tailnum))
 #> # A tibble: 336,776 × 14
-#>   year.x time_hour           origin dest  tailnum carrier year.y type    manuf…¹
-#>    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>   <chr>  
-#> 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed … BOEING 
-#> 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed … BOEING 
-#> 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed … BOEING 
-#> 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed … AIRBUS 
-#> 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed … BOEING 
-#> 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed … BOEING 
-#> # … with 336,770 more rows, 5 more variables: model <chr>, engines <int>,
-#> #   seats <int>, speed <int>, engine <chr>, and abbreviated variable name
-#> #   ¹​manufacturer
+#> year.x time_hour origin dest tailnum carrier year.y type +#> <int> <dttm> <chr> <chr> <chr> <chr> <int> <chr> +#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed wing … +#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed wing … +#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed wing … +#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed wing … +#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed wing … +#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed wing … +#> # … with 336,770 more rows, and 6 more variables: manufacturer <chr>, +#> # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

Note that the year variables are disambiguated in the output with a suffix (year.x and year.y), which tells you whether the variable came from the x or y argument. You can override the default suffixes with the suffix argument.
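The suffix behaviour can be sketched on two tiny tables. This is a minimal illustration, not code from the book: `flights_mini` and `planes_mini` are hypothetical stand-ins for `flights2` and `planes`, and the suffix strings `"_flight"` and `"_plane"` are arbitrary choices.

```r
library(dplyr)

# Hypothetical miniature tables; only the column names mirror flights2 and planes.
flights_mini <- tibble(
  year = c(2013L, 2013L),
  tailnum = c("N14228", "N619AA")
)
planes_mini <- tibble(
  tailnum = c("N14228", "N619AA"),
  year = c(1999L, 1990L)
)

# Default suffixes: the clashing year columns are renamed with .x and .y.
default_names <- flights_mini |>
  left_join(planes_mini, join_by(tailnum)) |>
  names()

# Custom suffixes say which table each year column came from.
custom_names <- flights_mini |>
  left_join(planes_mini, join_by(tailnum), suffix = c("_flight", "_plane")) |>
  names()

default_names
custom_names
```

Only non-key columns that share a name get a suffix; the join key `tailnum` appears once and keeps its name.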

join_by(tailnum) is short for join_by(tailnum == tailnum). It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an equi-join. You’ll learn about non-equi-joins in #sec-non-equi-joins.
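To check that the short and full forms describe the same join, one could compare their results on a small example. The tables `x` and `y` below are hypothetical and exist only for this sketch.

```r
library(dplyr)

# Hypothetical toy tables sharing a tailnum key; N3ALAA has no match in y.
x <- tibble(tailnum = c("N14228", "N3ALAA"), carrier = c("UA", "AA"))
y <- tibble(tailnum = "N14228", seats = 149L)

# Short form: the key has the same name in both tables.
short <- x |> left_join(y, join_by(tailnum))

# Full form: spell out that x's tailnum must equal y's tailnum.
full <- x |> left_join(y, join_by(tailnum == tailnum))

# Compare the two results.
same_result <- isTRUE(all.equal(short, full))
same_result
```

Both calls should produce the same table, with `seats` missing for the unmatched tail number.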

@@ -353,30 +346,30 @@ Specifying join keys
flights2 |> 
   left_join(airports, join_by(dest == faa))
 #> # A tibble: 336,776 × 13
-#>    year time_hour           origin dest  tailnum carrier name    lat   lon   alt
-#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
-#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Geor…  30.0 -95.3    97
-#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Geor…  30.0 -95.3    97
-#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miam…  25.8 -80.3     8
-#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>   NA    NA      NA
-#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hart…  33.6 -84.4  1026
-#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chic…  42.0 -87.9   668
-#> # … with 336,770 more rows, and 3 more variables: tz <dbl>, dst <chr>,
-#> #   tzone <chr>
+#>    year time_hour           origin dest  tailnum carrier name       lat   lon
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>    <dbl> <dbl>
+#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George …  30.0 -95.3
+#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George …  30.0 -95.3
+#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami I…  25.8 -80.3
+#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>      NA    NA  
+#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfi…  33.6 -84.4
+#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago…  42.0 -87.9
+#> # … with 336,770 more rows, and 4 more variables: alt <dbl>, tz <dbl>,
+#> #   dst <chr>, tzone <chr>
 
 flights2 |> 
   left_join(airports, join_by(origin == faa))
 #> # A tibble: 336,776 × 13
-#>    year time_hour           origin dest  tailnum carrier name    lat   lon   alt
-#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
-#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newa…  40.7 -74.2    18
-#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La G…  40.8 -73.9    22
-#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John…  40.6 -73.8    13
-#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John…  40.6 -73.8    13
-#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La G…  40.8 -73.9    22
-#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newa…  40.7 -74.2    18
-#> # … with 336,770 more rows, and 3 more variables: tz <dbl>, dst <chr>,
-#> #   tzone <chr>
+#> year time_hour origin dest tailnum carrier name lat lon +#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> +#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark … 40.7 -74.2 +#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guar… 40.8 -73.9 +#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F … 40.6 -73.8 +#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F … 40.6 -73.8 +#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guar… 40.8 -73.9 +#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark … 40.7 -74.2 +#> # … with 336,770 more rows, and 4 more variables: alt <dbl>, tz <dbl>, +#> # dst <chr>, tzone <chr>

In older code you might see a different way of specifying the join keys, using a character vector:

  • @@ -405,14 +398,14 @@ Filtering joins
    airports |> 
       semi_join(flights2, join_by(faa == dest))
     #> # A tibble: 101 × 8
    -#>   faa   name                                lat    lon   alt    tz dst   tzone  
    -#>   <chr> <chr>                             <dbl>  <dbl> <dbl> <dbl> <chr> <chr>  
    -#> 1 ABQ   Albuquerque International Sunport  35.0 -107.   5355    -7 A     Americ…
    -#> 2 ACK   Nantucket Mem                      41.3  -70.1    48    -5 A     Americ…
    -#> 3 ALB   Albany Intl                        42.7  -73.8   285    -5 A     Americ…
    -#> 4 ANC   Ted Stevens Anchorage Intl         61.2 -150.    152    -9 A     Americ…
    -#> 5 ATL   Hartsfield Jackson Atlanta Intl    33.6  -84.4  1026    -5 A     Americ…
    -#> 6 AUS   Austin Bergstrom Intl              30.2  -97.7   542    -6 A     Americ…
    +#>   faa   name                               lat    lon   alt    tz dst   tzone
    +#>   <chr> <chr>                            <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
    +#> 1 ABQ   Albuquerque International Sunpo…  35.0 -107.   5355    -7 A     Amer…
    +#> 2 ACK   Nantucket Mem                     41.3  -70.1    48    -5 A     Amer…
    +#> 3 ALB   Albany Intl                       42.7  -73.8   285    -5 A     Amer…
    +#> 4 ANC   Ted Stevens Anchorage Intl        61.2 -150.    152    -9 A     Amer…
    +#> 5 ATL   Hartsfield Jackson Atlanta Intl   33.6  -84.4  1026    -5 A     Amer…
    +#> 6 AUS   Austin Bergstrom Intl             30.2  -97.7   542    -6 A     Amer…
     #> # … with 95 more rows

    Anti-joins are the opposite: they return all rows in x that don’t have a match in y. They’re useful for finding missing values that are implicit in the data, the topic of #sec-missing-implicit. Implicitly missing values don’t show up as NAs but instead only exist as an absence. For example, we can find rows that are missing from airports by looking for flights that don’t have a matching destination airport:

    @@ -664,14 +657,14 @@ Allow multiple rows plane_flights #> # A tibble: 284,170 × 9 -#> tailnum type engines seats year time_hour origin dest carrier -#> <chr> <chr> <int> <int> <int> <dttm> <chr> <chr> <chr> -#> 1 N10156 Fixed wi… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV -#> 2 N10156 Fixed wi… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV -#> 3 N10156 Fixed wi… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV -#> 4 N10156 Fixed wi… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV -#> 5 N10156 Fixed wi… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV -#> 6 N10156 Fixed wi… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV +#> tailnum type engines seats year time_hour origin dest carrier +#> <chr> <chr> <int> <int> <int> <dttm> <chr> <chr> <chr> +#> 1 N10156 Fixed… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV +#> 2 N10156 Fixed… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV +#> 3 N10156 Fixed… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV +#> 4 N10156 Fixed… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV +#> 5 N10156 Fixed… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV +#> 6 N10156 Fixed… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV #> # … with 284,164 more rows
diff --git a/oreilly/logicals.html b/oreilly/logicals.html index 782a969..85a2a94 100644 --- a/oreilly/logicals.html +++ b/oreilly/logicals.html @@ -1,13 +1,5 @@
-

Logical vectors

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Logical vectors

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

@@ -55,14 +47,14 @@ Comparisons
flights |> 
   filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
 #> # A tibble: 172,286 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      601         600       1     844     850      -6 B6     
-#> 2  2013     1     1      602         610      -8     812     820      -8 DL     
-#> 3  2013     1     1      602         605      -3     821     805      16 MQ     
-#> 4  2013     1     1      606         610      -4     858     910     -12 AA     
-#> 5  2013     1     1      606         610      -4     837     845      -8 DL     
-#> 6  2013     1     1      607         607       0     858     915     -17 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      601      600       1     844     850      -6 B6     
+#> 2  2013     1     1      602      610      -8     812     820      -8 DL     
+#> 3  2013     1     1      602      605      -3     821     805      16 MQ     
+#> 4  2013     1     1      606      610      -4     858     910     -12 AA     
+#> 5  2013     1     1      606      610      -4     837     845      -8 DL     
+#> 6  2013     1     1      607      607       0     858     915     -17 UA     
 #> # … with 172,280 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -185,14 +177,14 @@ is.na(c("a", NA, "b"))
 
flights |> 
   filter(is.na(dep_time))
 #> # A tibble: 8,255 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1       NA        1630      NA      NA    1815      NA EV     
-#> 2  2013     1     1       NA        1935      NA      NA    2240      NA AA     
-#> 3  2013     1     1       NA        1500      NA      NA    1825      NA AA     
-#> 4  2013     1     1       NA         600      NA      NA     901      NA B6     
-#> 5  2013     1     2       NA        1540      NA      NA    1747      NA EV     
-#> 6  2013     1     2       NA        1620      NA      NA    1746      NA EV     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1       NA     1630      NA      NA    1815      NA EV     
+#> 2  2013     1     1       NA     1935      NA      NA    2240      NA AA     
+#> 3  2013     1     1       NA     1500      NA      NA    1825      NA AA     
+#> 4  2013     1     1       NA      600      NA      NA     901      NA B6     
+#> 5  2013     1     2       NA     1540      NA      NA    1747      NA EV     
+#> 6  2013     1     2       NA     1620      NA      NA    1746      NA EV     
 #> # … with 8,249 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -204,14 +196,14 @@ is.na(c("a", NA, "b"))
   filter(month == 1, day == 1) |> 
   arrange(dep_time)
 #> # A tibble: 842 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      533         529       4     850     830      20 UA     
-#> 3  2013     1     1      542         540       2     923     850      33 AA     
-#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
-#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      533      529       4     850     830      20 UA     
+#> 3  2013     1     1      542      540       2     923     850      33 AA     
+#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554      558      -4     740     728      12 UA     
 #> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -221,14 +213,14 @@ flights |>
   filter(month == 1, day == 1) |> 
   arrange(desc(is.na(dep_time)), dep_time)
 #> # A tibble: 842 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1       NA        1630      NA      NA    1815      NA EV     
-#> 2  2013     1     1       NA        1935      NA      NA    2240      NA AA     
-#> 3  2013     1     1       NA        1500      NA      NA    1825      NA AA     
-#> 4  2013     1     1       NA         600      NA      NA     901      NA B6     
-#> 5  2013     1     1      517         515       2     830     819      11 UA     
-#> 6  2013     1     1      533         529       4     850     830      20 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1       NA     1630      NA      NA    1815      NA EV     
+#> 2  2013     1     1       NA     1935      NA      NA    2240      NA AA     
+#> 3  2013     1     1       NA     1500      NA      NA    1825      NA AA     
+#> 4  2013     1     1       NA      600      NA      NA     901      NA B6     
+#> 5  2013     1     1      517      515       2     830     819      11 UA     
+#> 6  2013     1     1      533      529       4     850     830      20 UA     
 #> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -294,14 +286,14 @@ Order of operations
 
flights |> 
    filter(month == 11 | 12)
 #> # A tibble: 336,776 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      533         529       4     850     830      20 UA     
-#> 3  2013     1     1      542         540       2     923     850      33 AA     
-#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
-#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      533      529       4     850     830      20 UA     
+#> 3  2013     1     1      542      540       2     923     850      33 AA     
+#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554      558      -4     740     728      12 UA     
 #> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -356,14 +348,14 @@ c(1, 2, NA) %in% NA
 
flights |> 
   filter(dep_time %in% c(NA, 0800))
 #> # A tibble: 8,803 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      800         800       0    1022    1014       8 DL     
-#> 2  2013     1     1      800         810     -10     949     955      -6 MQ     
-#> 3  2013     1     1       NA        1630      NA      NA    1815      NA EV     
-#> 4  2013     1     1       NA        1935      NA      NA    2240      NA AA     
-#> 5  2013     1     1       NA        1500      NA      NA    1825      NA AA     
-#> 6  2013     1     1       NA         600      NA      NA     901      NA B6     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      800      800       0    1022    1014       8 DL     
+#> 2  2013     1     1      800      810     -10     949     955      -6 MQ     
+#> 3  2013     1     1       NA     1630      NA      NA    1815      NA EV     
+#> 4  2013     1     1       NA     1935      NA      NA    2240      NA AA     
+#> 5  2013     1     1       NA     1500      NA      NA    1825      NA AA     
+#> 6  2013     1     1       NA      600      NA      NA     901      NA B6     
 #> # … with 8,797 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
diff --git a/oreilly/missing-values.html b/oreilly/missing-values.html
index 5b008d4..e499f52 100644
--- a/oreilly/missing-values.html
+++ b/oreilly/missing-values.html
@@ -1,13 +1,5 @@
 
-

Missing values

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Missing values

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

diff --git a/oreilly/numbers.html b/oreilly/numbers.html index 7d33abe..95dd353 100644 --- a/oreilly/numbers.html +++ b/oreilly/numbers.html @@ -1,13 +1,5 @@
-

Numbers

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Numbers

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

@@ -218,14 +210,14 @@ x * c(1, 2, 3)
flights |> 
   filter(month == c(1, 2))
 #> # A tibble: 25,977 × 19
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1      542         540       2     923     850      33 AA     
-#> 3  2013     1     1      554         600      -6     812     837     -25 DL     
-#> 4  2013     1     1      555         600      -5     913     854      19 B6     
-#> 5  2013     1     1      557         600      -3     838     846      -8 B6     
-#> 6  2013     1     1      558         600      -2     849     851      -2 B6     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1      542      540       2     923     850      33 AA     
+#> 3  2013     1     1      554      600      -6     812     837     -25 DL     
+#> 4  2013     1     1      555      600      -5     913     854      19 B6     
+#> 5  2013     1     1      557      600      -3     838     846      -8 B6     
+#> 6  2013     1     1      558      600      -2     849     851      -2 B6     
 #> # … with 25,971 more rows, 9 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
@@ -759,8 +751,8 @@ Positions
     fifth_dep = nth(dep_time, 5),
     last_dep = last(dep_time)
   )
-#> `summarise()` has grouped output by 'year', 'month'. You can override using the
-#> `.groups` argument.
+#> `summarise()` has grouped output by 'year', 'month'. You can override using
+#> the `.groups` argument.
 #> # A tibble: 365 × 6
 #> # Groups:   year, month [12]
 #>    year month   day first_dep fifth_dep last_dep
@@ -783,14 +775,14 @@ Positions
   filter(r %in% c(1, max(r)))
 #> # A tibble: 1,195 × 20
 #> # Groups:   year, month, day [365]
-#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
-#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
-#> 1  2013     1     1      517         515       2     830     819      11 UA     
-#> 2  2013     1     1     2353        2359      -6     425     445     -20 B6     
-#> 3  2013     1     1     2353        2359      -6     418     442     -24 B6     
-#> 4  2013     1     1     2356        2359      -3     425     437     -12 B6     
-#> 5  2013     1     2       42        2359      43     518     442      36 B6     
-#> 6  2013     1     2      458         500      -2     703     650      13 US     
+#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517      515       2     830     819      11 UA     
+#> 2  2013     1     1     2353     2359      -6     425     445     -20 B6     
+#> 3  2013     1     1     2353     2359      -6     418     442     -24 B6     
+#> 4  2013     1     1     2356     2359      -3     425     437     -12 B6     
+#> 5  2013     1     2       42     2359      43     518     442      36 B6     
+#> 6  2013     1     2      458      500      -2     703     650      13 US     
 #> # … with 1,189 more rows, 10 more variables: flight <int>, tailnum <chr>,
 #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
 #> #   minute <dbl>, time_hour <dttm>, r <int>, and abbreviated variable names
diff --git a/oreilly/quarto-formats.html b/oreilly/quarto-formats.html
index 134fc59..c4306df 100644
--- a/oreilly/quarto-formats.html
+++ b/oreilly/quarto-formats.html
@@ -1,13 +1,5 @@
 
-

Quarto formats

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Quarto formats

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

diff --git a/oreilly/quarto-workflow.html b/oreilly/quarto-workflow.html index fbc3b60..ce9416e 100644 --- a/oreilly/quarto-workflow.html +++ b/oreilly/quarto-workflow.html @@ -1,13 +1,5 @@
-

Quarto workflow

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

-

Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the console, then capture what works in the script editor. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.

Quarto is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:

  • Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!

  • +

    Quarto workflow

    ::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

    Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the console, then capture what works in the script editor. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.

    Quarto is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:

    • Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!

    • Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.

    • Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share not only what you’ve done, but why you did it with your colleagues or lab mates.

    Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. We’ve drawn on our own experiences and Colin Purrington’s advice on lab notebooks (https://colinpurrington.com/tips/lab-notebooks) to come up with the following tips:

    • Ensure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis.

    • diff --git a/oreilly/quarto.html b/oreilly/quarto.html index 22f0104..7896baf 100644 --- a/oreilly/quarto.html +++ b/oreilly/quarto.html @@ -1,13 +1,5 @@
      -

      Quarto

      -
      - -
      - -
      - -

      You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

      - +

      Quarto

      ::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

      Introduction

      diff --git a/oreilly/rectangling.html b/oreilly/rectangling.html index 177321c..6fc785a 100644 --- a/oreilly/rectangling.html +++ b/oreilly/rectangling.html @@ -1,29 +1,5 @@
      -

      Data rectangling

      -
      - -
      - -

      -Base R -

      You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

      - -

      It’s possible to put a list in a column of a data.frame, but it’s a lot fiddlier because data.frame() treats a list as a list of columns:

      -
      data.frame(x = list(1:3, 3:5))
      -#>   x.1.3 x.3.5
      -#> 1     1     3
      -#> 2     2     4
      -#> 3     3     5
      -

      You can force data.frame() to treat a list as a list of rows by wrapping it in I(), but the result doesn’t print particularly well:

      -
      data.frame(
      -  x = I(list(1:2, 3:5)), 
      -  y = c("1, 2", "3, 4, 5")
      -)
      -#>         x       y
      -#> 1    1, 2    1, 2
      -#> 2 3, 4, 5 3, 4, 5
      -

      It’s easier to use list-columns with tibbles because tibble() treats lists like vectors and the print method has been designed with lists in mind.

      - +

      Data rectangling

      ::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

      Introduction

      @@ -198,9 +174,7 @@ df

      Similarly, if you View() a data frame in RStudio, you’ll get the standard tabular view, which doesn’t allow you to selectively expand list columns. To explore those fields you’ll need to pull() and view, e.g. df |> pull(z) |> View().

      Base R -

      You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

      - -

      It’s possible to put a list in a column of a data.frame, but it’s a lot fiddlier because data.frame() treats a list as a list of columns:

      +

      It’s possible to put a list in a column of a data.frame, but it’s a lot fiddlier because data.frame() treats a list as a list of columns:

      data.frame(x = list(1:3, 3:5))
       #>   x.1.3 x.3.5
       #> 1     1     3
      @@ -486,15 +460,15 @@ repos
         unnest_longer(json) |> 
         unnest_wider(json) 
       #> # A tibble: 176 × 68
      -#>        id name  full_…¹ owner        private html_…² descr…³ fork  url   forks…⁴
      -#>     <int> <chr> <chr>   <list>       <lgl>   <chr>   <chr>   <lgl> <chr> <chr>  
      -#> 1  6.12e7 after gaborc… <named list> FALSE   https:… Run Co… FALSE http… https:…
      -#> 2  4.05e7 argu… gaborc… <named list> FALSE   https:… Declar… FALSE http… https:…
      -#> 3  3.64e7 ask   gaborc… <named list> FALSE   https:… Friend… FALSE http… https:…
      -#> 4  3.49e7 base… gaborc… <named list> FALSE   https:… Do we … FALSE http… https:…
      -#> 5  6.16e7 cite… gaborc… <named list> FALSE   https:… Test R… TRUE  http… https:…
      -#> 6  3.39e7 clis… gaborc… <named list> FALSE   https:… Unicod… FALSE http… https:…
      -#> # … with 170 more rows, 58 more variables: keys_url <chr>,
      +#>         id name      full_…¹ owner        private html_…² descr…³ fork  url  
      +#>      <int> <chr>     <chr>   <list>       <lgl>   <chr>   <chr>   <lgl> <chr>
      +#> 1 61160198 after     gaborc… <named list> FALSE   https:… Run Co… FALSE http…
      +#> 2 40500181 argufy    gaborc… <named list> FALSE   https:… Declar… FALSE http…
      +#> 3 36442442 ask       gaborc… <named list> FALSE   https:… Friend… FALSE http…
      +#> 4 34924886 baseimpo… gaborc… <named list> FALSE   https:… Do we … FALSE http…
      +#> 5 61620661 citest    gaborc… <named list> FALSE   https:… Test R… TRUE  http…
      +#> 6 33907457 clisymbo… gaborc… <named list> FALSE   https:… Unicod… FALSE http…
      +#> # … with 170 more rows, 59 more variables: forks_url <chr>, keys_url <chr>,
       #> #   collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
       #> #   issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
       #> #   branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
      @@ -539,14 +513,14 @@ repos
         unnest_wider(json) |> 
         select(id, full_name, owner, description)
       #> # A tibble: 176 × 4
      -#>         id full_name               owner             description                
      -#>      <int> <chr>                   <list>            <chr>                      
      -#> 1 61160198 gaborcsardi/after       <named list [17]> Run Code in the Background 
      -#> 2 40500181 gaborcsardi/argufy      <named list [17]> Declarative function argum…
      -#> 3 36442442 gaborcsardi/ask         <named list [17]> Friendly CLI interaction i…
      -#> 4 34924886 gaborcsardi/baseimports <named list [17]> Do we get warnings for und…
      -#> 5 61620661 gaborcsardi/citest      <named list [17]> Test R package and repo fo…
      -#> 6 33907457 gaborcsardi/clisymbols  <named list [17]> Unicode symbols for CLI ap…
      +#>         id full_name               owner             description             
      +#>      <int> <chr>                   <list>            <chr>                   
      +#> 1 61160198 gaborcsardi/after       <named list [17]> Run Code in the Backgro…
      +#> 2 40500181 gaborcsardi/argufy      <named list [17]> Declarative function ar…
      +#> 3 36442442 gaborcsardi/ask         <named list [17]> Friendly CLI interactio…
      +#> 4 34924886 gaborcsardi/baseimports <named list [17]> Do we get warnings for …
      +#> 5 61620661 gaborcsardi/citest      <named list [17]> Test R package and repo…
      +#> 6 33907457 gaborcsardi/clisymbols  <named list [17]> Unicode symbols for CLI…
       #> # … with 170 more rows

      You can use this to work back to understand how gh_repos was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.
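The nesting described above can be sketched on a toy list. This is only an illustration: `toy_repos` and its field values are made up and stand in for gh_repos, which has many more fields per repository.

```r
library(tidyr)

# Hypothetical stand-in for gh_repos: one child per user, each holding
# an unnamed list of repositories (each repository is a named list).
toy_repos <- list(
  list(
    list(id = 1L, full_name = "user1/alpha"),
    list(id = 2L, full_name = "user1/beta")
  ),
  list(
    list(id = 3L, full_name = "user2/gamma")
  )
)

repos <- tibble(json = toy_repos) |>
  unnest_longer(json) |>  # unnamed children: one row per repository
  unnest_wider(json)      # named children: one column per field

repos
```

The order of the two steps mirrors the structure: the outer level is unnamed, so it goes into rows, and the inner level is named, so it goes into columns.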

      @@ -572,21 +546,21 @@ repos select(id, full_name, owner, description) |> unnest_wider(owner, names_sep = "_") #> # A tibble: 176 × 20 -#> id full_…¹ owner…² owner…³ owner…⁴ owner…⁵ owner…⁶ owner…⁷ owner…⁸ owner…⁹ -#> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> -#> 1 6.12e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:… -#> 2 4.05e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:… -#> 3 3.64e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:… -#> 4 3.49e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:… -#> 5 6.16e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:… -#> 6 3.39e7 gaborc… gaborc… 660288 https:… "" https:… https:… https:… https:… -#> # … with 170 more rows, 10 more variables: owner_gists_url <chr>, -#> # owner_starred_url <chr>, owner_subscriptions_url <chr>, -#> # owner_organizations_url <chr>, owner_repos_url <chr>, -#> # owner_events_url <chr>, owner_received_events_url <chr>, owner_type <chr>, -#> # owner_site_admin <lgl>, description <chr>, and abbreviated variable names -#> # ¹​full_name, ²​owner_login, ³​owner_id, ⁴​owner_avatar_url, ⁵​owner_gravatar_id, -#> # ⁶​owner_url, ⁷​owner_html_url, ⁸​owner_followers_url, ⁹​owner_following_url
+#> id full_name owner…¹ owner…² owner…³ owner…⁴ owner…⁵ owner…⁶ owner…⁷ +#> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> +#> 1 61160198 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:… +#> 2 40500181 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:… +#> 3 36442442 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:… +#> 4 34924886 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:… +#> 5 61620661 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:… +#> 6 33907457 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:… +#> # … with 170 more rows, 11 more variables: owner_following_url <chr>, +#> # owner_gists_url <chr>, owner_starred_url <chr>, +#> # owner_subscriptions_url <chr>, owner_organizations_url <chr>, +#> # owner_repos_url <chr>, owner_events_url <chr>, +#> # owner_received_events_url <chr>, owner_type <chr>, +#> # owner_site_admin <lgl>, description <chr>, and abbreviated variable +#> # names ¹​owner_login, ²​owner_id, ³​owner_avatar_url, ⁴​owner_gravatar_id, …

This gives another wide dataset, but you can see that owner appears to contain a lot of additional data about the person who “owns” the repository.

@@ -614,14 +588,14 @@ chars
chars |> 
   unnest_wider(json)
 #> # A tibble: 30 × 18
-#>   url            id name  gender culture born  died  alive titles aliases father
-#>   <chr>       <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
-#> 1 https://ww…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
-#> 2 https://ww…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
-#> 3 https://ww…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
-#> 4 https://ww…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
-#> 5 https://ww…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
-#> 6 https://ww…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
+#>   url         id name  gender culture born  died  alive titles aliases father
+#>   <chr>    <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
+#> 1 https:/…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
+#> 2 https:/…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
+#> 3 https:/…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
+#> 4 https:/…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
+#> 5 https:/…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
+#> 6 https:/…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
 #> # … with 24 more rows, and 7 more variables: mother <chr>, spouse <chr>,
 #> #   allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
 #> #   playedBy <list>
@@ -633,14 +607,14 @@ chars
   select(id, name, gender, culture, born, died, alive)
 characters
 #> # A tibble: 30 × 7
-#>      id name              gender culture    born                     died  alive
-#>   <int> <chr>             <chr>  <chr>      <chr>                    <chr> <lgl>
-#> 1  1022 Theon Greyjoy     Male   "Ironborn" "In 278 AC or 279 AC, a… ""    TRUE
-#> 2  1052 Tyrion Lannister  Male   ""         "In 273 AC, at Casterly… ""    TRUE
-#> 3  1074 Victarion Greyjoy Male   "Ironborn" "In 268 AC or before, a… ""    TRUE
-#> 4  1109 Will              Male   ""         ""                       "In … FALSE
-#> 5  1166 Areo Hotah        Male   "Norvoshi" "In 257 AC or before, a… ""    TRUE
-#> 6  1267 Chett             Male   ""         "At Hag's Mire"          "In … FALSE
+#>      id name              gender culture    born                  died  alive
+#>   <int> <chr>             <chr>  <chr>      <chr>                 <chr> <lgl>
+#> 1  1022 Theon Greyjoy     Male   "Ironborn" "In 278 AC or 279 AC… ""    TRUE
+#> 2  1052 Tyrion Lannister  Male   ""         "In 273 AC, at Caste… ""    TRUE
+#> 3  1074 Victarion Greyjoy Male   "Ironborn" "In 268 AC or before… ""    TRUE
+#> 4  1109 Will              Male   ""         ""                    "In … FALSE
+#> 5  1166 Areo Hotah        Male   "Norvoshi" "In 257 AC or before… ""    TRUE
+#> 6  1267 Chett             Male   ""         "At Hag's Mire"       "In … FALSE
 #> # … with 24 more rows

There are also many list-columns:

@@ -649,15 +623,15 @@ characters
   unnest_wider(json) |> 
   select(id, where(is.list))
 #> # A tibble: 30 × 8
-#>      id titles    aliases    allegiances books     povBooks  tvSeries  playedBy
-#>   <int> <list>    <list>     <list>      <list>    <list>    <list>    <list>
-#> 1  1022 <chr [3]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr [6]> <chr [1]>
-#> 2  1052 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr [6]> <chr [1]>
-#> 3  1074 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [1]> <chr [1]>
-#> 4  1109 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr [1]> <chr [1]> <chr [1]>
-#> 5  1166 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [2]> <chr [1]>
-#> 6  1267 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr [1]> <chr [1]>
-#> # … with 24 more rows
+#>      id titles    aliases    allegiances books     povBooks  tvSeries playe…¹
+#>   <int> <list>    <list>     <list>      <list>    <list>    <list>   <list>
+#> 1  1022 <chr [3]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr>    <chr>
+#> 2  1052 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr>    <chr>
+#> 3  1074 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr>    <chr>
+#> 4  1109 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr [1]> <chr>    <chr>
+#> 5  1166 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr>    <chr>
+#> 6  1267 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr>    <chr>
+#> # … with 24 more rows, and abbreviated variable name ¹playedBy

Let’s explore the titles column. It’s an unnamed list-column, so we’ll unnest it into rows:

@@ -713,14 +687,14 @@ characters |>
   select(id, name) |> 
   inner_join(titles, by = "id", multiple = "all")
 #> # A tibble: 53 × 3
-#>      id name              title
-#>   <int> <chr>             <chr>
-#> 1  1022 Theon Greyjoy     Prince of Winterfell
-#> 2  1022 Theon Greyjoy     Captain of Sea Bitch
-#> 3  1022 Theon Greyjoy     Lord of the Iron Islands (by law of the green lands)
-#> 4  1052 Tyrion Lannister  Acting Hand of the King (former)
-#> 5  1052 Tyrion Lannister  Master of Coin (former)
-#> 6  1074 Victarion Greyjoy Lord Captain of the Iron Fleet
+#>      id name              title
+#>   <int> <chr>             <chr>
+#> 1  1022 Theon Greyjoy     Prince of Winterfell
+#> 2  1022 Theon Greyjoy     Captain of Sea Bitch
+#> 3  1022 Theon Greyjoy     Lord of the Iron Islands (by law of the green land…
+#> 4  1052 Tyrion Lannister  Acting Hand of the King (former)
+#> 5  1052 Tyrion Lannister  Master of Coin (former)
+#> 6  1074 Victarion Greyjoy Lord Captain of the Iron Fleet
 #> # … with 47 more rows

You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
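A minimal sketch of that idea, assuming a made-up two-character dataset (`chars_toy` and every value in it are hypothetical, not the Game of Thrones data used in the chapter):

```r
library(dplyr)
library(tidyr)

# Hypothetical character data with two list-columns.
chars_toy <- tibble(
  id = c(1L, 2L),
  name = c("Character A", "Character B"),
  titles = list(c("Lord", "Captain"), "Master"),
  aliases = list("The First", c("The Second", "Two"))
)

# One flat table for the per-character fields...
characters_toy <- chars_toy |> select(id, name)

# ...and one long table per list-column, keyed by id.
titles_toy <- chars_toy |>
  select(id, titles) |>
  unnest_longer(titles) |>
  rename(title = titles)

aliases_toy <- chars_toy |>
  select(id, aliases) |>
  unnest_longer(aliases) |>
  rename(alias = aliases)

# Join back only the piece needed for a given question.
char_titles <- characters_toy |> inner_join(titles_toy, by = "id")
char_titles
```

Keeping each list-column in its own table avoids the row blow-up you would get from unnesting several list-columns of different lengths in the same data frame.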

@@ -855,15 +829,15 @@ Deeply nested unnest_wider(results) locations #> # A tibble: 7 × 6 -#> city address_components formatted_address geometry place_id types -#> <chr> <list> <chr> <list> <chr> <list> -#> 1 Houston <list [4]> Houston, TX, USA <named list> ChIJAYW… <list> -#> 2 Washington <list [2]> Washington, USA <named list> ChIJ-bD… <list> -#> 3 Washington <list [4]> Washington, DC, USA <named list> ChIJW-T… <list> -#> 4 New York <list [3]> New York, NY, USA <named list> ChIJOwg… <list> -#> 5 Chicago <list [4]> Chicago, IL, USA <named list> ChIJ7cv… <list> -#> 6 Arlington <list [4]> Arlington, TX, USA <named list> ChIJ05g… <list> -#> # … with 1 more row +#> city address_components formatted_address geometry place…¹ types +#> <chr> <list> <chr> <list> <chr> <list> +#> 1 Houston <list [4]> Houston, TX, USA <named list> ChIJAY… <list> +#> 2 Washington <list [2]> Washington, USA <named list> ChIJ-b… <list> +#> 3 Washington <list [4]> Washington, DC, … <named list> ChIJW-… <list> +#> 4 New York <list [3]> New York, NY, USA <named list> ChIJOw… <list> +#> 5 Chicago <list [4]> Chicago, IL, USA <named list> ChIJ7c… <list> +#> 6 Arlington <list [4]> Arlington, TX, U… <named list> ChIJ05… <list> +#> # … with 1 more row, and abbreviated variable name ¹​place_id

Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.

There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the geometry list-column:

@@ -872,14 +846,14 @@ locations select(city, formatted_address, geometry) |> unnest_wider(geometry) #> # A tibble: 7 × 6 -#> city formatted_address bounds location locati…¹ viewport -#> <chr> <chr> <list> <list> <chr> <list> -#> 1 Houston Houston, TX, USA <named list> <named list> APPROXI… <named list> -#> 2 Washington Washington, USA <named list> <named list> APPROXI… <named list> -#> 3 Washington Washington, DC, USA <named list> <named list> APPROXI… <named list> -#> 4 New York New York, NY, USA <named list> <named list> APPROXI… <named list> -#> 5 Chicago Chicago, IL, USA <named list> <named list> APPROXI… <named list> -#> 6 Arlington Arlington, TX, USA <named list> <named list> APPROXI… <named list> +#> city formatted_address bounds location locat…¹ viewport +#> <chr> <chr> <list> <list> <chr> <list> +#> 1 Houston Houston, TX, USA <named list> <named list> APPROX… <named list> +#> 2 Washington Washington, USA <named list> <named list> APPROX… <named list> +#> 3 Washington Washington, DC, … <named list> <named list> APPROX… <named list> +#> 4 New York New York, NY, USA <named list> <named list> APPROX… <named list> +#> 5 Chicago Chicago, IL, USA <named list> <named list> APPROX… <named list> +#> 6 Arlington Arlington, TX, U… <named list> <named list> APPROX… <named list> #> # … with 1 more row, and abbreviated variable name ¹​location_type

That gives us new bounds (a rectangular region) and location (a point). We can unnest location to see the latitude (lat) and longitude (lng):

@@ -889,14 +863,14 @@ locations unnest_wider(geometry) |> unnest_wider(location) #> # A tibble: 7 × 7 -#> city formatted_address bounds lat lng locati…¹ viewport -#> <chr> <chr> <list> <dbl> <dbl> <chr> <list> -#> 1 Houston Houston, TX, USA <named list> 29.8 -95.4 APPROXI… <named list> -#> 2 Washington Washington, USA <named list> 47.8 -121. APPROXI… <named list> -#> 3 Washington Washington, DC, USA <named list> 38.9 -77.0 APPROXI… <named list> -#> 4 New York New York, NY, USA <named list> 40.7 -74.0 APPROXI… <named list> -#> 5 Chicago Chicago, IL, USA <named list> 41.9 -87.6 APPROXI… <named list> -#> 6 Arlington Arlington, TX, USA <named list> 32.7 -97.1 APPROXI… <named list> +#> city formatted_address bounds lat lng locat…¹ viewport +#> <chr> <chr> <list> <dbl> <dbl> <chr> <list> +#> 1 Houston Houston, TX, USA <named list> 29.8 -95.4 APPROX… <named list> +#> 2 Washington Washington, USA <named list> 47.8 -121. APPROX… <named list> +#> 3 Washington Washington, DC, … <named list> 38.9 -77.0 APPROX… <named list> +#> 4 New York New York, NY, USA <named list> 40.7 -74.0 APPROX… <named list> +#> 5 Chicago Chicago, IL, USA <named list> 41.9 -87.6 APPROX… <named list> +#> 6 Arlington Arlington, TX, U… <named list> 32.7 -97.1 APPROX… <named list> #> # … with 1 more row, and abbreviated variable name ¹​location_type

Extracting the bounds requires a few more steps:

diff --git a/oreilly/regexps.html b/oreilly/regexps.html index 105694a..82021fd 100644 --- a/oreilly/regexps.html +++ b/oreilly/regexps.html @@ -1,13 +1,5 @@
-

Regular expressions

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Regular expressions

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

@@ -1006,8 +998,9 @@ Base R

apropos(pattern) searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:

apropos("replace")
-#> [1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
-#> [5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"
+#> [1] "%+replace%"       "replace"          "replace_na"
+#> [4] "setReplaceMethod" "str_replace"      "str_replace_all"
+#> [7] "str_replace_na"   "theme_replace"

list.files(path, pattern) lists all files in path that match a regular expression pattern. For example, you can find all the R Markdown files in the current directory with:
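As a self-contained sketch of this kind of call, the snippet below builds a throwaway directory first so it does not depend on what is in your working directory. The file names are invented for the example, and the pattern `\\.Rmd$` is one reasonable way to match the extension, not necessarily the one the chapter uses.

```r
# Set up a temporary directory with a few hypothetical files.
tmp <- file.path(tempdir(), "list-files-demo")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(
  file.path(tmp, c("report.Rmd", "notes.Rmd", "analysis.R", "Rmd-tips.txt"))
))

# "\\." matches a literal dot and "$" anchors the match to the end of the
# file name, so "Rmd-tips.txt" is not picked up.
rmd_files <- list.files(tmp, pattern = "\\.Rmd$")
rmd_files
```

Note that `pattern` is a regular expression, not a shell glob, so something like `"*.Rmd"` would not behave the way it does on the command line.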

diff --git a/oreilly/spreadsheets.html b/oreilly/spreadsheets.html index 80d6765..6711132 100644 --- a/oreilly/spreadsheets.html +++ b/oreilly/spreadsheets.html @@ -1,13 +1,5 @@
-

Spreadsheets

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Spreadsheets

::: status callout-important You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

@@ -197,16 +189,16 @@ Reading individual sheets
read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
 #> # A tibble: 52 × 8
-#>   species island    bill_length_mm     bill_depth_mm flipp…¹ body_…² sex    year
-#>   <chr>   <chr>     <chr>              <chr>         <chr>   <chr>   <chr> <dbl>
-#> 1 Adelie  Torgersen 39.1               18.7          181     3750    male   2007
-#> 2 Adelie  Torgersen 39.5               17.399999999… 186     3800    fema…  2007
-#> 3 Adelie  Torgersen 40.299999999999997 18            195     3250    fema…  2007
-#> 4 Adelie  Torgersen NA                 NA            NA      NA      NA     2007
-#> 5 Adelie  Torgersen 36.700000000000003 19.3          193     3450    fema…  2007
-#> 6 Adelie  Torgersen 39.299999999999997 20.6          190     3650    male   2007
-#> # … with 46 more rows, and abbreviated variable names ¹​flipper_length_mm,
-#> #   ²​body_mass_g
+#>   species island    bill_length_mm     bill_dep…¹ flipp…² body_…³ sex    year
+#>   <chr>   <chr>     <chr>              <chr>      <chr>   <chr>   <chr> <dbl>
+#> 1 Adelie  Torgersen 39.1               18.7       181     3750    male   2007
+#> 2 Adelie  Torgersen 39.5               17.399999… 186     3800    fema…  2007
+#> 3 Adelie  Torgersen 40.299999999999997 18         195     3250    fema…  2007
+#> 4 Adelie  Torgersen NA                 NA         NA      NA      NA     2007
+#> 5 Adelie  Torgersen 36.700000000000003 19.3       193     3450    fema…  2007
+#> 6 Adelie  Torgersen 39.299999999999997 20.6       190     3650    male   2007
+#> # … with 46 more rows, and abbreviated variable names ¹bill_depth_mm,
+#> #   ²flipper_length_mm, ³body_mass_g

Some variables that appear to contain numerical data are read in as characters due to the character string "NA" not being recognized as a true NA.

@@ -214,14 +206,14 @@ Reading individual sheets penguins_torgersen #> # A tibble: 52 × 8 -#> species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year -#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> -#> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 -#> 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 -#> 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 -#> 4 Adelie Torgersen NA NA NA NA <NA> 2007 -#> 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 -#> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 +#> species island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year +#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> +#> 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 +#> 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 +#> 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 +#> 4 Adelie Torgersen NA NA NA NA <NA> 2007 +#> 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 +#> 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007 #> # … with 46 more rows, and abbreviated variable names ¹​flipper_length_mm, #> # ²​body_mass_g
@@ -249,14 +241,14 @@ dim(penguins_dream)
penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
 penguins
 #> # A tibble: 344 × 8
-#>   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
-#>   <chr>   <chr>              <dbl>         <dbl>       <dbl>   <dbl> <chr> <dbl>
-#> 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
-#> 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
-#> 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
-#> 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
-#> 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
-#> 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
+#>   species island    bill_length_mm bill_depth_mm flippe…¹ body_…² sex    year
+#>   <chr>   <chr>              <dbl>         <dbl>    <dbl>   <dbl> <chr> <dbl>
+#> 1 Adelie  Torgersen           39.1          18.7      181    3750 male   2007
+#> 2 Adelie  Torgersen           39.5          17.4      186    3800 fema…  2007
+#> 3 Adelie  Torgersen           40.3          18        195    3250 fema…  2007
+#> 4 Adelie  Torgersen           NA            NA         NA      NA <NA>   2007
+#> 5 Adelie  Torgersen           36.7          19.3      193    3450 fema…  2007
+#> 6 Adelie  Torgersen           39.3          20.6      190    3650 male   2007
 #> # … with 338 more rows, and abbreviated variable names ¹​flipper_length_mm,
 #> #   ²​body_mass_g
@@ -287,14 +279,14 @@ deaths <- read_excel(deaths_path) #> • `` -> `...6` deaths #> # A tibble: 18 × 6 -#> `Lots of people` ...2 ...3 ...4 ...5 ...6 -#> <chr> <chr> <chr> <chr> <chr> <chr> -#> 1 simply cannot resist writing <NA> <NA> <NA> <NA> some not… -#> 2 at the top <NA> of their sp… -#> 3 or merging <NA> <NA> <NA> cells -#> 4 Name Profession Age Has kids Date of birth Date of … -#> 5 David Bowie musician 69 TRUE 17175 42379 -#> 6 Carrie Fisher actor 60 TRUE 20749 42731 +#> `Lots of people` ...2 ...3 ...4 ...5 ...6 +#> <chr> <chr> <chr> <chr> <chr> <chr> +#> 1 simply cannot resist writing <NA> <NA> <NA> <NA> some … +#> 2 at the top <NA> of their… +#> 3 or merging <NA> <NA> <NA> cells +#> 4 Name Profession Age Has kids Date of birth Date … +#> 5 David Bowie musician 69 TRUE 17175 42379 +#> 6 Carrie Fisher actor 60 TRUE 20749 42731 #> # … with 12 more rows

The top three rows and the bottom four rows are not part of the data frame.

@@ -302,29 +294,30 @@ deaths
read_excel(deaths_path, skip = 4)
 #> # A tibble: 14 × 6
-#>   Name          Profession Age   `Has kids` `Date of birth`     `Date of death`
-#>   <chr>         <chr>      <chr> <chr>      <dttm>              <chr>          
-#> 1 David Bowie   musician   69    TRUE       1947-01-08 00:00:00 42379          
-#> 2 Carrie Fisher actor      60    TRUE       1956-10-21 00:00:00 42731          
-#> 3 Chuck Berry   musician   90    TRUE       1926-10-18 00:00:00 42812          
-#> 4 Bill Paxton   actor      61    TRUE       1955-05-17 00:00:00 42791          
-#> 5 Prince        musician   57    TRUE       1958-06-07 00:00:00 42481          
-#> 6 Alan Rickman  actor      69    FALSE      1946-02-21 00:00:00 42383          
-#> # … with 8 more rows
+#>   Name          Profession Age   `Has kids` `Date of birth`     Date of dea…¹
+#>   <chr>         <chr>      <chr> <chr>      <dttm>              <chr>
+#> 1 David Bowie   musician   69    TRUE       1947-01-08 00:00:00 42379
+#> 2 Carrie Fisher actor      60    TRUE       1956-10-21 00:00:00 42731
+#> 3 Chuck Berry   musician   90    TRUE       1926-10-18 00:00:00 42812
+#> 4 Bill Paxton   actor      61    TRUE       1955-05-17 00:00:00 42791
+#> 5 Prince        musician   57    TRUE       1958-06-07 00:00:00 42481
+#> 6 Alan Rickman  actor      69    FALSE      1946-02-21 00:00:00 42383
+#> # … with 8 more rows, and abbreviated variable name ¹`Date of death`

We could also set n_max to omit the extraneous rows at the bottom.

read_excel(deaths_path, skip = 4, n_max = 10)
 #> # A tibble: 10 × 6
-#>   Name          Profession   Age Has k…¹ `Date of birth`     `Date of death`    
-#>   <chr>         <chr>      <dbl> <lgl>   <dttm>              <dttm>             
-#> 1 David Bowie   musician      69 TRUE    1947-01-08 00:00:00 2016-01-10 00:00:00
-#> 2 Carrie Fisher actor         60 TRUE    1956-10-21 00:00:00 2016-12-27 00:00:00
-#> 3 Chuck Berry   musician      90 TRUE    1926-10-18 00:00:00 2017-03-18 00:00:00
-#> 4 Bill Paxton   actor         61 TRUE    1955-05-17 00:00:00 2017-02-25 00:00:00
-#> 5 Prince        musician      57 TRUE    1958-06-07 00:00:00 2016-04-21 00:00:00
-#> 6 Alan Rickman  actor         69 FALSE   1946-02-21 00:00:00 2016-01-14 00:00:00
-#> # … with 4 more rows, and abbreviated variable name ¹​`Has kids`
+#>   Name          Profe…¹   Age Has k…² `Date of birth`     `Date of death`
+#>   <chr>         <chr>   <dbl> <lgl>   <dttm>              <dttm>
+#> 1 David Bowie   musici…    69 TRUE    1947-01-08 00:00:00 2016-01-10 00:00:00
+#> 2 Carrie Fisher actor      60 TRUE    1956-10-21 00:00:00 2016-12-27 00:00:00
+#> 3 Chuck Berry   musici…    90 TRUE    1926-10-18 00:00:00 2017-03-18 00:00:00
+#> 4 Bill Paxton   actor      61 TRUE    1955-05-17 00:00:00 2017-02-25 00:00:00
+#> 5 Prince        musici…    57 TRUE    1958-06-07 00:00:00 2016-04-21 00:00:00
+#> 6 Alan Rickman  actor      69 FALSE   1946-02-21 00:00:00 2016-01-14 00:00:00
+#> # … with 4 more rows, and abbreviated variable names ¹Profession,
+#> #   ²`Has kids`

Another approach is using cell ranges. In Excel, the top left cell is A1. As you move across columns to the right, the cell label moves down the alphabet, i.e. B1, C1, etc. And as you move down a column, the number in the cell label increases, i.e. A2, A3, etc.

The data we want to read in starts in cell A5 and ends in cell F15. In spreadsheet notation, this is A5:F15.
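A minimal sketch of reading that range, assuming `deaths_path` points at the `deaths.xlsx` example workbook that ships with readxl (the definition of `deaths_path` is not shown in this part of the chapter, so that assignment is an assumption):

```r
library(readxl)

# Assumption: the workbook bundled with readxl stands in for deaths_path.
deaths_path <- readxl_example("deaths.xlsx")

# Row 5 of the range supplies the column names; rows 6 to 15 are the data.
deaths <- read_excel(deaths_path, range = "A5:F15")
dim(deaths)
```

Compared with combining `skip` and `n_max`, a cell range states the rectangle in one place and also lets you drop unwanted columns; when `range` is supplied it takes precedence over `skip` and `n_max`.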

diff --git a/oreilly/strings.html b/oreilly/strings.html index 495b1d6..a2532ec 100644 --- a/oreilly/strings.html +++ b/oreilly/strings.html @@ -1,13 +1,5 @@
-

Strings

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- +

Strings

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Introduction

diff --git a/oreilly/webscraping.html b/oreilly/webscraping.html index 703ce73..1f86955 100644 --- a/oreilly/webscraping.html +++ b/oreilly/webscraping.html @@ -1,10 +1,2 @@
-

Web scraping

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz.

-
+

Web scraping

::: status callout-important You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz. :::

diff --git a/oreilly/workflow-basics.html b/oreilly/workflow-basics.html index 9ca3494..a226c7c 100644 --- a/oreilly/workflow-basics.html +++ b/oreilly/workflow-basics.html @@ -1,13 +1,5 @@
-

Workflow: basics

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

-

You now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.

Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.

+

Workflow: basics

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

You now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.

Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.

Coding basics

diff --git a/oreilly/workflow-help.html b/oreilly/workflow-help.html index 7014eb5..30ce576 100644 --- a/oreilly/workflow-help.html +++ b/oreilly/workflow-help.html @@ -1,13 +1,5 @@
-

Workflow: Getting help

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

-

This book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help, and to help you keep learning.

+

Workflow: Getting help

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

This book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help, and to help you keep learning.

Google is your friend

diff --git a/oreilly/workflow-pipes.html b/oreilly/workflow-pipes.html index 112a049..4cf3743 100644 --- a/oreilly/workflow-pipes.html +++ b/oreilly/workflow-pipes.html @@ -1,13 +1,5 @@
-

Workflow: Pipes

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proofreading. You can find the complete first edition at https://r4ds.had.co.nz.

-

The pipe, |>, is a powerful tool for clearly expressing a sequence of operations that transform an object. We briefly introduced pipes in the previous chapter, but before going much further, we want to give a few more details and discuss %>%, a predecessor to |>.

To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use |> instead of %>% as shown in #fig-pipe-options; more on %>% shortly.

+

Workflow: Pipes

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proofreading. You can find the complete first edition at https://r4ds.had.co.nz. :::

The pipe, |>, is a powerful tool for clearly expressing a sequence of operations that transform an object. We briefly introduced pipes in the previous chapter, but before going much further, we want to give a few more details and discuss %>%, a predecessor to |>.

To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use |> instead of %>% as shown in #fig-pipe-options; more on %>% shortly.

Screenshot showing the "Use native pipe operator" option which can be found on the "Editing" panel of the "Code" options.
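As a minimal sketch of the two pipes discussed above, the following compares |> with %>% on a small vector. The values and variable names are illustrative and not taken from the book; |> needs R 4.1 or later, and %>% is assumed here to come from the magrittr package, which must be installed separately.

```r
# Illustrative values only; not taken from the book.
x <- c(1, 4, 9)

# Native pipe (requires R >= 4.1): equivalent to sum(sqrt(x)).
native_total <- x |> sqrt() |> sum()

# magrittr pipe: only runs if the magrittr package is installed.
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  magrittr_total <- x %>% sqrt() %>% sum()
  # For a simple chain like this, both pipes give the same result.
  stopifnot(identical(native_total, magrittr_total))
}

native_total
```

Both forms read left to right as "take x, then take square roots, then sum", which is the sequence-of-operations style the pipe is meant to express.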

diff --git a/oreilly/workflow-scripts.html b/oreilly/workflow-scripts.html index 05f51a1..300ab99 100644 --- a/oreilly/workflow-scripts.html +++ b/oreilly/workflow-scripts.html @@ -1,15 +1,5 @@
-

Workflow: scripts and projects

-
- -
- -

-RStudio server -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- -

If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a fresh slate.

-

This chapter will introduce you to two very important tools for organizing your code: scripts and projects.

+

Workflow: scripts and projects

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

This chapter will introduce you to two very important tools for organizing your code: scripts and projects.

Scripts

@@ -126,9 +116,7 @@ What is the source of truth?

We collectively use this pattern hundreds of times a week.

RStudio server -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

- -

If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a fresh slate.

+

If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a fresh slate.

diff --git a/oreilly/workflow-style.html b/oreilly/workflow-style.html index 01e238e..6a27144 100644 --- a/oreilly/workflow-style.html +++ b/oreilly/workflow-style.html @@ -1,13 +1,5 @@
-

Workflow: code style

-
- -
- -
- -

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

-

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce you to the most important points of the tidyverse style guide, which is used throughout this book.

Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the styler package by Lorenz Walthert. Once you’ve installed it with install.packages("styler"), an easy way to use it is via RStudio’s command palette. The command palette lets you use any built-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. #fig-styler shows the results.

+

Workflow: code style

::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz. :::

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce you to the most important points of the tidyverse style guide, which is used throughout this book.

Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the styler package by Lorenz Walthert. Once you’ve installed it with install.packages("styler"), an easy way to use it is via RStudio’s command palette. The command palette lets you use any built-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. #fig-styler shows the results.

A screenshot showing the command palette after typing "styler", showing the four styling tools provided by the package.
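Besides the command palette, styler can also be called directly from code. The sketch below assumes styler is installed and uses its style_text() function on a deliberately messy line; the input string and variable names are made up for illustration.

```r
# Assumes styler has been installed with install.packages("styler").
if (requireNamespace("styler", quietly = TRUE)) {
  messy <- "total<-sum( 1,2 ,3 )"    # made-up example input
  tidy <- styler::style_text(messy)  # restyle using the tidyverse style guide
  print(tidy)
}
```

Calling the function on a string like this is a quick way to see what styler would change before applying it to a whole file.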