diff --git a/README.md b/README.md
index 6a01860..048ad0d 100644
--- a/README.md
+++ b/README.md
@@ -52,7 +52,8 @@ devtools::install_github("hadley/r4ds")
 To generate book for O'Reilly, build the book then:
 
 ```{r}
-devtools::load_all("../minibook/"); process_book()
+# pak::pak("hadley/htmlbook")
+htmlbook::convert_book()
 
 html <- list.files("oreilly", pattern = "[.]html$", full.names = TRUE)
 file.copy(html, "../r-for-data-science-2e/", overwrite = TRUE)
@@ -63,6 +64,8 @@ fs::dir_create(unique(dirname(dest)))
 file.copy(pngs, dest, overwrite = TRUE)
 ```
 
+Then commit and push to atlas.
+
 ## Code of Conduct
 
 Please note that r4ds uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html).
diff --git a/_common.R b/_common.R
index e7af1c7..e935cfe 100644
--- a/_common.R
+++ b/_common.R
@@ -16,8 +16,9 @@ options(
   pillar.max_footer_lines = 2,
   pillar.min_chars = 15,
   stringr.view_n = 6,
-  # Activate crayon output - temporarily disabled for quarto
-  # crayon.enabled = TRUE,
+  # Temporarily deactivate cli output for quarto
+  cli.num_colors = 0,
+  cli.hyperlink = FALSE,
   pillar.bold = TRUE,
   width = 77 # 80 - 3 for #> comment
 )
diff --git a/base-R.qmd b/base-R.qmd
index 3371376..d93e098 100644
--- a/base-R.qmd
+++ b/base-R.qmd
@@ -210,7 +210,7 @@ This function was the inspiration for much of dplyr's syntax.
 2. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`? Read the documentation for `which()` and do some experiments to figure it out.
 
-## Selecting a single element `$` and `[[` {#sec-subset-one}
+## Selecting a single element with `$` and `[[` {#sec-subset-one}
 
 `[`, which selects many elements, is paired with `[[` and `$`, which extract a single element. In this section, we'll show you how to use `[[` and `$` to pull columns out of data frames, discuss a couple more differences between `data.frames` and tibbles, and emphasize some important differences between `[` and `[[` when used with lists.
diff --git a/intro.qmd b/intro.qmd
index 8aaafbf..8e39028 100644
--- a/intro.qmd
+++ b/intro.qmd
@@ -365,7 +365,7 @@ knitr::kable(df, format = "markdown")
 ```
 
 ```{r}
-#| eval: false
+#| include: false
 cli:::ruler()
 ```
diff --git a/oreilly/EDA.html b/oreilly/EDA.html
index 8e762d6..a55cd74 100644
--- a/oreilly/EDA.html
+++ b/oreilly/EDA.html
@@ -1,6 +1,6 @@

Exploratory data analysis


Introduction

This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:

@@ -10,7 +10,7 @@ Introduction

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.

EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.


Prerequisites

In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.

@@ -137,7 +137,7 @@ unusual

It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.


Exercises

  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

  2. @@ -198,7 +198,7 @@ Unusual values

    However this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.


    Exercises

    1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?

    2. @@ -217,9 +217,7 @@ A categorical and a numerical variable

      For example, let’s explore how the price of a diamond varies with its quality (measured by cut) using geom_freqpoly():

      ggplot(diamonds, aes(x = price)) + 
      -  geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)
      -#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
      -#> ℹ Please use `linewidth` instead.
      + geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)

      A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500.

      @@ -235,7 +233,7 @@ A categorical and a numerical variable

      To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the density, which is the count standardized so that the area under each frequency polygon is one.

      ggplot(diamonds, aes(x = price, y = after_stat(density))) + 
      -  geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)
      + geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)

      A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others.

      @@ -279,7 +277,7 @@ A categorical and a numerical variable

      Exercises

      1. Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.

      2. @@ -291,7 +289,7 @@ Exercises

      Two categorical variables

      To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in geom_count():
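A minimal sketch of that call, using the diamonds dataset that ships with ggplot2 (this mirrors the approach the paragraph describes; the exact variables are chosen for illustration):

```r
library(ggplot2)

# geom_count() draws a point for each combination of levels,
# sized by how many observations fall in it
ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()
```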

      @@ -330,7 +328,7 @@ Two categorical variables

      If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.


      Exercises

      1. How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?

      2. @@ -340,7 +338,7 @@ Exercises

      Two numerical variables

      You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with geom_point(). You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.

@@ -390,7 +388,7 @@ ggplot(smaller, aes(x = carat, y = price)) +

      Exercises

      1. Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs. cut_number()? How does that impact a visualization of the 2d distribution of carat and price?

      2. @@ -464,7 +462,7 @@ ggplot(diamonds_aug, aes(x = carat, y = .resid)) +

        We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.


      Summary

      In this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen technique that work with a single variable at a time and with a pair of variables. This might seem painful restrictive if you have tens or hundreds of variables in your data, but they’re foundation upon which all other techniques are built.

diff --git a/oreilly/arrow.html b/oreilly/arrow.html
index 602aea7..539a817 100644
--- a/oreilly/arrow.html
+++ b/oreilly/arrow.html
@@ -1,13 +1,13 @@

      Arrow


      Introduction

      CSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you’ll learn about a powerful alternative: the parquet format, an open standards-based format widely used by big data systems.

      We’ll pair parquet files with Apache Arrow, a multi-language toolbox designed for efficient analysis and transport of large data sets. We’ll use Apache Arrow via the the arrow package, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you’ll see some examples later in the chapter.

Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you, as when the data is already in a database or in parquet files, and you'll want to work with it as is. But if you're starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it's hard to know what will work best, so in the early stages of your analysis we'd encourage you to try both and pick the one that works best for you.


      Prerequisites

      In this chapter, we’ll continue to use the tidyverse, particularly dplyr, but we’ll pair it with the arrow package which is designed specifically for working with large data.

      @@ -272,7 +272,7 @@ Using dbplyr with arrow

      Summary

      In this chapter, you’ve been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. It can work with CSV files, its much much faster if you convert your data to parquet. Parquet is a binary data format that’s designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but it’s partitioned, compressed, and columnar structure makes it much more efficient to analyze.

diff --git a/oreilly/base-R.html b/oreilly/base-R.html
index c4dc744..5ec88b9 100644
--- a/oreilly/base-R.html
+++ b/oreilly/base-R.html
@@ -1,6 +1,6 @@

      A field guide to base R

      To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.

      This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a lot of base R functions: from library() to load packages, to sum() and mean() for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like +, -, /, *, |, &, and !. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.

      After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!

      In this chapter, we’ll focus on four big topics: subsetting with [, subsetting with [[ and $, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.


      Prerequisites

      @@ -10,7 +10,7 @@ Prerequisites

-Selecting multiple elements with[
+Selecting multiple elements with [

      [ is used to extract sub-components from vectors and data frames, and is called like x[i] or x[i, j]. In this section, we’ll introduce you to the power of [, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of [.
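In concrete terms, a small sketch of the vector case:

```r
x <- c(10, 20, 30, 40)
x[c(1, 3)]  # positive integers select by position: 10 30
x[-1]       # negative integers drop elements: 20 30 40
x[x > 25]   # logical vectors keep elements where TRUE: 30 40
```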

      @@ -188,7 +188,7 @@ df |> subset(x > 1, c(y, z))

      This function was the inspiration for much of dplyr’s syntax.


      Exercises

      1. @@ -203,7 +203,7 @@ Exercises

-Selecting a single element$ and [[
+Selecting a single element with $ and [[

        [, which selects many elements, is paired with [[ and $, which extract a single element. In this section, we’ll show you how to use [[ and $ to pull columns out of data frames, discuss a couple more differences between data.frames and tibbles, and emphasize some important differences between [ and [[ when used with lists.
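As a quick sketch of the difference, using a small hypothetical data frame:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df$x       # extracts the column as a vector
df[["y"]]  # the same idea, but the name can be computed
df["x"]    # single [ returns a one-column data frame instead
```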

        @@ -284,7 +284,7 @@ tb$z

        For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.


        Lists

[[ and $ are also really important for working with lists, and it's important to understand how they differ from [. Let's illustrate the differences with a list named l:
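A sketch of those differences with a small hypothetical list:

```r
l <- list(a = 1:3, b = "a string", c = pi)
l["a"]    # [ returns a smaller list
l[["a"]]  # [[ extracts the component itself
l$a       # $ is shorthand for [[ with a literal name
```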

        @@ -372,7 +372,7 @@ df[["x"]]

      Exercises

      1. What happens when you use [[ with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?

      2. @@ -515,7 +515,7 @@ plot(diamonds$carat, diamonds$price)

        Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using $ or some other technique.


      Summary

      In this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.

diff --git a/oreilly/communication.html b/oreilly/communication.html
index fbb68a2..5914978 100644
--- a/oreilly/communication.html
+++ b/oreilly/communication.html
@@ -1,28 +1,18 @@

      Communication


      Introduction

      In #chp-EDA, you learned how to use plots as tools for exploration. When you make exploratory plots, you know—even before looking—which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then move on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.

      Now that you understand your data, you need to communicate your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that ggplot2 provides to do so.

This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like The Truthful Art, by Alberto Cairo. It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.


      Prerequisites

      In this chapter, we’ll focus once again on ggplot2. We’ll also use a little dplyr for data manipulation, scales to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including ggrepel (https://ggrepel.slowkow.com) by Kamil Slowikowski and patchwork (https://patchwork.data-imaginist.com) by Thomas Lin Pedersen. Don’t forget that you’ll need to install those packages with install.packages() if you don’t already have them.

      library(tidyverse)
      -#> ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
      -#> ✔ dplyr     1.0.99.9000     ✔ readr     2.1.3      
      -#> ✔ forcats   0.5.2.9000      ✔ stringr   1.5.0.9000 
      -#> ✔ ggplot2   3.4.0.9000      ✔ tibble    3.1.8      
      -#> ✔ lubridate 1.9.0           ✔ tidyr     1.2.1.9001 
      -#> ✔ purrr     1.0.1           
      -#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
      -#> ✖ dplyr::filter() masks stats::filter()
      -#> ✖ dplyr::lag()    masks stats::lag()
      -#> ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
       library(ggrepel)
       library(patchwork)
@@ -91,7 +81,7 @@ ggplot(df, aes(x, y)) +

      Exercises

      1. Create one plot on the fuel economy data with customized title, subtitle, caption, x, y, and color labels.

      2. @@ -280,12 +270,12 @@ ggplot(mpg, aes(x = displ, y = hwy)) + #> decreasing fuel economy.

        Remember, in addition to geom_text(), you have many other geoms in ggplot2 available to help annotate your plot. A couple ideas:

        -
        • Use geom_hline() and geom_vline() to add reference lines. We often make them thick (size = 2) and white (color = white), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.

        • +
          • Use geom_hline() and geom_vline() to add reference lines. We often make them thick (linewidth = 2) and white (color = white), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.

          • Use geom_rect() to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics xmin, xmax, ymin, ymax.

          • Use geom_segment() with the arrow argument to draw attention to a point with an arrow. Use aesthetics x and y to define the starting location, and xend and yend to define the end location.

          The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!
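A hedged sketch combining a couple of these annotation layers on the mpg data (positions chosen arbitrarily for illustration):

```r
library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
  # reference line drawn first, so it sits underneath the points
  geom_vline(xintercept = 5, linewidth = 2, color = "white") +
  geom_point() +
  # shaded rectangle drawing attention to a region of interest
  annotate("rect", xmin = 5.8, xmax = 7.2, ymin = 12, ymax = 20, alpha = 0.2)
```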


          Exercises

          1. Use geom_text() with infinite positions to place text at the four corners of the plot.

          2. @@ -603,7 +593,7 @@ mpg |> -

            You can also set the limits on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want expand the limits, for example, to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.

            +

            You can also set the limits on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want to expand the limits, for example, to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.

            suv <- mpg |> filter(class == "suv")
             compact <- mpg |> filter(class == "compact")
            @@ -655,7 +645,7 @@ ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
             

In this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.


          Exercises

          1. @@ -740,7 +730,7 @@ Themes

            For an overview of all theme() components, see help with ?theme. The ggplot2 book is also a great place to go for the full details on theming.


            Exercises

            1. Pick a theme offered by the ggthemes package and apply it to the last plot you made.
2. @@ -808,14 +798,14 @@ p5 <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
     guides = "collect",
     heights = c(1, 3, 2, 4)
   ) &
-  theme(legend.position = "bottom")
+  theme(legend.position = "top")

              Five plots laid out such that first two plots are next to each other. Plots three and four are underneath them. And the fifth plot stretches under them. The patchworked plot is titled "City and highway mileage for cars with different drive trains" and captioned "Source: Source: https://fueleconomy.gov". The first two plots are side-by-side box plots. Plots 3 and 4 are density plots. And the fifth plot is a faceted scatterplot. Each of these plots show geoms colored by drive train, but the patchworked plot has only one legend that applies to all of them, above the plots and beneath the title.

              If you’d like to learn more about combining and layout out multiple plots with patchwork, we recommend looking through the guides on the package website: https://patchwork.data-imaginist.com.


              Exercises

              1. @@ -848,7 +838,7 @@ p3 <- ggplot(mpg, aes(x = cty, y = hwy)) +

              Summary

              In this chapter you’ve learned about adding plot labels such as title, subtitle, caption as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customizing the axis scales, and changing the theme of your plot. You’ve also learned about combining multiple plots in a single graph using both simple and complex plot layouts.

diff --git a/oreilly/data-import.html b/oreilly/data-import.html
index 9e78a80..552599b 100644
--- a/oreilly/data-import.html
+++ b/oreilly/data-import.html
@@ -1,12 +1,12 @@

              Data import


              Introduction

              Working with data provided by R packages is a great way to learn data science tools, but you want to apply what you’ve learned to your own data at some point. In this chapter, you’ll learn the basics of reading data files into R.

              Specifically, this chapter will focus on reading plain-text rectangular files. We’ll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you’ll learn how to handcraft data frames in R.


              Prerequisites

              In this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse.

              @@ -257,7 +257,7 @@ Other file types
            3. read_log() reads Apache-style log files.


      Exercises

      1. What function would you use to read a file where fields were separated with “|”?

      2. @@ -372,9 +372,9 @@ Missing values, column types, and problems
        problems(df)
         #> # A tibble: 1 × 5
        -#>     row   col expected actual file                                     
        -#>   <int> <int> <chr>    <chr>  <chr>                                    
        -#> 1     3     1 a double .      /private/tmp/Rtmp1nE0XP/file11b88112257a4
+#>     row   col expected actual file
+#>   <int> <int> <chr>    <chr>  <chr>
+#> 1     3     1 a double .      /private/tmp/Rtmpx37bAU/filec1bb57d587a7

        This tells us that there was a problem in row 3, col 1 where readr expected a double but got a .. That suggests this dataset uses . for missing values. So then we set na = ".", the automatic guessing succeeds, giving us the numeric column that we want:
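A sketch of that fix, with simple_csv standing in for the contents of the problematic file (read_csv() reads literal data when the string contains a newline):

```r
library(readr)

# "." marks a missing value in this data
simple_csv <- "x
10
.
20
30"
read_csv(simple_csv, na = ".")
```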

        @@ -584,7 +584,7 @@ Data entry

        We’ll use tibble() and tribble() later in the book to construct small examples to demonstrate how various functions work.


      Summary

      In this chapter, you’ve learned how to load CSV files with read_csv() and to do your own data entry with tibble() and tribble(). You’ve learned how csv files work, some of the problems you might encounter, and how to overcome them. We’ll come to data import a few times in this book: #chp-spreadsheets from Excel and googlesheets, #chp-databases will show you how to load data from databases, #chp-arrow from parquet files, #chp-rectangling from JSON, and #chp-webscraping from websites.

diff --git a/oreilly/data-tidy.html b/oreilly/data-tidy.html
index a785b44..60df1b3 100644
--- a/oreilly/data-tidy.html
+++ b/oreilly/data-tidy.html
@@ -1,6 +1,6 @@

      Data tidying


      Introduction

      @@ -14,7 +14,7 @@ Introduction

      In this chapter, you will learn a consistent way to organize your data in R using a system called tidy data. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.

      In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values. We’ll finish with a discussion of usefully untidy data and how you can create it if needed.


      Prerequisites

      In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.

      @@ -35,7 +35,7 @@ Tidy data
      table1
       #> # A tibble: 6 × 4
       #>   country      year  cases population
      -#>   <chr>       <int>  <int>      <int>
      +#>   <chr>       <dbl>  <dbl>      <dbl>
       #> 1 Afghanistan  1999    745   19987071
       #> 2 Afghanistan  2000   2666   20595360
       #> 3 Brazil       1999  37737  172006362
      @@ -45,7 +45,7 @@ Tidy data
       table2
       #> # A tibble: 12 × 4
       #>   country      year type           count
      -#>   <chr>       <int> <chr>          <int>
      +#>   <chr>       <dbl> <chr>          <dbl>
       #> 1 Afghanistan  1999 cases            745
       #> 2 Afghanistan  1999 population  19987071
       #> 3 Afghanistan  2000 cases           2666
      @@ -56,7 +56,7 @@ table2
       table3
       #> # A tibble: 6 × 3
       #>   country      year rate             
      -#> * <chr>       <int> <chr>            
      +#>   <chr>       <dbl> <chr>            
       #> 1 Afghanistan  1999 745/19987071     
       #> 2 Afghanistan  2000 2666/20595360    
       #> 3 Brazil       1999 37737/172006362  
      @@ -68,14 +68,14 @@ table3
       table4a # cases
       #> # A tibble: 3 × 3
       #>   country     `1999` `2000`
      -#> * <chr>        <int>  <int>
      +#>   <chr>        <dbl>  <dbl>
       #> 1 Afghanistan    745   2666
       #> 2 Brazil       37737  80488
       #> 3 China       212258 213766
       table4b # population
       #> # A tibble: 3 × 3
       #>   country         `1999`     `2000`
      -#> * <chr>            <int>      <int>
      +#>   <chr>            <dbl>      <dbl>
       #> 1 Afghanistan   19987071   20595360
       #> 2 Brazil       172006362  174504898
       #> 3 China       1272915272 1280428583
@@ -106,7 +106,7 @@ table1 |>
 )
 #> # A tibble: 6 × 5
 #>   country      year  cases population  rate
-#>   <chr>       <int>  <int>      <int> <dbl>
+#>   <chr>       <dbl>  <dbl>      <dbl> <dbl>
 #> 1 Afghanistan  1999    745   19987071 0.373
 #> 2 Afghanistan  2000   2666   20595360 1.29
 #> 3 Brazil       1999  37737  172006362 2.19
@@ -119,7 +119,7 @@ table1 |>
   count(year, wt = cases)
 #> # A tibble: 2 × 2
 #>    year      n
-#>   <int>  <int>
+#>   <dbl>  <dbl>
 #> 1  1999 250740
 #> 2  2000 296920
@@ -133,7 +133,7 @@ ggplot(table1, aes(x = year, y = cases)) +

      Exercises

      1. Using prose, describe how the variables and observations are organised in each of the sample tables.

      2. @@ -166,21 +166,16 @@ Data in column names
        billboard
         #> # A tibble: 317 × 79
        -#>   artist   track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
        -#>   <chr>    <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
        -#> 1 2 Pac    Baby… 2000-02-26      87    82    72    77    87    94    99    NA
        -#> 2 2Ge+her  The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
        -#> 3 3 Doors… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
        -#> 4 3 Doors… Loser 2000-10-21      76    76    72    69    67    65    55    59
        -#> 5 504 Boyz Wobb… 2000-04-15      57    34    25    17    17    31    36    49
        -#> 6 98^0     Give… 2000-08-19      51    39    34    26    26    19     2     2
        -#> # … with 311 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
        -#> #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
        -#> #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
        -#> #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
        -#> #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
        -#> #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
        -#> #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, …
+#>   artist       track               date.entered   wk1   wk2   wk3   wk4   wk5
+#>   <chr>        <chr>               <date>       <dbl> <dbl> <dbl> <dbl> <dbl>
+#> 1 2 Pac        Baby Don't Cry (Ke… 2000-02-26      87    82    72    77    87
+#> 2 2Ge+her      The Hardest Part O… 2000-09-02      91    87    92    NA    NA
+#> 3 3 Doors Down Kryptonite          2000-04-08      81    70    68    67    66
+#> 4 3 Doors Down Loser               2000-10-21      76    76    72    69    67
+#> 5 504 Boyz     Wobble Wobble       2000-04-15      57    34    25    17    17
+#> 6 98^0         Give Me Just One N… 2000-08-19      51    39    34    26    26
+#> # … with 311 more rows, and 71 more variables: wk6 <dbl>, wk7 <dbl>,
+#> #   wk8 <dbl>, wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, …

        In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week. Here, the column names are one variable (the week) and the cell values are another (the rank).

        To tidy this data, we’ll use pivot_longer(). After the data, there are three key arguments:

        @@ -339,21 +334,16 @@ Many variables in column names
        who2
         #> # A tibble: 7,240 × 58
        -#>   country     year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
        -#>   <chr>      <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
        -#> 1 Afghanist…  1980       NA        NA        NA        NA        NA        NA
        -#> 2 Afghanist…  1981       NA        NA        NA        NA        NA        NA
        -#> 3 Afghanist…  1982       NA        NA        NA        NA        NA        NA
        -#> 4 Afghanist…  1983       NA        NA        NA        NA        NA        NA
        -#> 5 Afghanist…  1984       NA        NA        NA        NA        NA        NA
        -#> 6 Afghanist…  1985       NA        NA        NA        NA        NA        NA
        -#> # … with 7,234 more rows, and 50 more variables: sp_m_65 <dbl>,
        -#> #   sp_f_014 <dbl>, sp_f_1524 <dbl>, sp_f_2534 <dbl>, sp_f_3544 <dbl>,
        -#> #   sp_f_4554 <dbl>, sp_f_5564 <dbl>, sp_f_65 <dbl>, sn_m_014 <dbl>,
        -#> #   sn_m_1524 <dbl>, sn_m_2534 <dbl>, sn_m_3544 <dbl>, sn_m_4554 <dbl>,
        -#> #   sn_m_5564 <dbl>, sn_m_65 <dbl>, sn_f_014 <dbl>, sn_f_1524 <dbl>,
        -#> #   sn_f_2534 <dbl>, sn_f_3544 <dbl>, sn_f_4554 <dbl>, sn_f_5564 <dbl>,
        -#> #   sn_f_65 <dbl>, ep_m_014 <dbl>, ep_m_1524 <dbl>, ep_m_2534 <dbl>, …
+#>   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554
+#>   <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
+#> 1 Afghanistan  1980       NA        NA        NA        NA        NA
+#> 2 Afghanistan  1981       NA        NA        NA        NA        NA
+#> 3 Afghanistan  1982       NA        NA        NA        NA        NA
+#> 4 Afghanistan  1983       NA        NA        NA        NA        NA
+#> 5 Afghanistan  1984       NA        NA        NA        NA        NA
+#> 6 Afghanistan  1985       NA        NA        NA        NA        NA
+#> # … with 7,234 more rows, and 51 more variables: sp_m_5564 <dbl>,
+#> #   sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>, sp_f_2534 <dbl>, …

This dataset records information about tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: country and year. They are followed by 56 columns like sp_m_014, ep_m_4554, and rel_m_3544. If you stare at these columns for long enough, you'll notice there's a pattern. Each column name is made up of three pieces separated by _. The first piece, sp/rel/ep, describes the method used for the diagnosis, the second piece, m/f, is the gender, and the third piece, 014/1524/2534/3544/4554/5564/65, is the age range.

So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to pivot_longer(): names_to gets a vector of column names and names_sep describes how to split the variable name up into pieces:
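A sketch of that call (who2 ships with tidyr):

```r
who2 |>
  pivot_longer(
    cols = !(country:year),                    # everything except country and year
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",                           # split the column names at each _
    values_to = "count"
  )
```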

@@ -479,16 +469,16 @@ Widening data
     values_from = prf_rate
   )
 #> # A tibble: 500 × 9
-#>   org_pac_id org_nm         measure_title CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3
-#>   <chr>      <chr>          <chr>               <dbl>       <dbl>       <dbl>
-#> 1 0446157747 USC CARE MEDI… CAHPS for MI…          63          NA          NA
-#> 2 0446157747 USC CARE MEDI… CAHPS for MI…          NA          87          NA
-#> 3 0446157747 USC CARE MEDI… CAHPS for MI…          NA          NA          86
-#> 4 0446157747 USC CARE MEDI… CAHPS for MI…          NA          NA          NA
-#> 5 0446157747 USC CARE MEDI… CAHPS for MI…          NA          NA          NA
-#> 6 0446157747 USC CARE MEDI… CAHPS for MI…          NA          NA          NA
-#> # … with 494 more rows, and 3 more variables: CAHPS_GRP_5 <dbl>,
-#> #   CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl>
+#>   org_pac_id org_nm                   measure_title   CAHPS_GRP_1 CAHPS_GRP_2
+#>   <chr>      <chr>                    <chr>                 <dbl>       <dbl>
+#> 1 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          63          NA
+#> 2 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          87
+#> 3 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
+#> 4 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
+#> 5 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
+#> 6 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
+#> # … with 494 more rows, and 4 more variables: CAHPS_GRP_3 <dbl>,
+#> #   CAHPS_GRP_5 <dbl>, CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl>

        The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, pivot_wider() will attempt to preserve all the existing columns including measure_title which has six distinct observations for each organisations. To fix this problem we need to tell pivot_wider() which columns identify each row; in this case those are the variables starting with "org":

        @@ -515,7 +505,7 @@ Widening data

        -How doespivot_wider() work?

        +How does pivot_wider() work?

        To understand how pivot_wider() works, let’s again start with a very simple dataset:

        df <- tribble(
        @@ -849,7 +839,7 @@ Pragmatic computation
         

      Summary

In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it's a consistent structure understood by most functions: the main challenge is getting your data from whatever structure you receive it in into a tidy format. To that end, you learned about pivot_longer() and pivot_wider(), which allow you to tidy up many untidy datasets. Of course, tidy data can't solve every problem, so we also showed you some places where you might want to deliberately untidy your data in order to present it to humans, feed it into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the Tidy Data paper published in the Journal of Statistical Software.

diff --git a/oreilly/data-transform.html b/oreilly/data-transform.html
index b7067ad..ab0fda8 100644
--- a/oreilly/data-transform.html
+++ b/oreilly/data-transform.html
@@ -1,12 +1,12 @@

      Data transformation


      Introduction

      Visualisation is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need for it. Often you’ll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights that departed New York City in 2013.

      The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll start with functions that operate on rows and then columns of a data frame. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action and we’ll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).


      Prerequisites

      In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.

@@ -15,14 +15,14 @@ Prerequisites
 library(tidyverse)
 #> ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
 #> ✔ dplyr     1.0.99.9000     ✔ readr     2.1.3
-#> ✔ forcats   0.5.2.9000      ✔ stringr   1.5.0.9000
+#> ✔ forcats   0.5.2           ✔ stringr   1.5.0
 #> ✔ ggplot2   3.4.0.9000      ✔ tibble    3.1.8
-#> ✔ lubridate 1.9.0           ✔ tidyr     1.2.1.9001
+#> ✔ lubridate 1.9.0           ✔ tidyr     1.3.0
 #> ✔ purrr     1.0.1
 #> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
 #> ✖ dplyr::filter() masks stats::filter()
 #> ✖ dplyr::lag()    masks stats::lag()
-#> ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
+#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

      Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag(). So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which function a package comes from, we’ll use the same syntax as R: packagename::functionname().
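For example, both of these run regardless of which package was loaded last:

```r
stats::filter(presidents, rep(1, 3) / 3)  # moving-average filter from base stats
dplyr::filter(mtcars, cyl == 6)           # row filtering from dplyr
```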

@@ -43,9 +43,7 @@ nycflights13
 #> 5  2013     1     1      554            600        -6      812            837
 #> 6  2013     1     1      554            558        -4      740            728
 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

      If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably View(flights), which will open an interactive scrollable and filterable view. Otherwise you can use print(flights, width = Inf) to show all columns, or use call glimpse():

      @@ -103,7 +101,7 @@ Rows

-filter()
+filter()

filter() allows you to keep rows based on the values of the columns. (Later, you'll learn about the slice_*() family, which allows you to choose rows based on their positions.) The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:

@@ -119,9 +117,7 @@ Rows
 #> 5  2013     1     1     1505           1310       115     1638           1431
 #> 6  2013     1     1     1525           1340       105     1831           1626
 #> # … with 10,028 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

      As well as > (greater than), you can use >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to), and != (not equal to). You can also use & (and) or | (or) to combine multiple conditions:

@@ -138,9 +134,7 @@ flights |>
 #> 5  2013     1     1      554            600        -6      812            837
 #> 6  2013     1     1      554            558        -4      740            728
 #> # … with 836 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …
 
 # Flights that departed in January or February
 flights |> 
@@ -155,9 +149,7 @@ flights |>
 #> 5  2013     1     1      554            600        -6      812            837
 #> 6  2013     1     1      554            558        -4      740            728
 #> # … with 51,949 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

      There’s a useful shortcut when you’re combining | and ==: %in%. It keeps rows where the variable equals one of the values on the right:

@@ -174,9 +166,7 @@ flights |>
 #> 5  2013     1     1      554            600        -6      812            837
 #> 6  2013     1     1      554            558        -4      740            728
 #> # … with 51,949 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

      We’ll come back to these comparisons and logical operators in more detail in #chp-logicals.

      When you run filter() dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing flights dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <-:

      @@ -208,7 +198,7 @@ Common mistakes

-arrange()
+arrange()

      arrange() changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.

@@ -224,9 +214,7 @@ Common mistakes
 #> 5  2013     1     1      554            600        -6      812            837
 #> 6  2013     1     1      554            558        -4      740            728
 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

      You can use desc() to re-order by a column in descending order. For example, this code shows the most delayed flights:

@@ -242,9 +230,7 @@ Common mistakes
 #> 5  2013     7    22      845           1600      1005     1044           1815
 #> 6  2013     4    10     1100           1900       960     1342           2211
 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

      You can combine arrange() and filter() to solve more complex problems. For example, we could look for the flights that were most delayed on arrival that left on roughly on time:

@@ -261,17 +247,15 @@ Common mistakes
 #> 5  2013     9    19      648            641         7     1035            810
 #> 6  2013     4    18      655            700        -5     1213            950
 #> # … with 239,103 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

-distinct()
+distinct()

      -

      distinct() finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want to the distinct combination of some variables, so you can also optionally supply column names:

      +

      distinct() finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:

      # This would remove any duplicate rows if there were any
       flights |> 
      @@ -286,9 +270,7 @@ flights |>
       #> 5  2013     1     1      554            600        -6      812            837
       #> 6  2013     1     1      554            558        -4      740            728
       #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
      -#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
      -#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
      -#> #   time_hour <dttm>
      +#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …
       
       # This finds all unique origin and destination pairs.
       flights |> 
      @@ -307,7 +289,7 @@ flights |>
       

      Note that if you want to find the number of duplicates, or rows that weren’t duplicated, you’re better off swapping distinct() for count() and then filtering as needed.
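A sketch of that idiom:

```r
flights |>
  count(origin, dest, sort = TRUE) |>  # one row per combination, with its frequency
  filter(n > 1)                        # keep only the combinations that repeat
```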


      Exercises

      1. @@ -334,7 +316,7 @@ Columns

-mutate()
+mutate()

        The job of mutate() is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:

@@ -353,9 +335,7 @@ Columns
 #> 5  2013     1     1      554            600        -6      812            837
 #> 6  2013     1     1      554            558        -4      740            728
 #> # … with 336,770 more rows, and 13 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>, gain <dbl>, speed <dbl>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

By default, mutate() adds new columns on the right hand side of your dataset, which makes it difficult to see what's happening here. We can use the .before argument to instead add the variables to the left hand side. (Remember that in RStudio, the easiest way to see a dataset with many columns is View().)

@@ -375,9 +355,7 @@ Columns
 #> 5    19 394.   2013     1     1      554            600        -6      812
 #> 6   -16 288.   2013     1     1      554            558        -4      740
 #> # … with 336,770 more rows, and 12 more variables: sched_arr_time <int>,
-#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
-#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
-#> #   minute <dbl>, time_hour <dttm>
+#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, …

        The . is a sign that .before is an argument to the function, not the name of a new variable. You can also use .after to add after a variable, and in both .before and .after you can use the variable name instead of a position. For example, we could add the new variables after day:

@@ -397,14 +375,12 @@ Columns
 #> 5  2013     1     1    19 394.    554            600        -6      812
 #> 6  2013     1     1   -16 288.    554            558        -4      740
 #> # … with 336,770 more rows, and 12 more variables: sched_arr_time <int>,
-#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
-#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
-#> #   minute <dbl>, time_hour <dttm>
+#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, …

        Alternatively, you can control which variables are kept with the .keep argument. A particularly useful argument is "used" which allows you to see the inputs and outputs from your calculations:

        flights |> 
        -  mutate(,
        +  mutate(
             gain = dep_delay - arr_delay,
             hours = air_time / 60,
             gain_per_hour = gain / hours,
        @@ -425,7 +401,7 @@ Columns
         
         

-select()
+select()

        It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. select() is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:

@@ -470,8 +446,7 @@ flights |>
 #> 5      554            600        -6      812            837       -25 DL
 #> 6      554            558        -4      740            728        12 UA
 #> # … with 336,770 more rows, and 9 more variables: flight <int>,
-#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
-#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
+#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, …
 
 # Select all columns that are characters
 flights |> 
@@ -516,7 +491,7 @@ flights |>

-rename()
+rename()

        If you just want to keep all the existing variables and just want to rename a few, you can use rename() instead of select():

@@ -532,9 +507,7 @@ flights |>
 #> 5  2013     1     1      554            600        -6      812            837
 #> 6  2013     1     1      554            558        -4      740            728
 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
        +#> # carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>, …

        It works exactly the same way as select(), but keeps all the variables that aren’t explicitly selected.

        If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out janitor::clean_names() which provides some useful automated cleaning.

        @@ -542,7 +515,7 @@ flights |>

-relocate()
+relocate()

        Use relocate() to move variables around. You might want to collect related variables together or move important variables to the front. By default relocate() moves variables to the front:

@@ -558,9 +531,7 @@ flights |>
 #> 5 2013-01-01 06:00:00      116  2013     1     1      554            600
 #> 6 2013-01-01 05:00:00      150  2013     1     1      554            558
 #> # … with 336,770 more rows, and 12 more variables: dep_delay <dbl>,
-#> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
-#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
-#> #   hour <dbl>, minute <dbl>
+#> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, …

        But you can use the same .before and .after arguments as mutate() to choose where to put them:

@@ -576,9 +547,7 @@ flights |>
 #> 5      600        -6      812            837       -25 DL        461
 #> 6      558        -4      740            728        12 UA       1696
 #> # … with 336,770 more rows, and 12 more variables: tailnum <chr>,
-#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
-#> #   minute <dbl>, time_hour <dttm>, year <int>, month <int>, day <int>,
-#> #   dep_time <int>
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, …
 
 flights |> relocate(starts_with("arr"), .before = dep_time)
 #> # A tibble: 336,776 × 19
@@ -591,13 +560,11 @@ flights |>
 #> 5  2013     1     1      812       -25      554            600        -6
 #> 6  2013     1     1      740        12      554            558        -4
 #> # … with 336,770 more rows, and 11 more variables: sched_arr_time <int>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

        Exercises

        @@ -629,7 +596,7 @@ Groups

-group_by()
+group_by()

        Use group_by() to divide your dataset into groups meaningful for your analysis:

@@ -646,16 +613,14 @@ Groups
 #> 5  2013     1     1      554            600        -6      812            837
 #> 6  2013     1     1      554            558        -4      740            728
 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

        group_by() doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”. group_by() doesn’t do anything by itself; instead it changes the behavior of the subsequent verbs.

-summarize()
+summarize()

The most important grouped operation is a summary, which collapses each group to a single row. In dplyr, this operation is performed by summarize() (or summarise(), if you prefer British English), as shown by the following example, which computes the average departure delay by month:

        @@ -717,7 +682,7 @@ Groups

        -Theslice_ functions

        +The slice_ functions

There are five handy functions that allow you to pick off specific rows within each group:

        • df |> slice_head(n = 1) takes the first row from each group.
@@ -745,9 +710,7 @@ The slice_ functions
 #> 5  2013     7    22     2257            759       898      121           1026
 #> 6  2013     7    10     2056           1505       351     2347           1758
 #> # … with 102 more rows, and 11 more variables: arr_delay <dbl>,
-#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> #   time_hour <dttm>
+#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

        This is similar to computing the max delay with summarize(), but you get the whole row instead of the single summary:
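A sketch of the slice_max() call that produces output like the above, alongside the summarize() equivalent (the dest/arr_delay columns are assumed from the book's flights example):

```{r}
# Whole row for the most-delayed flight to each destination
flights |>
  group_by(dest) |>
  slice_max(arr_delay, n = 1)

# Just the summary value
flights |>
  group_by(dest) |>
  summarize(max_delay = max(arr_delay, na.rm = TRUE))
```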

@@ -791,9 +754,7 @@ daily
 #> 5 2013 1 1 554 600 -6 812 837
 #> 6 2013 1 1 554 558 -4 740 728
 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
-#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
-#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
-#> # time_hour <dttm>
+#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn't a great way to make this function work, but it's difficult to change without breaking existing code. To make it obvious what's happening, dplyr displays a message that tells you how you can change this behavior:
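A hedged sketch of controlling that behavior with summarize()'s .groups argument (the daily grouping is assumed from the surrounding example):

```{r}
daily <- flights |> group_by(year, month, day)

# State the grouping result you want to silence the message
daily |>
  summarize(n = n(), .groups = "drop_last") # peel off the last group (the default)

daily |>
  summarize(n = n(), .groups = "drop")      # return an ungrouped tibble
```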

        @@ -834,7 +795,7 @@ Ungrouping

        As you can see, when you summarize an ungrouped data frame, you get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.


        Exercises

        1. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

        2. @@ -996,7 +957,7 @@ batters

          You can find a good explanation of this problem and how to overcome it at http://varianceexplained.org/r/empirical_bayes_baseball/ and https://www.evanmiller.org/how-not-to-sort-by-average-rating.html.


        Summary

In this chapter, you've learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like filter() and arrange()), those that manipulate the columns (like select() and mutate()), and those that manipulate groups (like group_by() and summarize()). In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with individual variables. We'll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.

        diff --git a/oreilly/data-visualize.html b/oreilly/data-visualize.html index 749f481..dabf4d3 100644 --- a/oreilly/data-visualize.html +++ b/oreilly/data-visualize.html @@ -1,6 +1,6 @@

        Data visualization


        Introduction

        @@ -9,30 +9,30 @@ Introduction

        R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.

        This chapter will teach you how to visualize your data using ggplot2. We will start by creating a simple scatterplot and use that to introduce aesthetic mappings and geometric objects – the fundamental building blocks of ggplot2. We will then walk you through visualizing distributions of single variables as well as visualizing relationships between two or more variables. We’ll finish off with saving your plots and troubleshooting tips.


        Prerequisites

-This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:
+This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running:

        library(tidyverse)
         #> ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
         #> ✔ dplyr     1.0.99.9000     ✔ readr     2.1.3      
        -#> ✔ forcats   0.5.2.9000      ✔ stringr   1.5.0.9000 
        +#> ✔ forcats   0.5.2           ✔ stringr   1.5.0      
         #> ✔ ggplot2   3.4.0.9000      ✔ tibble    3.1.8      
        -#> ✔ lubridate 1.9.0           ✔ tidyr     1.2.1.9001 
        +#> ✔ lubridate 1.9.0           ✔ tidyr     1.3.0      
         #> ✔ purrr     1.0.1           
         #> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
         #> ✖ dplyr::filter() masks stats::filter()
         #> ✖ dplyr::lag()    masks stats::lag()
        -#> ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
        +#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
-That one line of code loads the core tidyverse; packages which you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).
+That one line of code loads the core tidyverse; the packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded). (Footnote: You can eliminate that message and force conflict resolution to happen on demand by using the conflicted package, which becomes more important as you load more packages. You can learn more about conflicted at https://conflicted.r-lib.org.)

        If you run this code and get the error message there is no package called 'tidyverse', you’ll need to first install it, then run library() once again.

        install.packages("tidyverse")
         library(tidyverse)
-You only need to install a package once, but you need to reload it every time you start a new session.
+You only need to install a package once, but you need to load it every time you start a new session.

        In addition to tidyverse, we will also use the palmerpenguins package, which includes the penguins dataset containing body measurements for penguins on three islands in the Palmer Archipelago.

        library(palmerpenguins)
        @@ -47,20 +47,21 @@ First steps

        -Thepenguins data frame

        +The penguins data frame

You can test your answer with the penguins data frame found in palmerpenguins (a.k.a. palmerpenguins::penguins). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). penguins contains 344 observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER (Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218).

        penguins
         #> # A tibble: 344 × 8
        -#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
        -#>   <fct>   <fct>             <dbl>         <dbl>             <int>       <int>
        -#> 1 Adelie  Torgers…           39.1          18.7               181        3750
        -#> 2 Adelie  Torgers…           39.5          17.4               186        3800
        -#> 3 Adelie  Torgers…           40.3          18                 195        3250
        -#> 4 Adelie  Torgers…           NA            NA                  NA          NA
        -#> 5 Adelie  Torgers…           36.7          19.3               193        3450
        -#> 6 Adelie  Torgers…           39.3          20.6               190        3650
        -#> # … with 338 more rows, and 2 more variables: sex <fct>, year <int>
+#> species island bill_length_mm bill_depth_mm flipper_length_mm
+#> <fct> <fct> <dbl> <dbl> <int>
+#> 1 Adelie Torgersen 39.1 18.7 181
+#> 2 Adelie Torgersen 39.5 17.4 186
+#> 3 Adelie Torgersen 40.3 18 195
+#> 4 Adelie Torgersen NA NA NA
+#> 5 Adelie Torgersen 36.7 19.3 193
+#> 6 Adelie Torgersen 39.3 20.6 190
+#> # … with 338 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
+#> # year <int>

        This data frame contains 8 columns. For an alternative view, where you can see all variables and the first few observations of each variable, use glimpse(). Or, if you’re in RStudio, run View(penguins) to open an interactive data viewer.
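For example, a quick look at the same data with glimpse(), which prints one line per variable:

```{r}
glimpse(penguins)
```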

        @@ -239,7 +240,7 @@ Adding aesthetics and layers

        We finally have a plot that perfectly matches our “ultimate goal”!


        Exercises

        1. How many rows are in penguins? How many columns?

        2. @@ -410,7 +411,7 @@ ggplot(penguins, aes(x = body_mass_g)) +

        Exercises

        1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

        2. @@ -479,7 +480,7 @@ A numerical and a categorical variable
        3. Otherwise, we set the value of an aesthetic.

        Two categorical variables

We can use segmented bar plots to visualize the relationship between two categorical variables. In creating this bar chart, we map the variable that we want to divide the data into first to the x aesthetic, and the variable that we then use to further divide each group to the fill aesthetic.
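A sketch of the two common variants; the position = "fill" version for relative frequencies is an assumption based on ggplot2's standard API rather than something visible in this extract:

```{r}
# Counts, segmented by species
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

# Relative frequencies: each bar is scaled to height 1
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")
```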

        @@ -498,7 +499,7 @@ ggplot(penguins, aes(x = island, fill = species)) +

        Two numerical variables

        So far you’ve learned about scatterplots (created with geom_point()) and smooth curves (created with geom_smooth()) for visualizing the relationship between two numerical variables. A scatterplot is probably the most commonly used plot for visualizing the relationship between two variables.
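For instance, a minimal sketch using the penguin measurements used elsewhere in the chapter (variable choices assumed, not taken from this extract):

```{r}
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")
```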

        @@ -535,7 +536,7 @@ Three or more variables

        You will learn about many other geoms for visualizing distributions of variables and relationships between them in #chp-layers.


        Exercises

        1. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

        2. @@ -576,7 +577,7 @@ ggsave(filename = "my-plot.png")

          If you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about ggsave() in the documentation.
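A hedged sketch of a reproducible save (the file name and dimensions are illustrative):

```{r}
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# ggsave() saves the last plot displayed; specify size for reproducibility
ggsave(filename = "my-plot.png", width = 6, height = 4)
```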

          Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in #chp-quarto.


          Exercises

          1. @@ -607,7 +608,7 @@ Common problems

            If that doesn’t help, carefully read the error message. Sometimes the answer will be buried there! But when you’re new to R, even if the answer is in the error message, you might not yet know how to understand it. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.


          Summary

In this chapter, you've learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, color, size, and shape. You then learned about increasing the complexity and improving the presentation of your plots layer-by-layer. You also learned about commonly used plots for visualizing the distribution of a single variable as well as for visualizing relationships between two or more variables, by leveraging additional aesthetic mappings and/or splitting your plot into small multiples using faceting.

          diff --git a/oreilly/databases.html b/oreilly/databases.html index 1bc3dd3..17b94c7 100644 --- a/oreilly/databases.html +++ b/oreilly/databases.html @@ -1,12 +1,12 @@

          Databases


          Introduction

          A huge amount of data lives in databases, so it’s essential that you know how to access it. Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.

In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL (SQL is either pronounced "s"-"q"-"l" or "sequel") query. SQL, short for structured query language, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we're not going to start with SQL, but instead we'll teach you dbplyr, which can translate your dplyr code to SQL. We'll use that as a way to teach you some of the most important features of SQL. You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.


          Prerequisites

          In this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.
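As a concrete starting point, a connection sketch using an in-memory duckdb database (any DBI backend works similarly; copying in ggplot2::diamonds is illustrative):

```{r}
library(DBI)

# Create an in-memory DuckDB database and copy in a familiar dataset
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)
```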

          @@ -148,7 +148,7 @@ as_tibble(dbGetQuery(con, sql))

You'll need to be a little careful with dbGetQuery() since it can potentially return more data than will fit in memory. We won't discuss it further here, but if you're dealing with very large datasets it's possible to process a "page" of data at a time by using dbSendQuery() to get a "result set" which you can page through by calling dbFetch() until dbHasCompleted() returns TRUE.
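A minimal sketch of that paging pattern (the query and page size are illustrative):

```{r}
res <- dbSendQuery(con, "SELECT * FROM diamonds")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 10000) # one "page" of up to 10,000 rows
  # ... process chunk here ...
}
dbClearResult(res)
```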


          Other functions

          There are lots of other functions in DBI that you might find useful if you’re managing your own data (like dbWriteTable() which we used in #sec-load-data), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.

          @@ -164,7 +164,7 @@ dbplyr basics
          diamonds_db <- tbl(con, "diamonds")
           diamonds_db
           #> # Source:   table<diamonds> [?? x 10]
          -#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
          +#> # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
           #>   carat cut       color clarity depth table price     x     y     z
           #>   <dbl> <fct>     <fct> <fct>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
           #> 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
          @@ -175,25 +175,24 @@ diamonds_db
           #> 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
           #> # … with more rows

+There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:
+
+diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
+diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
+
+Other times you might want to use your own SQL query as a starting point:
+
+diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))

      Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue on GitHub to help us do better.


      This object is lazy; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:

      @@ -203,7 +202,7 @@ FROM `planes`
big_diamonds_db
 #> # Source:   SQL [?? x 5]
-#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
+#> # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
 #> carat cut color clarity price
 #> <dbl> <fct> <fct> <fct> <int>
 #> 1 1.54 Premium E VS2 15002
@@ -304,25 +303,16 @@ planes |> show_query()
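Because the object is lazy, you can inspect the generated SQL or pull the results into R explicitly; a sketch assuming the big_diamonds_db pipeline shown above:

```{r}
# See the SQL that dbplyr generated
big_diamonds_db |> show_query()

# Execute the query and bring the results back as a tibble
big_diamonds <- big_diamonds_db |> collect()
```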
      • In SQL, case doesn’t matter: you can write select, SELECT, or even SeLeCt. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variables names.
• In SQL, order matters: you must always write the clauses in the order SELECT, FROM, WHERE, GROUP BY, ORDER BY. Confusingly, this order doesn’t match how the clauses are actually evaluated, which is first FROM, then WHERE, GROUP BY, SELECT, and ORDER BY.

      The following sections explore each clause in more detail.

@@ -356,26 +346,23 @@ planes |>
 #> FROM planes

      This example also shows you how SQL does renaming. In SQL terminology renaming is called aliasing and is done with AS. Note that unlike mutate(), the old name is on the left and the new name is on the right.

+In the examples above note that "year" and "type" are wrapped in double quotes. That’s because these are reserved words in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.
+
+When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.
+
+SELECT "tailnum", "type", "manufacturer", "model", "year"
+FROM "planes"
+
+Some other database systems use backticks instead of quotes:
+
+SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
+FROM `planes`

      The translations for mutate() are similarly straightforward: each variable becomes a new expression in SELECT:

      flights |> 
      @@ -461,7 +448,7 @@ flights |>
       #> Use `na.rm = TRUE` to silence this warning
       #> This warning is displayed once every 8 hours.
       #> # Source:   SQL [?? x 2]
      -#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
      +#> # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
       #>   dest   delay
       #>   <chr>  <dbl>
       #> 1 ATL   11.3  
      @@ -552,7 +539,7 @@ Subqueries
       

      Sometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.


      Joins

      If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:

      @@ -597,7 +584,7 @@ Other verbs

      dbplyr also translates other verbs like distinct(), slice_*(), and intersect(), and a growing selection of tidyr functions like pivot_longer() and pivot_wider(). The easiest way to see the full set of what’s currently available is to visit the dbplyr website: https://dbplyr.tidyverse.org/reference/.


      Exercises

      1. What is distinct() translated to? How about head()?

      2. @@ -731,7 +718,7 @@ flights |>

        dbplyr also translates common string and date-time manipulation functions, which you can learn about in vignette("translation-function", package = "dbplyr"). dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.


      Summary

In this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s the most commonly used language for working with data, and knowing some will make it easier for you to communicate with other data folks who don’t use R. If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:

      diff --git a/oreilly/datetimes.html b/oreilly/datetimes.html index bc013ce..808b107 100644 --- a/oreilly/datetimes.html +++ b/oreilly/datetimes.html @@ -1,6 +1,6 @@

      Dates and times


      Introduction

      This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get!

      @@ -8,14 +8,12 @@ Introduction

      Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won’t teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.

      We’ll begin by showing you how to create date-times from various inputs, and then once you’ve got a date-time, how you can extract components like year, month, and day. We’ll then dive into the tricky topic of working with time spans, which come in a variety of flavors depending on what you’re trying to do. We’ll conclude with a brief discussion of the additional challenges posed by time zones.


      Prerequisites

-This chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when you’re working with dates/times. We will also need nycflights13 for practice data.
+This chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. As of the latest tidyverse release, lubridate is part of the core tidyverse, so you don’t need to load it separately. We will also need nycflights13 for practice data.

 library(tidyverse)
-library(lubridate)
 library(nycflights13)
      @@ -33,9 +31,9 @@ Creating date/times

      To get the current date or date-time you can use today() or now():

      today()
      -#> [1] "2023-01-12"
      +#> [1] "2023-01-26"
       now()
      -#> [1] "2023-01-12 17:04:08 CST"
      +#> [1] "2023-01-26 10:32:54 CST"

      Otherwise, the following sections describe the four ways you’re likely to create a date/time:

      • While reading a file with readr.
      • @@ -281,9 +279,9 @@ From other types

        You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():

        as_datetime(today())
        -#> [1] "2023-01-12 UTC"
        +#> [1] "2023-01-26 UTC"
         as_date(now())
        -#> [1] "2023-01-12"
        +#> [1] "2023-01-26"

        Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().
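For example (36,000 seconds is 10 hours, and 3,652 days after 1970-01-01 is 1980-01-01, counting the two leap years in between):

```{r}
as_datetime(60 * 60 * 10)
#> [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
#> [1] "1980-01-01"
```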

        @@ -294,7 +292,7 @@ as_date(365 * 10 + 2)

      Exercises

      1. @@ -474,7 +472,7 @@ update(ymd("2023-02-01"), hour = 400)

      Exercises

      1. How does the distribution of flight times within a day change over the course of the year?

      2. @@ -507,12 +505,12 @@ Durations
        # How old is Hadley?
         h_age <- today() - ymd("1979-10-14")
         h_age
        -#> Time difference of 15796 days
        +#> Time difference of 15810 days

        A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.

        as.duration(h_age)
        -#> [1] "1364774400s (~43.25 years)"
        +#> [1] "1365984000s (~43.29 years)"

        Durations come with a bunch of convenient constructors:

@@ -530,7 +528,7 @@ dweeks(3)
 dyears(1)
 #> [1] "31557600s (~1 years)"
-Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year is uses the “average” number of days in a year, i.e. 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.
+Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.

        You can add and multiply durations:

        2 * dyears(1)
        @@ -545,14 +543,14 @@ last_year <- today() - dyears(1)

        However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:

-one_pm <- ymd_hms("2026-03-12 13:00:00", tz = "America/New_York")
+one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")

-one_pm
-#> [1] "2026-03-12 13:00:00 EDT"
-one_pm + ddays(1)
-#> [1] "2026-03-13 13:00:00 EDT"
+one_am
+#> [1] "2026-03-08 01:00:00 EST"
+one_am + ddays(1)
+#> [1] "2026-03-09 02:00:00 EDT"
-Why is one day after 1pm March 12, 2pm March 13? If you look carefully at the date you might also notice that the time zones have changed. March 12 only has 23 hours because it’s when DST starts, so if we add a full day’s worth of seconds we end up with a different time.
+Why is one day after 1am March 8, 2am March 9? If you look carefully at the date you might also notice that the time zones have changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full day’s worth of seconds we end up with a different time.

@@ -560,10 +558,10 @@ one_pm + ddays(1)

Periods

To solve this problem, lubridate provides periods. Periods are time spans that don’t have a fixed length in seconds; instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:

-one_pm
-#> [1] "2026-03-12 13:00:00 EDT"
-one_pm + days(1)
-#> [1] "2026-03-13 13:00:00 EDT"
+one_am
+#> [1] "2026-03-08 01:00:00 EST"
+one_am + days(1)
+#> [1] "2026-03-09 01:00:00 EDT"

      Like durations, periods can be created with a number of friendly constructor functions.
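A few of those constructors (a sketch; output format follows lubridate's period printing):

```{r}
hours(c(12, 24))
#> [1] "12H 0M 0S" "24H 0M 0S"
days(7)
#> [1] "7d 0H 0M 0S"
months(1:6)
```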

@@ -591,10 +589,10 @@ ymd("2024-01-01") + years(1)
 #> [1] "2025-01-01"

 # Daylight Savings Time
-one_pm + ddays(1)
-#> [1] "2026-03-13 13:00:00 EDT"
-one_pm + days(1)
-#> [1] "2026-03-13 13:00:00 EDT"
+one_am + ddays(1)
+#> [1] "2026-03-09 02:00:00 EDT"
+one_am + days(1)
+#> [1] "2026-03-09 01:00:00 EDT"

      Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.

      @@ -668,7 +666,7 @@ y2024 / days(1)

      Exercises

      1. Explain days(overnight * 1) to someone who has just started learning R. How does it work?

      2. @@ -694,7 +692,7 @@ Time zones

        And see the complete list of all time zone names with OlsonNames():

        length(OlsonNames())
        -#> [1] 596
        +#> [1] 597
         head(OlsonNames())
         #> [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
         #> [4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"
        @@ -755,7 +753,7 @@ x4b - x4

      Summary

This chapter has introduced you to the tools that lubridate provides to help you work with date-time data. Working with dates and times can seem harder than necessary, but hopefully this chapter has helped you see why — date-times are more complex than they seem at first glance, and handling every possible situation adds complexity. Even if your data never crosses a daylight saving time boundary or involves a leap year, the functions need to be able to handle it.

      diff --git a/oreilly/factors.html b/oreilly/factors.html index cacee89..5da74f3 100644 --- a/oreilly/factors.html +++ b/oreilly/factors.html @@ -1,12 +1,12 @@

      Factors


      Introduction

      Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

      We’ll start by motivating why factors are needed for data analysis and how you can create them with factor(). We’ll then introduce you to the gss_cat dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.


      Prerequisites

      Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the forcats package, which is part of the core tidyverse. It provides tools for dealing with categorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.

      @@ -114,15 +114,16 @@ General Social Survey
      gss_cat
       #> # A tibble: 21,483 × 9
      -#>    year marital         age race  rincome        partyid  relig denom tvhours
      -#>   <int> <fct>         <int> <fct> <fct>          <fct>    <fct> <fct>   <int>
      -#> 1  2000 Never married    26 White $8000 to 9999  Ind,nea… Prot… Sout…      12
      -#> 2  2000 Divorced         48 White $8000 to 9999  Not str… Prot… Bapt…      NA
      -#> 3  2000 Widowed          67 White Not applicable Indepen… Prot… No d…       2
      -#> 4  2000 Never married    39 White Not applicable Ind,nea… Orth… Not …       4
      -#> 5  2000 Divorced         25 White Not applicable Not str… None  Not …       1
      -#> 6  2000 Married          25 White $20000 - 24999 Strong … Prot… Sout…      NA
      -#> # … with 21,477 more rows
+#> year marital age race rincome partyid
+#> <int> <fct> <int> <fct> <fct> <fct>
+#> 1 2000 Never married 26 White $8000 to 9999 Ind,near rep
+#> 2 2000 Divorced 48 White $8000 to 9999 Not str republican
+#> 3 2000 Widowed 67 White Not applicable Independent
+#> 4 2000 Never married 39 White Not applicable Ind,near rep
+#> 5 2000 Divorced 25 White Not applicable Not str democrat
+#> 6 2000 Married 25 White $20000 - 24999 Strong democrat
+#> # … with 21,477 more rows, and 3 more variables: relig <fct>, denom <fct>,
+#> # tvhours <int>

      (Remember, since this dataset is provided by a package, you can get more information about the variables with ?gss_cat.)

      When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with count():

@@ -136,14 +137,6 @@ General Social Survey
 #> 2 Black  3129
 #> 3 White 16395
-

-Or with a bar chart:
-
-ggplot(gss_cat, aes(x = race)) +
-  geom_bar()
-
-A bar chart showing the distribution of race. There are ~2000 records with race "Other", 3000 with race "Black", and other 15,000 with race "White".

      When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.

@@ -171,7 +164,7 @@ Modifying factor order
 ggplot(relig_summary, aes(x = tvhours, y = relig)) +
   geom_point()
-A scatterplot of with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly aribtrarily making it hard to get any sense of overall pattern.
+A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern.

      It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:

      @@ -184,7 +177,7 @@ ggplot(relig_summary, aes(x = tvhours, y = relig)) +
      ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
         geom_point()
-The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. "Other eastern" has the fewest tvhours under 2, and "Don't know" has the highest (over 5).
+The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. "Other eastern" has the fewest tvhours (under 2), and "Don't know" has the highest (over 5).

      Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.

@@ -210,7 +203,7 @@ ggplot(relig_summary, aes(x = tvhours, y = relig)) +
 ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
   geom_point()
-A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then <$1000, then $8000-9999.
+A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then <$1000, then $8000-9999.

      Here, arbitrarily reordering the levels isn’t a good idea! That’s because rincome already has a principled order that we shouldn’t mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.

      @@ -219,20 +212,13 @@ ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
      ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable"))) +
         geom_point()
-The same scatterplot but now "Not Applicable" is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is "Not applicable".
+The same scatterplot but now "Not Applicable" is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is "Not applicable".

      Why do you think the average age for “Not applicable” is so high?

      Another type of reordering is useful when you are coloring the lines on a plot. fct_reorder2(f, x, y) reorders the factor f by the y values associated with the largest x values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.

-#|
-#|     Rearranging the legend makes the plot easier to read because the
-#|     legend colors now match the order of the lines on the far right
-#|     of the plot. You can see some unsuprising patterns: the proportion
-#|     never marred decreases with age, married forms an upside down U
-#|     shape, and widowed starts off low but increases steeply after age
-#|     60.
-by_age <- gss_cat |>
+by_age <- gss_cat |>
   filter(!is.na(age)) |>
   count(age, marital) |>
   group_by(age) |>
      @@ -249,10 +235,10 @@ ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop)))
       
-A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot.
+A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. You can see some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60.

-A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot.
+A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. You can see some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60.

@@ -264,11 +250,11 @@ ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop)))
   ggplot(aes(x = marital)) +
   geom_bar()
-A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000).
+A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000).


      Exercises

      1. There are some suspiciously high numbers in tvhours. Is the mean a good summary?

      2. @@ -402,7 +388,7 @@ Modifying factor levels

        Read the documentation to learn about fct_lump_min() and fct_lump_prop() which are useful in other cases.


        Exercises

        1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

        2. @@ -426,7 +412,7 @@ Ordered factors

          Given the arguable utility of these differences, we don’t generally recommend using ordered factors.


        Summary

This chapter introduced you to the handy forcats package for working with factors, introducing you to the most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a factor analysis challenge that you haven’t encountered before, we highly recommend skimming the reference index to see if there’s a canned function that can help solve your problem.

        diff --git a/oreilly/functions.html b/oreilly/functions.html index b4b810f..0b0f517 100644 --- a/oreilly/functions.html +++ b/oreilly/functions.html @@ -1,6 +1,6 @@

        Functions


        Introduction

        One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:

        @@ -13,7 +13,7 @@ Introduction
      3. Plot functions that take a data frame as input and return a plot as output.
Each of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn’t be possible without the help of folks on twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for general functions and plotting functions to see even more functions.


        Prerequisites

        We’ll wrap up a variety of functions from around the tidyverse. We’ll also use nycflights13 as a source of familiar data to use our functions with.

        @@ -273,13 +273,18 @@ mape <- function(actual, predicted) {

RStudio

Once you start writing functions, there are two RStudio shortcuts that are super useful:

• To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2.

• To quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.

        Exercises

        1. @@ -610,7 +615,7 @@ diamonds |> count_wide(c(clarity, color), cut)

          While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the pivot_wider() docs you can see that names_from uses tidy-selection.


        Exercises

        1. @@ -691,9 +696,6 @@ diamonds |> histogram(carat, 0.1)
          diamonds |> 
             histogram(carat, 0.1) +
             labs(x = "Size (in carats)", y = "Number of diamonds")
@@ -706,15 +708,13 @@ linearity_check <- function(df, x, y) {
   df |>
     ggplot(aes(x = {{ x }}, y = {{ y }})) +
     geom_point() +
-    geom_smooth(method = "loess", color = "red", se = FALSE) +
-    geom_smooth(method = "lm", color = "blue", se = FALSE)
+    geom_smooth(method = "loess", formula = y ~ x, color = "red", se = FALSE) +
+    geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE)
 }

 starwars |>
   filter(mass < 1000) |>
-  linearity_check(mass, height)
-#> `geom_smooth()` using formula = 'y ~ x'
-#> `geom_smooth()` using formula = 'y ~ x'
+  linearity_check(mass, height)

@@ -837,15 +837,6 @@ density <- function(color, facets, binwidth = 0.1) {
 density()
 density(cut)
 density(cut, clarity)
-
          @@ -880,7 +871,7 @@ diamonds |> histogram(carat, 0.1)

          You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.


        Exercises

        Build up a rich plotting function by incrementally implementing each of the steps below:

        @@ -926,7 +917,7 @@ density <- function(color, facets, binwidth = 0.1) {

        As you can see we recommend putting extra spaces inside of {{ }}. This makes it very obvious that something unusual is happening.


        Exercises

        1. @@ -946,7 +937,7 @@ f3 <- function(x, y) {

        Summary

In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.

        diff --git a/oreilly/intro.html b/oreilly/intro.html index aa44f84..61b138c 100644 --- a/oreilly/intro.html +++ b/oreilly/intro.html @@ -57,7 +57,7 @@ Python, Julia, and friends

        Prerequisites

        We’ve made a few assumptions about what you already know to get the most out of this book. You should be generally numerically literate, and it’s helpful if you have some programming experience already. If you’ve never programmed before, you might find Hands on Programming with R by Garrett to be a valuable adjunct to this book.

        @@ -99,16 +99,16 @@ The tidyverse
        library(tidyverse)
         #> ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
         #> ✔ dplyr     1.0.99.9000     ✔ readr     2.1.3      
        -#> ✔ forcats   0.5.2.9000      ✔ stringr   1.5.0.9000 
        +#> ✔ forcats   0.5.2           ✔ stringr   1.5.0      
         #> ✔ ggplot2   3.4.0.9000      ✔ tibble    3.1.8      
        -#> ✔ lubridate 1.9.0           ✔ tidyr     1.2.1.9001 
        +#> ✔ lubridate 1.9.0           ✔ tidyr     1.3.0      
         #> ✔ purrr     1.0.1           
         #> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
         #> ✖ dplyr::filter() masks stats::filter()
         #> ✖ dplyr::lag()    masks stats::lag()
        -#> ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
+#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
-

-This tells you that tidyverse loads eight packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats. These are considered the core of the tidyverse because you’ll use them in almost every analysis.
+This tells you that tidyverse loads nine packages: dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, and tidyr. These are considered the core of the tidyverse because you’ll use them in almost every analysis.

        Packages in the tidyverse change fairly frequently. You can check whether updates are available and optionally install them by running tidyverse_update().

        @@ -116,11 +116,16 @@ The tidyverse

        Other packages

        There are many other excellent packages that are not part of the tidyverse because they solve problems in a different domain or are designed with a different set of underlying principles. This doesn’t make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse but many other universes of interrelated packages. As you tackle more data science projects with R, you’ll learn new packages and new ways of thinking about data.

-In this book, we’ll use five data packages from outside the tidyverse:
+We’ll use many packages from outside the tidyverse in this book. For example, we use the following four data packages to provide interesting applications:

-install.packages(c("gapminder", "Lahman", "nycflights13", "palmerpenguins", "wakefield"))
+install.packages(c("babynames", "gapminder", "nycflights13", "palmerpenguins"))

-These packages provide data on world development, baseball, airline flights, and body measurements of penguins that we’ll use to illustrate key data science ideas, while the final one helps generate random data sets.

+We’ll also use a selection of other packages for one-off examples. You don’t need to install them now, just remember that whenever you see an error like this:
+
+library(ggrepel)
+#> Error in library(ggrepel) : there is no package called ‘ggrepel’

You need to run install.packages("ggrepel") to install the package.

@@ -177,17 +182,17 @@ Colophon
 1.1.0.9000
 local
 dbplyr
-2.2.1.9000
+2.3.0.9000
 local
 dplyr
 1.0.99.9000
-Github (tidyverse/dplyr@f4bece54fb56e10d7ae6a3bb27f2afedd65683ca)
+Github (tidyverse/dplyr@6a1d46965a0f3ac180456e16bbe004755ec8488e)
 dtplyr
 1.2.2
 CRAN (R 4.2.0)
 forcats
-0.5.2.9000
-local
+0.5.2
+CRAN (R 4.2.0)
 ggplot2
 3.4.0.9000
 Github (tidyverse/ggplot2@4fea51b1eb2cdacebeacf425627dcbc1d61a5d3e)
@@ -246,14 +251,14 @@ Colophon
 1.0.3
 CRAN (R 4.2.0)
 stringr
-1.5.0.9000
-Github (tidyverse/stringr@e4601f7fdb125faafbd028cb9e32d23ef2d1efed)
+1.5.0
+CRAN (R 4.2.0)
 tibble
 3.1.8
 CRAN (R 4.2.0)
 tidyr
-1.2.1.9001
-local
+1.3.0
+CRAN (R 4.2.0)
 tidyverse
 1.3.2.9000
 Github (tidyverse/tidyverse@aeabcde8c6ae435f16b5173682d5667d292829fb)
@@ -261,9 +266,6 @@ Colophon
 1.3.3
 CRAN (R 4.2.0)
-
-cli:::ruler()
        diff --git a/oreilly/iteration.html b/oreilly/iteration.html index 0a39f1f..8fef9bc 100644 --- a/oreilly/iteration.html +++ b/oreilly/iteration.html @@ -1,6 +1,6 @@

        Iteration


        Introduction

        In this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector x in R, you can just write 2 * x. In most other languages, you’d need to explicitly double each element of x using some sort of for loop.

@@ -13,17 +13,19 @@ Introduction
 unnest_wider() and unnest_longer() create new rows and columns for each element of a list-column.

        Now it’s time to learn some more general tools, often called functional programming tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter we’ll keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.


        Prerequisites

-This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live life on the edge you can get the dev version with devtools::install_github(c( "tidyverse/dplyr")).
+This chapter relies on features only found in purrr 1.0.0 and dplyr 1.1.0, which are still in development. If you want to live life on the edge you can get the dev version with devtools::install_github(c("tidyverse/purrr", "tidyverse/dplyr")).

        In this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but purrr is new. We’re just going to use a couple of purrr functions in this chapter, but it’s a great package to explore as you improve your programming skills.

        @@ -73,7 +75,7 @@ Modifying multiple columns

-Selecting columns with.cols
+Selecting columns with .cols

        The first argument to across(), .cols, selects the columns to transform. This uses the same specifications as select(), #sec-select, so you can use functions like starts_with() and ends_with() to select columns based on their name.

        There are two additional selection techniques that are particularly useful for across(): everything() and where(). everything() is straightforward: it selects every (non-grouping) column:
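The example that followed isn't visible in this extract; a minimal sketch with an assumed toy data frame:

```{r}
df <- tibble(
  grp = sample(2, 10, replace = TRUE),
  a = rnorm(10),
  b = rnorm(10)
)

# everything() transforms every column except the grouping column
df |>
  group_by(grp) |>
  summarize(across(everything(), median))
```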

        @@ -316,12 +318,10 @@ df_miss |> filter(if_all(a:d, is.na))

        -across() in functions

        +across() in functions

across() is particularly useful to program with because it allows you to operate on multiple columns. For example, Jacob Scott uses this little helper, which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:

        -
        library(lubridate)
        -
        -expand_dates <- function(df) {
        +
        expand_dates <- function(df) {
           df |> 
             mutate(
               across(where(is.Date), list(year = year, month = month, day = mday))
        @@ -382,7 +382,7 @@ diamonds |>
         
         

-Vspivot_longer()
+Vs pivot_longer()

Before we go on, it’s worth pointing out an interesting connection between across() and pivot_longer() (#sec-pivoting). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:
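A sketch of the connection with a made-up two-column data frame (the chapter's own example is elided here):

```r
library(tidyverse)

df <- tibble(a = rnorm(10), b = rnorm(10))

# Column-wise, with across()
df |>
  summarize(across(a:b, list(median = median, mean = mean)))

# The same numbers, computed by group after pivoting
df |>
  pivot_longer(a:b) |>
  group_by(name) |>
  summarize(median = median(value), mean = mean(value))
```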

        @@ -472,7 +472,7 @@ df_long |>

        If needed, you could pivot_wider() this back to the original form.

        -
        +

        Exercises

        1. Compute the number of unique values in each column of palmerpenguins::penguins.

        2. @@ -535,7 +535,7 @@ paths
        -
        +

        Lists

        Now that we have these 12 paths, we could call read_excel() 12 times to get 12 data frames:

        @@ -575,7 +575,7 @@ gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")

        -purrr::map() and list_rbind() +purrr::map() and list_rbind()

The code to collect those data frames in a list “by hand” is basically just as tedious to type as code that reads the files one-by-one. Happily, we can use purrr::map() to make even better use of our paths vector. map() is similar to across(), but instead of doing something to each column in a data frame, it does something to each element of a vector. map(x, f) is shorthand for:
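The elided expansion is essentially a list built element by element. Concretely, with a toy list of our own (not the chapter's Excel files):

```r
library(purrr)

x <- list(1:3, 4:6, 7:9)

# map() calls the function once per element and returns a list of results
map(x, sum)
#> [[1]]
#> [1] 6
#> [[2]]
#> [1] 15
#> [[3]]
#> [1] 24
```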

        @@ -919,7 +919,7 @@ DBI::dbCreateTable(con, "gapminder", template)
        con |> tbl("gapminder")
         #> # Source:   table<gapminder> [0 x 6]
        -#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
        +#> # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
         #> # … with 6 variables: country <chr>, continent <chr>, lifeExp <dbl>,
         #> #   pop <dbl>, gdpPercap <dbl>, year <dbl>
        @@ -932,7 +932,7 @@ DBI::dbCreateTable(con, "gapminder", template) DBI::dbAppendTable(con, "gapminder", df) }
        -

        Now we need to call append_csv() once for each element of paths. That’s certainly possible with map():

        +

        Now we need to call append_file() once for each element of paths. That’s certainly possible with map():

        paths |> map(append_file)
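Since only the side effect (appending rows to the database) matters here, purrr's walk(), which works like map() but discards the return values, is arguably the more idiomatic choice; a sketch:

```r
# Same iteration, but the returned list is thrown away
paths |> walk(append_file)
```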
        @@ -946,7 +946,7 @@ DBI::dbCreateTable(con, "gapminder", template) tbl("gapminder") |> count(year) #> # Source: SQL [?? x 2] -#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:] +#> # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:] #> year n #> <dbl> <dbl> #> 1 1952 142 @@ -1071,7 +1071,7 @@ ggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)
        -
        +

        Summary

        In this chapter, you’ve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading the Functionals chapter of Advanced R and consulting the purrr website.

diff --git a/oreilly/joins.html b/oreilly/joins.html
index 9f1e82e..417dae3 100644
--- a/oreilly/joins.html
+++ b/oreilly/joins.html
@@ -1,6 +1,6 @@

        Joins

        -
        +

        Introduction

        It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in. This chapter will introduce you to two important types of joins:

        @@ -8,7 +8,7 @@ Introduction
      5. Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.
      6. We’ll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in the nycflights13 datasets, then use that knowledge to start joining data frames together. Next we’ll discuss how joins work, focusing on their action on the rows. We’ll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.

        -
        +

        Prerequisites

        In this chapter, we’ll explore the five related datasets from nycflights13 using the join functions from dplyr.

        @@ -22,7 +22,7 @@ library(nycflights13)

        Keys

        -

        To understand joins, you need to first understand how two tables can be connected through a pair of keys, with on each table. In this section, you’ll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You’ll also learn how to check that your keys are valid, and what to do if your table lacks a key.

        +

        To understand joins, you need to first understand how two tables can be connected through a pair of keys, within each table. In this section, you’ll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You’ll also learn how to check that your keys are valid, and what to do if your table lacks a key.

        @@ -46,51 +46,52 @@ Primary and foreign keys

7. airports records data about each airport. You can identify each airport by its three-letter airport code, making faa the primary key.

        -
        +
        airports
         #> # A tibble: 1,458 × 8
        -#>   faa   name                             lat   lon   alt    tz dst   tzone   
        -#>   <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>   
        -#> 1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America…
        -#> 2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America…
        -#> 3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America…
        -#> 4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America…
        -#> 5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America…
        -#> 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America…
        -#> # … with 1,452 more rows
+#> faa name lat lon alt tz dst
+#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
+#> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A
+#> 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A
+#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A
+#> 4 06N Randall Airport 41.4 -74.4 523 -5 A
+#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A
+#> 6 0A9 Elizabethton Municipal Airpo… 36.4 -82.2 1593 -5 A
+#> # … with 1,452 more rows, and 1 more variable: tzone <chr>
      8. planes records data about each plane. You can identify a plane by its tail number, making tailnum the primary key.

        -
        +
        planes
         #> # A tibble: 3,322 × 9
        -#>   tailnum  year type            manufacturer model engines seats speed engine
        -#>   <chr>   <int> <chr>           <chr>        <chr>   <int> <int> <int> <chr> 
        -#> 1 N10156   2004 Fixed wing mul… EMBRAER      EMB-…       2    55    NA Turbo…
        -#> 2 N102UW   1998 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
        -#> 3 N103US   1999 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
        -#> 4 N104UW   1999 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
        -#> 5 N10575   2002 Fixed wing mul… EMBRAER      EMB-…       2    55    NA Turbo…
        -#> 6 N105UW   1999 Fixed wing mul… AIRBUS INDU… A320…       2   182    NA Turbo…
        -#> # … with 3,316 more rows
+#> tailnum year type manufacturer model engines
+#> <chr> <int> <chr> <chr> <chr> <int>
+#> 1 N10156 2004 Fixed wing multi… EMBRAER EMB-145XR 2
+#> 2 N102UW 1998 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
+#> 3 N103US 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
+#> 4 N104UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
+#> 5 N10575 2002 Fixed wing multi… EMBRAER EMB-145LR 2
+#> 6 N105UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
+#> # … with 3,316 more rows, and 3 more variables: seats <int>,
+#> # speed <int>, engine <chr>
      9. weather records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making origin and time_hour the compound primary key.

        -
        +
        weather
         #> # A tibble: 26,115 × 15
        -#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
        -#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
        -#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
        -#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
        -#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
        -#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
        -#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
        -#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
        -#> # … with 26,109 more rows, and 5 more variables: wind_gust <dbl>,
        -#> #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
+#> origin year month day hour temp dewp humid wind_dir
+#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
+#> 1 EWR 2013 1 1 1 39.0 26.1 59.4 270
+#> 2 EWR 2013 1 1 2 39.0 27.0 61.6 250
+#> 3 EWR 2013 1 1 3 39.0 28.0 64.4 240
+#> 4 EWR 2013 1 1 4 39.9 28.0 62.2 250
+#> 5 EWR 2013 1 1 5 39.0 28.0 64.4 260
+#> 6 EWR 2013 1 1 6 37.9 28.0 67.2 240
+#> # … with 26,109 more rows, and 6 more variables: wind_speed <dbl>,
+#> # wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>, …
      10. A foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example:

        @@ -139,23 +140,20 @@ weather |> filter(is.na(tailnum)) #> # A tibble: 0 × 9 #> # … with 9 variables: tailnum <chr>, year <int>, type <chr>, -#> # manufacturer <chr>, model <chr>, engines <int>, seats <int>, -#> # speed <int>, engine <chr> +#> # manufacturer <chr>, model <chr>, engines <int>, seats <int>, … weather |> filter(is.na(time_hour) | is.na(origin)) #> # A tibble: 0 × 15 #> # … with 15 variables: origin <chr>, year <int>, month <int>, day <int>, -#> # hour <int>, temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, -#> # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>, -#> # visib <dbl>, time_hour <dttm> +#> # hour <int>, temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>, …
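The checks compressed in the hunk above are easier to read spelled out. A sketch of the count()-based uniqueness check for a candidate primary key:

```r
library(dplyr)
library(nycflights13)

# A primary key should uniquely identify each row;
# any n greater than one would reveal a duplicate
planes |>
  count(tailnum) |>
  filter(n > 1)

weather |>
  count(time_hour, origin) |>
  filter(n > 1)
```

Zero rows back means each candidate key really is unique; duplicates, if any, would be listed.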

        Surrogate keys

        -

        So far we haven’t talked about the primary key for flights. It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if have some way to describe them to others.

        +

        So far we haven’t talked about the primary key for flights. It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if we have some way to describe them to others.

        After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:

        flights |> 
        @@ -190,14 +188,12 @@ flights2
         #> 5     5  2013     1     1      554            600        -6      812
         #> 6     6  2013     1     1      554            558        -4      740
         #> # … with 336,770 more rows, and 12 more variables: sched_arr_time <int>,
        -#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
        -#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
        -#> #   minute <dbl>, time_hour <dttm>
        +#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, …

Surrogate keys can be particularly useful when communicating to other humans: it’s much easier to tell someone to take a look at flight 2001 than to say, look at UA430, which departed at 9am on 2013-01-03.

        -
        +

        Exercises

        1. We forgot to draw the relationship between weather and airports in #fig-flights-relationships. What is the relationship and how should it appear in the diagram?

        2. @@ -211,7 +207,7 @@ Exercises

          Basic joins

          -

          Now that you understand how data frames are connected via keys, we can start using joins to better understand the flights dataset. dplyr provides six join functions: left_join(), inner_join(), right_join(), semi_join(), and anti_join(). They all have the same interface: they take a pair of data frames (x and y) and return a data frame. The order of the rows and columns in the output is primarily determined by x.

          +

          Now that you understand how data frames are connected via keys, we can start using joins to better understand the flights dataset. dplyr provides six join functions: left_join(), inner_join(), right_join(), semi_join(), anti_join(), and full_join(). They all have the same interface: they take a pair of data frames (x and y) and return a data frame. The order of the rows and columns in the output is primarily determined by x.

In this section, you’ll learn how to use one mutating join, left_join(), and two filtering joins, semi_join() and anti_join(). In the next section, you’ll learn exactly how these functions work, and about the remaining inner_join(), right_join(), and full_join().

@@ -271,15 +267,15 @@ flights2
   left_join(planes |> select(tailnum, type, engines, seats))
 #> Joining with `by = join_by(tailnum)`
 #> # A tibble: 336,776 × 9
-#> year time_hour origin dest tailnum carrier type engines seats
-#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <int> <int>
-#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed… 2 149
-#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed… 2 149
-#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed… 2 178
-#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed… 2 200
-#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed… 2 178
-#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed… 2 191
-#> # … with 336,770 more rows
+#> year time_hour origin dest tailnum carrier type
+#> <int> <dttm> <chr> <chr> <chr> <chr> <chr>
+#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed wing multi en…
+#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed wing multi en…
+#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed wing multi en…
+#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed wing multi en…
+#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed wing multi en…
+#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wing multi en…
+#> # … with 336,770 more rows, and 2 more variables: engines <int>, seats <int>

          When left_join() fails to find a match for a row in x, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number N3ALAA so the type, engines, and seats will be missing:

          @@ -326,16 +322,16 @@ Specifying join keys
          flights2 |> 
             left_join(planes, join_by(tailnum))
           #> # A tibble: 336,776 × 14
          -#>   year.x time_hour           origin dest  tailnum carrier year.y type        
          -#>    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>       
          -#> 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing …
          -#> 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing …
          -#> 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing …
          -#> 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing …
          -#> 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed wing …
          -#> 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed wing …
          -#> # … with 336,770 more rows, and 6 more variables: manufacturer <chr>,
          -#> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
+#> year.x time_hour origin dest tailnum carrier year.y
+#> <int> <dttm> <chr> <chr> <chr> <chr> <int>
+#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999
+#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998
+#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990
+#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012
+#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991
+#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012
+#> # … with 336,770 more rows, and 7 more variables: type <chr>,
+#> # manufacturer <chr>, model <chr>, engines <int>, seats <int>, …

          Note that the year variables are disambiguated in the output with a suffix (year.x and year.y), which tells you whether the variable came from the x or y argument. You can override the default suffixes with the suffix argument.
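A sketch of overriding the suffixes (the names here are our own):

```r
flights2 |>
  left_join(planes, join_by(tailnum), suffix = c("_flights", "_planes"))
```

This yields year_flights and year_planes instead of year.x and year.y.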

          join_by(tailnum) is short for join_by(tailnum == tailnum). It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an equi-join. You’ll learn about non-equi-joins in #sec-non-equi-joins.

          @@ -344,30 +340,30 @@ Specifying join keys
          flights2 |> 
             left_join(airports, join_by(dest == faa))
           #> # A tibble: 336,776 × 13
          -#>    year time_hour           origin dest  tailnum carrier name       lat   lon
          -#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>    <dbl> <dbl>
          -#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George …  30.0 -95.3
          -#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George …  30.0 -95.3
          -#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami I…  25.8 -80.3
          -#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>      NA    NA  
          -#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfi…  33.6 -84.4
          -#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago…  42.0 -87.9
          -#> # … with 336,770 more rows, and 4 more variables: alt <dbl>, tz <dbl>,
          -#> #   dst <chr>, tzone <chr>
          +#>    year time_hour           origin dest  tailnum carrier name                
          +#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>               
          +#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George Bush Interco…
          +#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George Bush Interco…
          +#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami Intl          
          +#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>                
          +#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfield Jackson …
          +#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago Ohare Intl  
          +#> # … with 336,770 more rows, and 6 more variables: lat <dbl>, lon <dbl>,
          +#> #   alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
           
           flights2 |> 
             left_join(airports, join_by(origin == faa))
           #> # A tibble: 336,776 × 13
          -#>    year time_hour           origin dest  tailnum carrier name       lat   lon
          -#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>    <dbl> <dbl>
          -#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newark …  40.7 -74.2
          -#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La Guar…  40.8 -73.9
          -#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John F …  40.6 -73.8
          -#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John F …  40.6 -73.8
          -#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La Guar…  40.8 -73.9
          -#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newark …  40.7 -74.2
          -#> # … with 336,770 more rows, and 4 more variables: alt <dbl>, tz <dbl>,
          -#> #   dst <chr>, tzone <chr>
+#> year time_hour origin dest tailnum carrier name
+#> <int> <dttm> <chr> <chr> <chr> <chr> <chr>
+#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark Liberty Intl
+#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guardia
+#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F Kennedy Intl
+#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F Kennedy Intl
+#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guardia
+#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark Liberty Intl
+#> # … with 336,770 more rows, and 6 more variables: lat <dbl>, lon <dbl>,
+#> # alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

          In older code you might see a different way of specifying the join keys, using a character vector:
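For example (a sketch of the older by interface: an unnamed character vector for same-name keys, a named one when the key columns differ):

```r
# Equivalent to join_by(tailnum)
flights2 |>
  left_join(planes |> select(tailnum, type), by = "tailnum")

# Equivalent to join_by(dest == faa)
flights2 |>
  left_join(airports, by = c("dest" = "faa"))
```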

          • @@ -396,17 +392,17 @@ Filtering joins
            airports |> 
               semi_join(flights2, join_by(faa == dest))
             #> # A tibble: 101 × 8
            -#>   faa   name                               lat    lon   alt    tz dst   tzone
            -#>   <chr> <chr>                            <dbl>  <dbl> <dbl> <dbl> <chr> <chr>
            -#> 1 ABQ   Albuquerque International Sunpo…  35.0 -107.   5355    -7 A     Amer…
            -#> 2 ACK   Nantucket Mem                     41.3  -70.1    48    -5 A     Amer…
            -#> 3 ALB   Albany Intl                       42.7  -73.8   285    -5 A     Amer…
            -#> 4 ANC   Ted Stevens Anchorage Intl        61.2 -150.    152    -9 A     Amer…
            -#> 5 ATL   Hartsfield Jackson Atlanta Intl   33.6  -84.4  1026    -5 A     Amer…
            -#> 6 AUS   Austin Bergstrom Intl             30.2  -97.7   542    -6 A     Amer…
            +#>   faa   name                     lat    lon   alt    tz dst   tzone          
            +#>   <chr> <chr>                  <dbl>  <dbl> <dbl> <dbl> <chr> <chr>          
            +#> 1 ABQ   Albuquerque Internati…  35.0 -107.   5355    -7 A     America/Denver 
            +#> 2 ACK   Nantucket Mem           41.3  -70.1    48    -5 A     America/New_Yo…
            +#> 3 ALB   Albany Intl             42.7  -73.8   285    -5 A     America/New_Yo…
            +#> 4 ANC   Ted Stevens Anchorage…  61.2 -150.    152    -9 A     America/Anchor…
            +#> 5 ATL   Hartsfield Jackson At…  33.6  -84.4  1026    -5 A     America/New_Yo…
            +#> 6 AUS   Austin Bergstrom Intl   30.2  -97.7   542    -6 A     America/Chicago
             #> # … with 95 more rows
            -

            Anti-joins are the opposite: they return all rows in x that don’t have a match in y. They’re useful for finding missing values that are implicit in the data, the topic of #sec-missing-implicit. Implicitly missing values don’t show up as NAs but instead only exist as an absence. For example, we can find rows that as missing from airports by looking for flights that don’t have a matching destination airport:

            +

            Anti-joins are the opposite: they return all rows in x that don’t have a match in y. They’re useful for finding missing values that are implicit in the data, the topic of #sec-missing-implicit. Implicitly missing values don’t show up as NAs but instead only exist as an absence. For example, we can find rows that are missing from airports by looking for flights that don’t have a matching destination airport:

            flights2 |> 
               anti_join(airports, join_by(dest == faa)) |> 
            @@ -437,7 +433,7 @@ Filtering joins
             
          -
          +

          Exercises

          1. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

2. @@ -655,15 +651,15 @@ Allow multiple rows
 plane_flights
 #> # A tibble: 284,170 × 9
-#> tailnum type engines seats year time_hour origin dest carrier
-#> <chr> <chr> <int> <int> <int> <dttm> <chr> <chr> <chr>
-#> 1 N10156 Fixed… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV
-#> 2 N10156 Fixed… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV
-#> 3 N10156 Fixed… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV
-#> 4 N10156 Fixed… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV
-#> 5 N10156 Fixed… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV
-#> 6 N10156 Fixed… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV
-#> # … with 284,164 more rows
+#> tailnum type engines seats year time_hour origin
+#> <chr> <chr> <int> <int> <int> <dttm> <chr>
+#> 1 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 06:00:00 EWR
+#> 2 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 10:00:00 EWR
+#> 3 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 15:00:00 EWR
+#> 4 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 06:00:00 EWR
+#> 5 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 11:00:00 EWR
+#> 6 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 18:00:00 EWR
+#> # … with 284,164 more rows, and 2 more variables: dest <chr>, carrier <chr>
          @@ -814,19 +810,19 @@ Rolling joins

          Now imagine that you have a table of employee birthdays:

          employees <- tibble(
          -  name = wakefield::name(100),
          +  name = sample(babynames::babynames$name, 100),
             birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
           )
           employees
           #> # A tibble: 100 × 2
          -#>   name       birthday  
          -#>   <variable> <date>    
          -#> 1 Lindzy     2022-08-11
          -#> 2 Santania   2022-03-01
          -#> 3 Gardell    2022-03-04
          -#> 4 Cyrille    2022-11-15
          -#> 5 Kynli      2022-07-09
          -#> 6 Sever      2022-02-03
          +#>   name    birthday  
          +#>   <chr>   <date>    
          +#> 1 Case    2022-09-13
          +#> 2 Shonnie 2022-03-30
          +#> 3 Burnard 2022-01-10
          +#> 4 Omer    2022-11-25
          +#> 5 Hillel  2022-07-30
          +#> 6 Curlie  2022-12-11
           #> # … with 94 more rows

          And for each employee we want to find the first party date that comes after (or on) their birthday. We can express that with a rolling join:

          @@ -834,27 +830,22 @@ employees
          employees |> 
             left_join(parties, join_by(closest(birthday >= party)))
           #> # A tibble: 100 × 4
          -#>   name       birthday       q party     
          -#>   <variable> <date>     <int> <date>    
          -#> 1 Lindzy     2022-08-11     3 2022-07-11
          -#> 2 Santania   2022-03-01     1 2022-01-10
          -#> 3 Gardell    2022-03-04     1 2022-01-10
          -#> 4 Cyrille    2022-11-15     4 2022-10-03
          -#> 5 Kynli      2022-07-09     2 2022-04-04
          -#> 6 Sever      2022-02-03     1 2022-01-10
          +#>   name    birthday       q party     
          +#>   <chr>   <date>     <int> <date>    
          +#> 1 Case    2022-09-13     3 2022-07-11
          +#> 2 Shonnie 2022-03-30     1 2022-01-10
          +#> 3 Burnard 2022-01-10     1 2022-01-10
          +#> 4 Omer    2022-11-25     4 2022-10-03
          +#> 5 Hillel  2022-07-30     3 2022-07-11
          +#> 6 Curlie  2022-12-11     4 2022-10-03
           #> # … with 94 more rows

          There is, however, one problem with this approach: the folks with birthdays before January 10 don’t get a party:

          employees |> 
             anti_join(parties, join_by(closest(birthday >= party)))
          -#> # A tibble: 4 × 2
          -#>   name       birthday  
          -#>   <variable> <date>    
          -#> 1 Janeida    2022-01-04
          -#> 2 Aires      2022-01-07
          -#> 3 Mikalya    2022-01-06
          -#> 4 Carlynn    2022-01-08
+#> # A tibble: 0 × 2
+#> # … with 2 variables: name <chr>, birthday <date>

          To resolve that issue we’ll need to tackle the problem a different way, with overlap joins.

          @@ -910,19 +901,19 @@ parties
          employees |> 
             inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
           #> # A tibble: 100 × 6
          -#>   name       birthday       q party      start      end       
          -#>   <variable> <date>     <int> <date>     <date>     <date>    
          -#> 1 Lindzy     2022-08-11     3 2022-07-11 2022-07-11 2022-10-02
          -#> 2 Santania   2022-03-01     1 2022-01-10 2022-01-01 2022-04-03
          -#> 3 Gardell    2022-03-04     1 2022-01-10 2022-01-01 2022-04-03
          -#> 4 Cyrille    2022-11-15     4 2022-10-03 2022-10-03 2022-12-31
          -#> 5 Kynli      2022-07-09     2 2022-04-04 2022-04-04 2022-07-10
          -#> 6 Sever      2022-02-03     1 2022-01-10 2022-01-01 2022-04-03
          +#>   name    birthday       q party      start      end       
          +#>   <chr>   <date>     <int> <date>     <date>     <date>    
          +#> 1 Case    2022-09-13     3 2022-07-11 2022-07-11 2022-10-02
          +#> 2 Shonnie 2022-03-30     1 2022-01-10 2022-01-01 2022-04-03
          +#> 3 Burnard 2022-01-10     1 2022-01-10 2022-01-01 2022-04-03
          +#> 4 Omer    2022-11-25     4 2022-10-03 2022-10-03 2022-12-31
          +#> 5 Hillel  2022-07-30     3 2022-07-11 2022-07-11 2022-10-02
          +#> 6 Curlie  2022-12-11     4 2022-10-03 2022-10-03 2022-12-31
           #> # … with 94 more rows
          -
          +

          Exercises

          1. @@ -951,7 +942,7 @@ x |> full_join(y, by = "key", keep = TRUE)
          -
          +

          Summary

          In this chapter, you’ve learned how to use mutating and filtering joins to combine data from a pair of data frames. Along the way you learned how to identify keys, and the difference between primary and foreign keys. You also understand how joins work and how to figure out how many rows the output will have. Finally, you’ve gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.

diff --git a/oreilly/layers.html b/oreilly/layers.html
index bc6bf69..0f74fc9 100644
--- a/oreilly/layers.html
+++ b/oreilly/layers.html
@@ -1,28 +1,18 @@

          Layers

          -
          +

          Introduction

          In the #chp-data-visualize, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make any type of plot with ggplot2.

          In this chapter, you’ll expand on that foundation as you learn about the layered grammar of graphics. We’ll start with a deeper dive into aesthetic mappings, geometric objects, and facets. Then, you will learn about statistical transformations ggplot2 makes under the hood when creating a plot. These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or medians in a box plot. You will also learn about position adjustments, which modify how geoms are displayed in your plots. Finally, we’ll briefly introduce coordinate systems.

          We will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 as well as introduce you to packages that extend ggplot2.

          -
          +

          Prerequisites

          This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:

          -
          library(tidyverse)
          -#> ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
          -#> ✔ dplyr     1.0.99.9000     ✔ readr     2.1.3      
          -#> ✔ forcats   0.5.2.9000      ✔ stringr   1.5.0.9000 
          -#> ✔ ggplot2   3.4.0.9000      ✔ tibble    3.1.8      
          -#> ✔ lubridate 1.9.0           ✔ tidyr     1.2.1.9001 
          -#> ✔ purrr     1.0.1           
          -#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
          -#> ✖ dplyr::filter() masks stats::filter()
          -#> ✖ dplyr::lag()    masks stats::lag()
          -#> ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
          +
          library(tidyverse)
          @@ -37,15 +27,15 @@ Aesthetic mappings
          mpg
           #> # A tibble: 234 × 11
          -#>   manufacturer model displ  year   cyl trans    drv     cty   hwy fl    class
          -#>   <chr>        <chr> <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>
          -#> 1 audi         a4      1.8  1999     4 auto(l5) f        18    29 p     comp…
          -#> 2 audi         a4      1.8  1999     4 manual(… f        21    29 p     comp…
          -#> 3 audi         a4      2    2008     4 manual(… f        20    31 p     comp…
          -#> 4 audi         a4      2    2008     4 auto(av) f        21    30 p     comp…
          -#> 5 audi         a4      2.8  1999     6 auto(l5) f        16    26 p     comp…
          -#> 6 audi         a4      2.8  1999     6 manual(… f        18    26 p     comp…
          -#> # … with 228 more rows
+#> manufacturer model displ year cyl trans drv cty hwy fl
+#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
+#> 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p
+#> 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p
+#> 3 audi a4 2 2008 4 manual(m6) f 20 31 p
+#> 4 audi a4 2 2008 4 auto(av) f 21 30 p
+#> 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p
+#> 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p
+#> # … with 228 more rows, and 1 more variable: class <chr>

          Among the variables in mpg are:

          1. displ: A car’s engine size, in liters. A numerical variable.

          2. @@ -134,7 +124,7 @@ ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +

            So far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at https://ggplot2.tidyverse.org/articles/ggplot2-specs.html.

            The specific aesthetics you can use for a plot depend on the geom you use to represent the data. In the next section we dive deeper into geoms.

            -
            +

            Exercises

            1. Create a scatterplot of hwy vs. displ where the points are pink filled in triangles.

            2. @@ -285,7 +275,7 @@ ggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +

              The best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: https://ggplot2.tidyverse.org/reference. To learn more about any single geom, use the help (e.g. ?geom_smooth).

              -
              +

              Exercises

              1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

              2. @@ -361,7 +351,7 @@ Facets -
                +

                Exercises

                1. What happens if you facet on a continuous variable?

                2. @@ -502,7 +492,7 @@ ggplot(cut_frequencies, aes(x = cut, y = freq)) +

                ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. ?stat_bin.

                -
                +

                Exercises

                1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

                2. @@ -608,7 +598,7 @@ ggplot(diamonds, aes(x = cut, color = clarity)) +

                  Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): geom_jitter().
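For instance, these two sketches draw the same jittered scatterplot:

```r
library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_jitter()
```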

                  To learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.

                  -
                  +

                  Exercises

                  1. @@ -681,7 +671,7 @@ bar + coord_polar()
                  2. -
                    +

                    Exercises

                    1. Turn a stacked bar chart into a pie chart using coord_polar().

                    2. @@ -726,7 +716,7 @@ The layered grammar of graphics

                      If you’d like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “The Layered Grammar of Graphics”, the scientific paper that describes the theory of ggplot2 in detail.

                    -
                    +

                    Summary

In this chapter you learned about the layered grammar of graphics, starting with aesthetics and geometries to build a simple plot; facets for splitting the plot into subsets; statistics for understanding how geoms are calculated; position adjustments for controlling the fine details of position when geoms might otherwise overlap; and coordinate systems, which allow you to fundamentally change what x and y mean. One layer we have not yet touched on is theme, which we will introduce in #sec-themes.

diff --git a/oreilly/logicals.html b/oreilly/logicals.html
index 0b6dd83..5399dda 100644
--- a/oreilly/logicals.html
+++ b/oreilly/logicals.html
@@ -1,12 +1,12 @@

                    Logical vectors

                    -
                    +

                    Introduction

                    In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: TRUE, FALSE, and NA. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.

                    We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with if_else() and case_when(), two useful functions for making conditional changes powered by logical vectors.

                    -
                    +

                    Prerequisites

                    Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use mutate(), filter(), and friends to work with data frames. We’ll also continue to draw examples from the nycflights13 dataset.

                    @@ -56,9 +56,7 @@ Comparisons #> 5 2013 1 1 606 610 -4 837 845 #> 6 2013 1 1 607 607 0 858 915 #> # … with 172,280 more rows, and 11 more variables: arr_delay <dbl>, -#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, -#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, -#> # time_hour <dttm> +#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

                    It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with mutate():
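A sketch of what that looks like (the chapter's exact conditions are elided above, so these cutoffs are illustrative):

```r
library(dplyr)
library(nycflights13)

flights |>
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20
  ) |>
  filter(daytime & approx_ontime)
```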

                    @@ -151,17 +149,14 @@ x == y filter(dep_time == NA) #> # A tibble: 0 × 19 #> # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>, -#> # sched_dep_time <int>, dep_delay <dbl>, arr_time <int>, -#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>, -#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, -#> # hour <dbl>, minute <dbl>, time_hour <dttm> +#> # sched_dep_time <int>, dep_delay <dbl>, arr_time <int>, …

                    Instead we’ll need a new tool: is.na().

                    -is.na() +is.na()

                    is.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else:
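A small sketch across the three main vector types:

```r
is.na(c(TRUE, NA, FALSE))
#> [1] FALSE  TRUE FALSE

is.na(c(1, NA, 3))
#> [1] FALSE  TRUE FALSE

is.na(c("a", NA, "b"))
#> [1] FALSE  TRUE FALSE
```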

                    @@ -186,9 +181,7 @@ is.na(c("a", NA, "b")) #> 5 2013 1 2 NA 1540 NA NA 1747 #> 6 2013 1 2 NA 1620 NA NA 1746 #> # … with 8,249 more rows, and 11 more variables: arr_delay <dbl>, -#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, -#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, -#> # time_hour <dttm> +#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

                    is.na() can also be useful in arrange(). arrange() usually puts all the missing values at the end but you can override this default by first sorting by is.na():

                    @@ -205,9 +198,7 @@ is.na(c("a", NA, "b")) #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> # … with 836 more rows, and 11 more variables: arr_delay <dbl>, -#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, -#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, -#> # time_hour <dttm> +#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, … flights |> filter(month == 1, day == 1) |> @@ -222,14 +213,12 @@ flights |> #> 5 2013 1 1 517 515 2 830 819 #> 6 2013 1 1 533 529 4 850 830 #> # … with 836 more rows, and 11 more variables: arr_delay <dbl>, -#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, -#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, -#> # time_hour <dttm> +#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

                    We’ll come back to cover missing values in more depth in #chp-missing-values.

                    -
                    +

                    Exercises

                    1. How does dplyr::near() work? Type near to see the source code.
                    2. @@ -295,9 +284,7 @@ Order of operations #> 5 2013 1 1 554 600 -6 812 837 #> 6 2013 1 1 554 558 -4 740 728 #> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>, -#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, -#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, -#> # time_hour <dttm> +#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

This code doesn’t error, but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates month == 11, creating a logical vector, which we call nov. It then computes nov | 12. When you use a number with a logical operator, everything apart from 0 is converted to TRUE, so this is equivalent to nov | TRUE, which will always be TRUE, so every row will be selected:
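You can see the coercion in isolation (a two-line sketch):

```r
# Only 0 is treated as FALSE; every other number is TRUE
FALSE | 0
#> [1] FALSE
FALSE | 12
#> [1] TRUE
```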

                      @@ -322,7 +309,7 @@ Order of operations

                      -%in% +%in%

An easy way to avoid the problem of getting your ==s and |s in the right order is to use %in%. x %in% y returns a logical vector the same length as x that is TRUE whenever a value in x is anywhere in y.
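For example:

```r
1:12 %in% c(1, 5, 11)
#>  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
```

So flights in November or December can be selected with filter(month %in% c(11, 12)).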

                      @@ -357,13 +344,11 @@ c(1, 2, NA) %in% NA #> 5 2013 1 1 NA 1500 NA NA 1825 #> 6 2013 1 1 NA 600 NA NA 901 #> # … with 8,797 more rows, and 11 more variables: arr_delay <dbl>, -#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, -#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, -#> # time_hour <dttm> +#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …
                      -
                      +

                      Exercises

                      1. Find all flights where arr_delay is missing but dep_delay is not. Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.
                      2. @@ -496,7 +481,7 @@ Logical subsetting

                        Also note the difference in the group size: in the first chunk n() gives the number of delayed flights per day; in the second, n() gives the total number of flights.

                      -
                      +

                      Exercises

                      1. What will sum(is.na(x)) tell you? How about mean(is.na(x))?
                      2. @@ -511,7 +496,7 @@ Conditional transformations

                        -if_else() +if_else()

If you want to use one value when a condition is TRUE and another value when it’s FALSE, you can use dplyr::if_else(). (dplyr’s if_else() is very similar to base R’s ifelse(). There are two main advantages of if_else() over ifelse(): you can choose what should happen to missing values, and if_else() is much more likely to give you a meaningful error if your variables have incompatible types.) You’ll always use the first three arguments of if_else(). The first argument, condition, is a logical vector; the second, true, gives the output when the condition is true; and the third, false, gives the output if the condition is false.

                        Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:
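A sketch of that example, reusing the toy vector that the case_when() code further below defines:

```r
library(dplyr)

x <- c(-3:3, NA)
if_else(x > 0, "+ve", "-ve")
#> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" NA
```

Note that the NA input stays NA; the optional fourth argument, missing, lets you supply a replacement for it.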

                        @@ -547,12 +532,13 @@ if_else(is.na(x1), y1, x1)

                        -case_when() +case_when()

                        dplyr’s case_when() is inspired by SQL’s CASE statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like condition ~ output. condition must be a logical vector; when it’s TRUE, output will be used.

                        This means we could recreate our previous nested if_else() as follows:

                        -
                        case_when(
                        +
                        x <- c(-3:3, NA)
                        +case_when(
                           x == 0   ~ "0",
                           x < 0    ~ "-ve", 
                           x > 0    ~ "+ve",
                        @@ -582,7 +568,7 @@ if_else(is.na(x1), y1, x1)
                         
                        case_when(
                           x > 0 ~ "+ve",
                        -  x > 3 ~ "big"
                        +  x > 2 ~ "big"
                         )
                         #> [1] NA    NA    NA    NA    "+ve" "+ve" "+ve" NA
                        @@ -595,8 +581,8 @@ if_else(is.na(x1), y1, x1) arr_delay < -30 ~ "very early", arr_delay < -15 ~ "early", abs(arr_delay) <= 15 ~ "on time", - arr_delay > 15 ~ "late", - arr_delay > 60 ~ "very late", + arr_delay < 60 ~ "late", + arr_delay < Inf ~ "very late", ), .keep = "used" ) @@ -611,6 +597,7 @@ if_else(is.na(x1), y1, x1) #> 6 12 on time #> # … with 336,770 more rows
                        +

                        Be wary when writing this sort of complex case_when() statement; my first two attempts used a mix of < and > and I kept accidentally creating overlapping conditions.

                        @@ -639,7 +626,7 @@ case_when(
                        -
                        +

                        Summary

                        The definition of a logical vector is simple because each value must be either TRUE, FALSE, or NA. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with >, <, <=, =>, ==, !=, and is.na(), how to combine them with !, &, and |, and how to summarize them with any(), all(), sum(), and mean(). You also learned the powerful if_else() and case_when() functions that allow you to return values depending on the value of a logical vector.

diff --git a/oreilly/missing-values.html b/oreilly/missing-values.html
index 360f92f..4427f78 100644
--- a/oreilly/missing-values.html
+++ b/oreilly/missing-values.html
@@ -1,12 +1,12 @@

                        Missing values

                        -
                        +

                        Introduction

                        You’ve already learned the basics of missing values earlier in the book. You first saw them in #chp-data-visualize where they resulted in a warning when making a plot as well as in #sec-summarize where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in #sec-na-comparison. Now we’ll come back to them in more depth, so you can learn more of the details.

We’ll start by discussing some general tools for working with missing values recorded as NAs. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.

                        -
                        +

                        Prerequisites

                        The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.

                        @@ -173,11 +173,11 @@ Complete

                        In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what complete() does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with dplyr::full_join().
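A minimal sketch of that manual route, with a hypothetical stocks table of our own:

```r
library(dplyr)
library(tidyr)

# Hypothetical data where three 2021 quarters are implicitly missing
stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021),
  qtr   = c(1, 2, 3, 4, 2),
  price = c(1.88, 0.59, 0.35, NA, 0.92)
)

# Build every row that should exist, then join the data back in;
# the absent quarters come back with NA prices
all_rows <- expand_grid(year = 2020:2021, qtr = 1:4)
all_rows |>
  full_join(stocks, join_by(year, qtr))
```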

                        -
                        +

                        Joins

This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in #chp-joins, but we wanted to quickly mention them here since you can often only know that values are missing from one dataset when you compare it to another.

                        -

                        dplyr::anti_join(x, y) is a particularly useful tool here because it selects only the rows in x that don’t have a match in y. For example, we can use two anti_join()s reveal to reveal that we’re missing information for four airports and 722 planes mentioned in flights:

                        +

                        dplyr::anti_join(x, y) is a particularly useful tool here because it selects only the rows in x that don’t have a match in y. For example, we can use two anti_join()s to reveal that we’re missing information for four airports and 722 planes mentioned in flights:

                        library(nycflights13)
                         
                        @@ -210,7 +210,7 @@ flights |>
                         
                        -
                        +

                        Exercises

                        1. Can you find any relationship between the carrier and the rows that appear to be missing from planes?
                        2. @@ -323,7 +323,7 @@ length(x2)

                          The main drawback of this approach is that you get an NA for the count, even though you know that it should be zero.

                        -
                        +

                        Summary

Missing values are weird! Sometimes they’re recorded as an explicit NA but other times you only notice them by their absence. This chapter has given you some tools for working with explicit missing values, tools for uncovering implicit missing values, and discussed some of the ways that implicit missing values can become explicit, and vice versa.

diff --git a/oreilly/numbers.html b/oreilly/numbers.html
index ee34f92..d833353 100644
--- a/oreilly/numbers.html
+++ b/oreilly/numbers.html
@@ -1,22 +1,24 @@

                        Numbers

                        -
                        +

                        Introduction

                        Numeric vectors are the backbone of data science, and you’ve already used them a bunch of times earlier in the book. Now it’s time to systematically survey what you can do with them in R, ensuring that you’re well situated to tackle any future problem involving numeric vectors.

We’ll start by giving you a couple of tools for making numbers from strings, then go into a little more detail on count(). Then we’ll dive into various numeric transformations that pair well with mutate(), including more general transformations that can be applied to other types of vectors but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with summarize() and show you how they can also be used with mutate().

                        -
                        +

                        Prerequisites

                        -
                        +
                        +
                        -
                        +

                        This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live on the edge, you can get the dev versions with devtools::install_github("tidyverse/dplyr").

                        -

                        This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live on the edge, you can get the dev versions with devtools::install_github("tidyverse/dplyr").

                        +
                        +

                        This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like mutate() and filter(). Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with c() and tribble().

                        @@ -109,9 +111,7 @@ Counts
                        flights |> 
                           group_by(dest) |> 
                        -  summarize(
                        -    carriers = n_distinct(carrier)
                        -  ) |> 
                        +  summarize(carriers = n_distinct(carrier)) |> 
                           arrange(desc(carriers))
                         #> # A tibble: 105 × 2
                         #>   dest  carriers
                        @@ -144,17 +144,7 @@ Counts
                         

                        Weighted counts are a common problem so count() has a wt argument that does the same thing:

                        -
                        flights |> count(tailnum, wt = distance)
                        -#> # A tibble: 4,044 × 2
                        -#>   tailnum      n
                        -#>   <chr>    <dbl>
                        -#> 1 D942DN    3418
                        -#> 2 N0EGMQ  250866
                        -#> 3 N10156  115966
                        -#> 4 N102UW   25722
                        -#> 5 N103US   24619
                        -#> 6 N104UW   25157
                        -#> # … with 4,038 more rows
                        +
                        flights |> count(tailnum, wt = distance)
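The wt argument is shorthand for a grouped sum; a sketch of the equivalence:

```r
# Same result as flights |> count(tailnum, wt = distance)
flights |>
  group_by(tailnum) |>
  summarize(n = sum(distance))
```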
                      3. @@ -176,7 +166,7 @@ Counts
                      4. -
                        +

                        Exercises

                        1. How can you use count() to count the number rows with a missing value for a given variable?
                        2. @@ -228,9 +218,7 @@ x * c(1, 2, 3) #> 5 2013 1 1 557 600 -3 838 846 #> 6 2013 1 1 558 600 -2 849 851 #> # … with 25,971 more rows, and 11 more variables: arr_delay <dbl>, -#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, -#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, -#> # time_hour <dttm> +#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

The code runs without error, but it doesn’t return what you want. Because of the recycling rules, it finds flights in odd-numbered rows that departed in January and flights in even-numbered rows that departed in February. And unfortunately there’s no warning, because flights has an even number of rows.

                      To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn’t help here, or in many other cases, because the key computation is performed by the base R function ==, not filter().
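Here’s what base R’s recycling looks like on its own, with toy vectors of our own choosing:

x <- c(1, 2, 3, 4)
x * c(1, 2)     # c(1, 2) is recycled to c(1, 2, 1, 2)
#> [1] 1 4 3 8
x == c(1, 2)    # the same recycling happens with ==, silently
#> [1]  TRUE  TRUE FALSE FALSE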

                      @@ -476,7 +464,7 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)

                    Exercises

                    1. Explain in words what each line of the code used to generate #fig-prop-cancelled does.

                    2. @@ -671,7 +659,7 @@ df

                    Exercises

                    1. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

                    2. @@ -718,10 +706,8 @@ Center .groups = "drop" ) |> ggplot(aes(x = mean, y = median)) + - geom_abline(slope = 1, intercept = 0, color = "white", size = 2) + - geom_point() -#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0. -#> ℹ Please use `linewidth` instead. + geom_abline(slope = 1, intercept = 0, color = "white", linewidth = 2) + + geom_point()

                      All points fall below a 45° line, meaning that the median delay is always less than the mean delay. Most points are clustered in a dense region of mean [0, 20] and median [0, 5]. As the mean delay increases, the spread of the median also increases. There are two outlying points with mean ~60, median ~50, and mean ~85, median ~55.

                      @@ -875,15 +861,13 @@ Positions #> 5 2013 1 2 42 2359 43 518 442 #> 6 2013 1 2 458 500 -2 703 650 #> # … with 1,189 more rows, and 12 more variables: arr_delay <dbl>, -#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, -#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, -#> # time_hour <dttm>, r <int> +#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, …

                    -Withmutate() +With mutate()

As the names suggest, the summary functions are typically paired with summarize(). However, because of the recycling rules we discussed in #sec-recycling they can also be usefully paired with mutate(), particularly when you want to do some sort of group standardization (see the sketch after this list). For example:

@@ -894,7 +878,7 @@ With mutate()
• x / first(x) computes an index based on the first observation.
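Here’s a hedged sketch of that kind of group standardization; the grouping and variables are our own choices, following the chapter’s nycflights13 examples:

flights |> 
  group_by(dest) |> 
  mutate(
    # x - mean(x): how much each flight's delay differs from its destination's mean
    delay_vs_dest = arr_delay - mean(arr_delay, na.rm = TRUE)
  )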

                    Exercises

@@ -910,7 +894,7 @@ Exercises

                    Summary

                    You’re already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R. You’ve also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors like ranks and offsets. Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.

                    diff --git a/oreilly/preface-2e.html b/oreilly/preface-2e.html index 52ccb0a..70d74a5 100644 --- a/oreilly/preface-2e.html +++ b/oreilly/preface-2e.html @@ -1,9 +1,9 @@

                    Preface to the second edition

                    Welcome to the second edition of “R for Data Science”! This is a major reworking of the first edition, removing material we no longer think is useful, adding material we wish we included in the first edition, and generally updating the text and code to reflect changes in best practices. We’re also very excited to welcome a new co-author: Mine Çetinkaya-Rundel, a noted data science educator and one of our colleagues at Posit (the company formerly known as RStudio).

                    A brief summary of the biggest changes follows:

                    • The first part of the book has been renamed to “Whole game”. The goal of this section is to give you the rough details of the “whole game” of data science before we dive into the details.

                    • -
                    • The second part of the book is “Visualize”. This part gives data visualization tools and best practices a more thorough coverage compared to the first edition.

                    • -
                    • The third part of the book is now called “Transform” and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room.

                    • -
                    • The fourth part of the book is called “Import”. It’s a new set of chapters that goes beyond reading flat text files to now embrace working with spreadsheets, getting data out of databases, working with big data, rectangling hierarchical data, and scraping data from web sites.

                    • -
                    • The “Program” part continues, but has been rewritten from top-to-bottom to focus on the most important parts of function writing and iteration. Function writing now includes sections on how to wrap tidyverse functions (dealing with the challenges of tidy evaluation), since this has become much easier over the last few years. We’ve added a new chapter on important Base R functions that you’re likely to see when reading R code found in the wild.

                    • +
                    • The second part of the book is “Visualize”. This part gives data visualization tools and best practices a more thorough coverage compared to the first edition. The best place to get all the details is still the ggplot2 book, but now R4DS covers more of the most important techniques.

                    • +
                    • The third part of the book is now called “Transform” and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room to cover all the details.

                    • +
                    • The fourth part of the book is called “Import”. It’s a new set of chapters that goes beyond reading flat text files to working with spreadsheets, getting data out of databases, working with big data, rectangling hierarchical data, and scraping data from web sites.

                    • +
                    • The “Program” part remains, but has been rewritten from top-to-bottom to focus on the most important parts of function writing and iteration. Function writing now includes details on how to wrap tidyverse functions (dealing with the challenges of tidy evaluation), since this has become much easier and more important over the last few years. We’ve added a new chapter on important base R functions that you’re likely to see in wild-caught R code.

                    • The modeling part has been removed. We never had enough room to fully do modelling justice, and there are now much better resources available. We generally recommend using the tidymodels packages and reading Tidy Modeling with R by Max Kuhn and Julia Silge.

                    • -
                    • The communicate part continues as well, but features Quarto instead of R Markdown as the tool of choice for authoring reproducible computational documents.

                    • -

                    Other changes include switching from magrittr’s pipe (%>%) to the base pipe (|>) and switching the book’s source from RMarkdown to Quarto.

                    +
• The communicate part remains, but has been thoroughly updated to feature Quarto instead of R Markdown. This edition of the book has been written in Quarto, and it’s clearly the tool of the future.

                    diff --git a/oreilly/program.html b/oreilly/program.html index 2092c46..925bb43 100644 --- a/oreilly/program.html +++ b/oreilly/program.html @@ -3,16 +3,10 @@

                    Our model of the data science process with program (import, tidy, transform, visualize, model, and communicate, i.e. everything) highlighted in blue.

                    -
                    Figure 1: Programming is the water in which all other components of the data science process swims.
                    +
                    Figure 1: Programming is the water in which all the other components swim.
                    -

                    Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you’re not working with other people, you’ll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.

                    Writing code is similar in many ways to writing prose. One parallel which we find particularly useful is that in both cases rewriting is the key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it’s often worth looking at your code and thinking about whether or not it’s obvious what you’ve done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn’t mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely your first attempt will be clear.)

In the following three chapters, you’ll learn skills that will improve your programming:

                    1. Copy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. Instead, in #chp-functions, you’ll learn how to write functions which let you extract out repeated code so that it can be easily reused.


                      2. Functions extract out repeated code, but you often need to repeat the same actions on different inputs. You need tools for iteration that let you do similar things again and again. These tools include for loops and functional programming, which you’ll learn about in #chp-iteration.

                      3. -
                      4. As you read more code written by others, you’ll see more code that doesn’t use the tidyverse. In #chp-base-R, you’ll learn some of the most important base R functions that you’ll see in the wild. These functions tend to be designed to use individual vectors, rather than data frames, often making them a good fit for your programming needs.

                      5. -
                      -

                      Learning more

                      -

                      The goal of these chapters is to teach you the minimum about programming that you need to practice data science. Once you have mastered the material in this book, we strongly believe you should continue to invest in your programming skills. Learning more about programming is a long-term investment: it won’t pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.

                      -

                      To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:

                      -
• Hands on Programming with R, by Garrett Grolemund. This is an introduction to R as a programming language and is a great place to start if R is your first programming language. It covers similar material to these chapters, but with a different style and different motivating examples (based in the casino). It’s a useful complement if you find that these four chapters go by too quickly.

                      • -
                      • Advanced R by Hadley Wickham. This dives into the details of R the programming language. This is a great place to start if you have existing programming experience. It’s also a great next step once you’ve internalized the ideas in these chapters.

                      • -
                      +
                    3. As you read more code written by others, you’ll see more code that doesn’t use the tidyverse. In #chp-base-R, you’ll learn some of the most important base R functions that you’ll see in the wild.

                    4. +

The goal of these chapters is to teach you the minimum about programming that you need for data science. Once you have mastered the material here, we strongly recommend that you continue to invest in your programming skills. We’ve written two books that you might find helpful. Hands on Programming with R, by Garrett Grolemund, is an introduction to R as a programming language and is a great place to start if R is your first programming language. Advanced R by Hadley Wickham dives into the details of R the programming language; it’s a great place to start if you have existing programming experience and a great next step once you’ve internalized the ideas in these chapters.

                    diff --git a/oreilly/quarto-formats.html b/oreilly/quarto-formats.html index 2a1889b..51835f9 100644 --- a/oreilly/quarto-formats.html +++ b/oreilly/quarto-formats.html @@ -1,6 +1,6 @@

                    Quarto formats

                    -
                    +

                    Introduction

                    So far, you’ve seen Quarto used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.
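Switching output types is mostly a matter of changing the format field in the YAML header; a minimal sketch (the title is ours):

---
title: "Diamond sizes"
format: docx
---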

                    @@ -268,7 +268,7 @@ Other formats

                    See https://quarto.org/docs/output-formats/all-formats.html for a list of even more formats.

                    -
                    +

                    Learning more

                    To learn more about effective communication in these different formats, we recommend the following resources:

                    diff --git a/oreilly/quarto.html b/oreilly/quarto.html index 22bc412..33d5cc3 100644 --- a/oreilly/quarto.html +++ b/oreilly/quarto.html @@ -1,6 +1,6 @@

                    Quarto

                    -
                    +

                    Introduction

                    Quarto provides a unified authoring framework for data science, combining your code, its results, and your prose. Quarto documents are fully reproducible and support dozens of output formats, like PDFs, Word files, presentations, and more.

                    @@ -11,7 +11,7 @@ Introduction

                  Quarto is a command line interface tool, not an R package. This means that help is, by-and-large, not available through ?. Instead, as you work through this chapter, and use Quarto in the future, you should refer to the Quarto documentation page at https://quarto.org for help.

If you’re an R Markdown user, you might be thinking “Quarto sounds a lot like R Markdown”. You’re not wrong! Quarto unifies the functionality of many packages from the R Markdown ecosystem (rmarkdown, bookdown, distill, xaringan, etc.) into a single consistent system and extends it with native support for multiple programming languages like Python and Julia in addition to R. In a way, Quarto reflects everything that was learned from expanding and supporting the R Markdown ecosystem over a decade.

                  -
                  +

                  Prerequisites

                  You need the Quarto command line interface (Quarto CLI), but you don’t need to explicitly install it or load it, as RStudio automatically does both when needed.

                  @@ -84,7 +84,7 @@ smaller |>

                  To get started with your own .qmd file, select File > New File > Quarto Document… in the menu bar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of Quarto work.

The following sections dive into the three components of a Quarto document in more detail: the markdown text, the code chunks, and the YAML header.
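To make that concrete, here’s a minimal sketch of a .qmd file, loosely based on the chapter’s diamond-sizes example (the exact contents are ours):

---
title: "Diamond sizes"
format: html
---

Some markdown text, mixed with code chunks:

```{r}
#| label: smaller-diamonds
library(tidyverse)
smaller <- diamonds |> filter(carat <= 2.5)
```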


                  Exercises

1. Create a new Quarto document using File > New File > Quarto Document. Read the instructions. Practice running the chunks individually. Then render the document by clicking the appropriate button and then by using the appropriate keyboard shortcut. Verify that you can modify the code, re-run it, and see modified output.

                  2. @@ -106,7 +106,7 @@ Visual editor

                    The visual editor has many more features that we haven’t enumerated here that you might find useful as you gain experience authoring with it.

                    Most importantly, while the visual editor displays your content with formatting, under the hood, it saves your content in plain Markdown and you can switch back and forth between the visual and source editors to view and edit your content using either tool.

                    -
                    +

                    Exercises

                    @@ -165,7 +165,7 @@ Source editor

                    The best way to learn these is simply to try them out. It will take a few days, but soon they will become second nature, and you won’t need to think about them. If you forget, you can get to a handy reference sheet with Help > Markdown Quick Reference.

                    -
                    +

                    Exercises

                    1. Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.

                    2. @@ -341,7 +341,7 @@ comma(.12358124331)
                    -
                    +

                    Exercises

                    1. Add a section that explores how diamond sizes vary by cut, color, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting echo: false on each chunk, set a global option.

                    2. @@ -394,14 +394,14 @@ Other important options

It’s a good idea to name code chunks that produce figures, even if you don’t routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse them in other circumstances (e.g. if you want to quickly drop a single plot into an email or a tweet).

                    -
                    +

                    Exercises

                    -
                    +

                    Tables

Similar to figures, you can include two types of tables in a Quarto document. They can be markdown tables that you create directly in your Quarto document (using the Insert Table menu) or they can be tables generated as a result of a code chunk. In this section we will focus on the latter, tables generated via computation.
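Before the details, a minimal sketch of the computed kind with knitr::kable() (the dataset choice is ours):

mtcars[1:5, ] |> 
  knitr::kable()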

                    @@ -499,7 +499,7 @@ Tables

                    Read the documentation for ?knitr::kable to see the other ways in which you can customize the table. For even deeper customization, consider the gt, huxtable, reactable, kableExtra, xtable, stargazer, pander, tables, and ascii packages. Each provides a set of tools for returning formatted tables from R code.

                    There is also a rich set of options for controlling how figures are embedded. You’ll learn about these in ?sec-graphics-communication.

                    -
                    +

                    Exercises

                    @@ -672,7 +672,7 @@ csl: apa.csl
                    -
                    +

                    Learning more

Quarto is still relatively young and growing rapidly. The best place to stay on top of innovations is the official Quarto website: https://quarto.org.

                    diff --git a/oreilly/rectangling.html b/oreilly/rectangling.html index bc2c942..92ec8a5 100644 --- a/oreilly/rectangling.html +++ b/oreilly/rectangling.html @@ -1,12 +1,12 @@

                    Hierarchical data

                    -
                    +

                    Introduction

                    In this chapter, you’ll learn the art of data rectangling, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.

                    To learn about rectangling, you’ll need to first learn about lists, the data structure that makes hierarchical data possible. Then you’ll learn about two crucial tidyr functions: tidyr::unnest_longer() and tidyr::unnest_wider(). We’ll then show you a few case studies, applying these simple functions again and again to solve real problems. We’ll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.

                    -
                    +

                    Prerequisites

                    In this chapter, we’ll use many functions from tidyr, a core member of the tidyverse. We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.

                    @@ -18,7 +18,7 @@ library(jsonlite)
                    -
                    +

                    Lists

                    So far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is of the same data type. If you want to store elements of different types in the same vector, you’ll need a list, which you create with list():
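A quick sketch with a toy list of our own:

x <- list(1:3, "a", TRUE)
str(x)
#> List of 3
#>  $ : int [1:3] 1 2 3
#>  $ : chr "a"
#>  $ : logi TRUE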

                    @@ -174,13 +174,19 @@ df

                    Similarly, if you View() a data frame in RStudio, you’ll get the standard tabular view, which doesn’t allow you to selectively expand list columns. To explore those fields you’ll need to pull() and view, e.g. df |> pull(z) |> View().

Base R

                    It’s possible to put a list in a column of a data.frame, but it’s a lot fiddlier because data.frame() treats a list as a list of columns:

                    data.frame(x = list(1:3, 3:5))
                     #>   x.1.3 x.3.5
                     #> 1     1     3
                     #> 2     2     4
                     #> 3     3     5
You can force data.frame() to treat a list as a list of rows by wrapping it in I(), but the result doesn’t print particularly well:
                    data.frame(
                       x = I(list(1:2, 3:5)), 
                       y = c("1, 2", "3, 4, 5")
                    @@ -188,7 +194,10 @@ Base R
                     #>         x       y
                     #> 1    1, 2    1, 2
                     #> 2 3, 4, 5 3, 4, 5
It’s easier to use list-columns with tibbles because tibble() treats lists like vectors and the print method has been designed with lists in mind.
                    @@ -220,7 +229,7 @@ df2 <- tribble(

                    -unnest_wider() +unnest_wider()

                    When each row has the same number of elements with the same names, like df1, it’s natural to put each component into its own column with unnest_wider():
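Here’s a hedged sketch, re-creating the chapter’s toy df1:

df1 <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32),
)
df1 |> unnest_wider(y)
#> # A tibble: 3 × 3
#>       x     a     b
#>   <dbl> <dbl> <dbl>
#> 1     1    11    12
#> 2     2    21    22
#> 3     3    31    32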

                    @@ -260,7 +269,7 @@ df2 <- tribble(

                    -unnest_longer() +unnest_longer()

                    When each row contains an unnamed list, it’s most natural to put each element into its own row with unnest_longer():
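Again a hedged sketch, re-creating the chapter’s toy df2:

df2 <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)
df2 |> unnest_longer(y)
#> # A tibble: 6 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1    11
#> 2     1    12
#> 3     1    13
#> 4     2    21
#> 5     3    31
#> 6     3    32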

                    @@ -387,7 +396,7 @@ Inconsistent types

                    You’ll learn more about map_lgl() in #chp-iteration.

                    -
                    +

                    Other functions

                    tidyr has a few other useful rectangling functions that we’re not going to cover in this book:

                    @@ -400,7 +409,7 @@ Other functions

                    These functions are good to know about as you might encounter them when reading other people’s code or tackling rarer rectangling challenges yourself.

                    -
                    +

                    Exercises

                    1. @@ -460,51 +469,26 @@ repos unnest_longer(json) |> unnest_wider(json) #> # A tibble: 176 × 68 -#> id name full_name owner private html_url description fork -#> <int> <chr> <chr> <list> <lgl> <chr> <chr> <lgl> -#> 1 61160198 after gaborcsa… <named list> FALSE https:/… Run Code i… FALSE -#> 2 40500181 argufy gaborcsa… <named list> FALSE https:/… Declarativ… FALSE -#> 3 36442442 ask gaborcsa… <named list> FALSE https:/… Friendly C… FALSE -#> 4 34924886 baseimp… gaborcsa… <named list> FALSE https:/… Do we get … FALSE -#> 5 61620661 citest gaborcsa… <named list> FALSE https:/… Test R pac… TRUE -#> 6 33907457 clisymb… gaborcsa… <named list> FALSE https:/… Unicode sy… FALSE -#> # … with 170 more rows, and 60 more variables: url <chr>, forks_url <chr>, -#> # keys_url <chr>, collaborators_url <chr>, teams_url <chr>, -#> # hooks_url <chr>, issue_events_url <chr>, events_url <chr>, -#> # assignees_url <chr>, branches_url <chr>, tags_url <chr>, -#> # blobs_url <chr>, git_tags_url <chr>, git_refs_url <chr>, -#> # trees_url <chr>, statuses_url <chr>, languages_url <chr>, -#> # stargazers_url <chr>, contributors_url <chr>, subscribers_url <chr>, … +#> id name full_name owner private html_url +#> <int> <chr> <chr> <list> <lgl> <chr> +#> 1 61160198 after gaborcsardi/after <named list> FALSE https://github… +#> 2 40500181 argufy gaborcsardi/argu… <named list> FALSE https://github… +#> 3 36442442 ask gaborcsardi/ask <named list> FALSE https://github… +#> 4 34924886 baseimports gaborcsardi/base… <named list> FALSE https://github… +#> 5 61620661 citest gaborcsardi/cite… <named list> FALSE https://github… +#> 6 33907457 clisymbols gaborcsardi/clis… <named list> FALSE https://github… +#> # … with 170 more rows, and 62 more variables: description <chr>, +#> # fork <lgl>, url <chr>, forks_url <chr>, keys_url <chr>, …
                    -

                    This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with names():

                    +

                    This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with names(); and here we look at the first 10:

                    repos |> 
                       unnest_longer(json) |> 
                       unnest_wider(json) |> 
                    -  names()
                    -#>  [1] "id"                "name"              "full_name"        
                    -#>  [4] "owner"             "private"           "html_url"         
                    -#>  [7] "description"       "fork"              "url"              
                    -#> [10] "forks_url"         "keys_url"          "collaborators_url"
                    -#> [13] "teams_url"         "hooks_url"         "issue_events_url" 
                    -#> [16] "events_url"        "assignees_url"     "branches_url"     
                    -#> [19] "tags_url"          "blobs_url"         "git_tags_url"     
                    -#> [22] "git_refs_url"      "trees_url"         "statuses_url"     
                    -#> [25] "languages_url"     "stargazers_url"    "contributors_url" 
                    -#> [28] "subscribers_url"   "subscription_url"  "commits_url"      
                    -#> [31] "git_commits_url"   "comments_url"      "issue_comment_url"
                    -#> [34] "contents_url"      "compare_url"       "merges_url"       
                    -#> [37] "archive_url"       "downloads_url"     "issues_url"       
                    -#> [40] "pulls_url"         "milestones_url"    "notifications_url"
                    -#> [43] "labels_url"        "releases_url"      "deployments_url"  
                    -#> [46] "created_at"        "updated_at"        "pushed_at"        
                    -#> [49] "git_url"           "ssh_url"           "clone_url"        
                    -#> [52] "svn_url"           "homepage"          "size"             
                    -#> [55] "stargazers_count"  "watchers_count"    "language"         
                    -#> [58] "has_issues"        "has_downloads"     "has_wiki"         
                    -#> [61] "has_pages"         "forks_count"       "mirror_url"       
                    -#> [64] "open_issues_count" "forks"             "open_issues"      
                    -#> [67] "watchers"          "default_branch"
                    + names() |> + head(10) +#> [1] "id" "name" "full_name" "owner" "private" +#> [6] "html_url" "description" "fork" "url" "forks_url"

                    Let’s select a few that look interesting:

                    @@ -523,7 +507,7 @@ repos #> 6 33907457 gaborcsardi/clisymbols <named list [17]> Unicode symbols for CLI… #> # … with 170 more rows
                    -

                    You can use this to work back to understand how gh_repos was strucured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.

                    +

                    You can use this to work back to understand how gh_repos was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.

                    owner is another list-column, and since it contains a named list, we can use unnest_wider() to get at the values:

                    repos |> 
                    @@ -531,11 +515,13 @@ repos
                       unnest_wider(json) |> 
                       select(id, full_name, owner, description) |> 
                       unnest_wider(owner)
                    -#> Error in `unpack()` at ]8;line = 121:col = 2;file:///Users/hadleywickham/Documents/tidy-data/tidyr/R/unnest-wider.Rtidyr/R/unnest-wider.R:121:2]8;;:
                    -#> ! Names must be unique.
                    +#> Error in `unnest_wider()`:
                    +#> ! Can't duplicate names between the affected columns and the original
                    +#>   data.
                     #> ✖ These names are duplicated:
                    -#>   * "id" at locations 1 and 4.
                    -#> ℹ Use argument `names_repair` to specify repair strategy.
                    +#> ℹ `id`, from `owner`. +#> ℹ Use `names_sep` to disambiguate using the column name. +#> ℹ Or use `names_repair` to specify a repair strategy.

                    Uh oh, this list column also contains an id column and we can’t have two id columns in the same data frame. Rather than following the advice to use names_repair (which would also work), we’ll instead use names_sep:

                    @@ -546,21 +532,16 @@ repos select(id, full_name, owner, description) |> unnest_wider(owner, names_sep = "_") #> # A tibble: 176 × 20 -#> id full_name owner_login owner_id owner_avatar_url owner_gravatar_id -#> <int> <chr> <chr> <int> <chr> <chr> -#> 1 61160198 gaborcsar… gaborcsardi 660288 https://avatars… "" -#> 2 40500181 gaborcsar… gaborcsardi 660288 https://avatars… "" -#> 3 36442442 gaborcsar… gaborcsardi 660288 https://avatars… "" -#> 4 34924886 gaborcsar… gaborcsardi 660288 https://avatars… "" -#> 5 61620661 gaborcsar… gaborcsardi 660288 https://avatars… "" -#> 6 33907457 gaborcsar… gaborcsardi 660288 https://avatars… "" -#> # … with 170 more rows, and 14 more variables: owner_url <chr>, -#> # owner_html_url <chr>, owner_followers_url <chr>, -#> # owner_following_url <chr>, owner_gists_url <chr>, -#> # owner_starred_url <chr>, owner_subscriptions_url <chr>, -#> # owner_organizations_url <chr>, owner_repos_url <chr>, -#> # owner_events_url <chr>, owner_received_events_url <chr>, -#> # owner_type <chr>, owner_site_admin <lgl>, description <chr> +#> id full_name owner_login owner_id owner_avatar_url +#> <int> <chr> <chr> <int> <chr> +#> 1 61160198 gaborcsardi/after gaborcsardi 660288 https://avatars.gith… +#> 2 40500181 gaborcsardi/argufy gaborcsardi 660288 https://avatars.gith… +#> 3 36442442 gaborcsardi/ask gaborcsardi 660288 https://avatars.gith… +#> 4 34924886 gaborcsardi/baseimports gaborcsardi 660288 https://avatars.gith… +#> 5 61620661 gaborcsardi/citest gaborcsardi 660288 https://avatars.gith… +#> 6 33907457 gaborcsardi/clisymbols gaborcsardi 660288 https://avatars.gith… +#> # … with 170 more rows, and 15 more variables: owner_gravatar_id <chr>, +#> # owner_url <chr>, owner_html_url <chr>, owner_followers_url <chr>, …

                    This gives another wide dataset, but you can see that owner appears to contain a lot of additional data about the person who “owns” the repository.

                    @@ -588,17 +569,16 @@ chars
                    chars |> 
                       unnest_wider(json)
                     #> # A tibble: 30 × 18
                    -#>   url         id name  gender culture born  died  alive titles aliases father
                    -#>   <chr>    <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
                    -#> 1 https:/…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
                    -#> 2 https:/…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
                    -#> 3 https:/…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
                    -#> 4 https:/…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
                    -#> 5 https:/…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
                    -#> 6 https:/…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
                    -#> # … with 24 more rows, and 7 more variables: mother <chr>, spouse <chr>,
                    -#> #   allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
                    -#> #   playedBy <list>
                    +#> url id name gender culture born +#> <chr> <int> <chr> <chr> <chr> <chr> +#> 1 https://www.anapio… 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or … +#> 2 https://www.anapio… 1052 Tyrion Lannist… Male "" "In 273 AC, at… +#> 3 https://www.anapio… 1074 Victarion Grey… Male "Ironborn" "In 268 AC or … +#> 4 https://www.anapio… 1109 Will Male "" "" +#> 5 https://www.anapio… 1166 Areo Hotah Male "Norvoshi" "In 257 AC or … +#> 6 https://www.anapio… 1267 Chett Male "" "At Hag's Mire" +#> # … with 24 more rows, and 12 more variables: died <chr>, alive <lgl>, +#> # titles <list>, aliases <list>, father <chr>, mother <chr>, …

                    And selecting a few columns to make it easier to read:

                    @@ -607,15 +587,15 @@ chars select(id, name, gender, culture, born, died, alive) characters #> # A tibble: 30 × 7 -#> id name gender culture born died alive -#> <int> <chr> <chr> <chr> <chr> <chr> <lgl> -#> 1 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or 279 AC… "" TRUE -#> 2 1052 Tyrion Lannister Male "" "In 273 AC, at Caste… "" TRUE -#> 3 1074 Victarion Greyjoy Male "Ironborn" "In 268 AC or before… "" TRUE -#> 4 1109 Will Male "" "" "In … FALSE -#> 5 1166 Areo Hotah Male "Norvoshi" "In 257 AC or before… "" TRUE -#> 6 1267 Chett Male "" "At Hag's Mire" "In … FALSE -#> # … with 24 more rows +#> id name gender culture born died +#> <int> <chr> <chr> <chr> <chr> <chr> +#> 1 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or 27… "" +#> 2 1052 Tyrion Lannister Male "" "In 273 AC, at C… "" +#> 3 1074 Victarion Greyjoy Male "Ironborn" "In 268 AC or be… "" +#> 4 1109 Will Male "" "" "In 297 AC, at… +#> 5 1166 Areo Hotah Male "Norvoshi" "In 257 AC or be… "" +#> 6 1267 Chett Male "" "At Hag's Mire" "In 299 AC, at… +#> # … with 24 more rows, and 1 more variable: alive <lgl>

                    There are also many list-columns:

                    @@ -828,15 +808,16 @@ Deeply nested unnest_wider(results) locations #> # A tibble: 7 × 6 -#> city address_compone…¹ formatted_address geometry place_id types -#> <chr> <list> <chr> <list> <chr> <list> -#> 1 Houston <list [4]> Houston, TX, USA <named list> ChIJAYW… <list> -#> 2 Washington <list [2]> Washington, USA <named list> ChIJ-bD… <list> -#> 3 Washington <list [4]> Washington, DC, … <named list> ChIJW-T… <list> -#> 4 New York <list [3]> New York, NY, USA <named list> ChIJOwg… <list> -#> 5 Chicago <list [4]> Chicago, IL, USA <named list> ChIJ7cv… <list> -#> 6 Arlington <list [4]> Arlington, TX, U… <named list> ChIJ05g… <list> -#> # … with 1 more row, and abbreviated variable name ¹​address_components +#> city address_compone…¹ formatted_address geometry place_id +#> <chr> <list> <chr> <list> <chr> +#> 1 Houston <list [4]> Houston, TX, USA <named list> ChIJAYWNSLS4QI… +#> 2 Washington <list [2]> Washington, USA <named list> ChIJ-bDD5__lhV… +#> 3 Washington <list [4]> Washington, DC, … <named list> ChIJW-T2Wt7Gt4… +#> 4 New York <list [3]> New York, NY, USA <named list> ChIJOwg_06VPwo… +#> 5 Chicago <list [4]> Chicago, IL, USA <named list> ChIJ7cv00DwsDo… +#> 6 Arlington <list [4]> Arlington, TX, U… <named list> ChIJ05gI5NJiTo… +#> # … with 1 more row, 1 more variable: types <list>, and abbreviated variable +#> # name ¹​address_components

                    Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.

There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the geometry list-column:

                    @@ -937,7 +918,7 @@ locations

                    If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in vignette("rectangling", package = "tidyr").

                    -
                    +

                    Exercises

                    1. Roughly estimate when gh_repos was created. Why can you only roughly estimate the date?

                    2. @@ -965,7 +946,7 @@ Exercises JSON

                      All of the case studies in the previous section were sourced from wild-caught JSON. JSON is short for javascript object notation and is the way that most web APIs return data. It’s important to understand it because while JSON and R’s data types are pretty similar, there isn’t a perfect 1-to-1 mapping, so it’s good to understand a bit about JSON if things go wrong.

                      -
                      +

                      Data types

                      JSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:

                      @@ -1083,7 +1064,7 @@ Translation challenges

Since JSON doesn’t have any way to represent dates or date-times, they’re often stored as ISO8601 date-times in strings, and you’ll need to use readr::parse_date() or readr::parse_datetime() to turn them into the correct data structure. Similarly, JSON’s rules for representing floating point numbers are a little imprecise, so you’ll also sometimes find numbers stored in strings. Apply readr::parse_double() as needed to get the correct variable type.
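For example, a minimal sketch with made-up strings:

readr::parse_datetime("2022-11-18T16:45:00Z")
#> [1] "2022-11-18 16:45:00 UTC"
readr::parse_double("0.1")
#> [1] 0.1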


                      Exercises

                      1. @@ -1110,7 +1091,7 @@ df_row <- tibble(json = json_row)
                      -
                      +

                      Summary

In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly, we only need two new functions: unnest_longer() to put list elements into rows and unnest_wider() to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.

                      diff --git a/oreilly/regexps.html b/oreilly/regexps.html index 2f4fe31..7ed7ee1 100644 --- a/oreilly/regexps.html +++ b/oreilly/regexps.html @@ -1,23 +1,14 @@

                      Regular expressions

                      -
                      +

                      Introduction

In #chp-strings, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use regular expressions, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex” (you can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x)) or “regexp”.

                      The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.

                      -
                      +

                      Prerequisites


                      This chapter relies on features only found in tidyr 1.3.0, which is still in development. If you want to live on the edge, you can get the dev version with devtools::install_github("tidyverse/tidyr").


                      In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.

                      library(tidyverse)
                      @@ -46,11 +37,7 @@ Pattern basics
                       #> [11] │ boysen<berry>
                       #> [19] │ cloud<berry>
                       #> [21] │ cran<berry>
                      -#> [29] │ elder<berry>
                      -#> [32] │ goji <berry>
                      -#> [33] │ goose<berry>
                      -#> [38] │ huckle<berry>
                      -#> ... and 4 more
                      +#> ... and 8 more
                       
                       str_view(fruit, "BERRY")
                      @@ -70,8 +57,7 @@ str_view(fruit, "BERRY") #> [51] │ nect<arine> #> [62] │ pine<apple> #> [64] │ pomegr<anate> -#> [70] │ r<aspbe>rry -#> [73] │ sal<al be>rry +#> ... and 2 more

                      Quantifiers control how many times a pattern can match:

                      • @@ -123,11 +109,7 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]") #> [34] │ alth<ough> #> [37] │ am<ount> #> [46] │ app<oint> -#> [47] │ appr<oach> -#> [52] │ ar<ound> -#> [61] │ <auth>ority -#> [79] │ be<auty> -#> ... and 62 more +#> ... and 66 more

                        (We’ll learn more elegant ways to express these ideas in #sec-quantifiers.)
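As a quick hedged sketch of the quantifiers themselves (the examples are ours):

str_view(c("color", "colour"), "colou?r")   # ? makes the "u" optional
#> [1] │ <color>
#> [2] │ <colour>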

You can use alternation, |, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “pear”, or “banana”, or a repeated vowel.

                        @@ -144,11 +126,6 @@ str_view(fruit, "aa|ee|ii|oo|uu") #> [66] │ purple mangost<ee>n

                        Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature. Let’s kick off that process by practicing with some useful stringr functions.

                        - -
                        -

                        -Exercises

                        -
                      @@ -286,7 +263,7 @@ str_remove_all(x, "[aeiou]")

                      Extract variables

                      -

                      The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: separate_wider_regex(). It’s a peer of the separate_wider_location() and separate_wider_delim() functions that you learned about in #sec-string-columns. These functions live in tidyr because the operates on (columns of) data frames, rather than individual vectors.

                      +

                      The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: separate_wider_regex(). It’s a peer of the separate_wider_position() and separate_wider_delim() functions that you learned about in #sec-string-columns. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.

Let’s create a simple dataset to show how it works. Here we have some data derived from babynames where we have the name, gender, and age of a bunch of people in a rather weird format (we wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!):

                      df <- tribble(
                      @@ -325,7 +302,7 @@ Extract variables
                       

                      If the match fails, you can use too_short = "debug" to figure out what went wrong, just like separate_wider_delim() and separate_wider_position().
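For reference, the basic call looks roughly like this; a sketch assuming the strings live in a column called str and look like "<Sheryl>-F_34":

df |> 
  separate_wider_regex(
    str,
    patterns = c(
      "<", 
      name = "[A-Za-z]+", 
      ">-", 
      gender = ".", 
      "_", 
      age = "[0-9]+"
    )
  )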


                      Exercises

                      1. What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)

                      2. @@ -398,8 +375,8 @@ str_view(fruit, "a$") #> [56] │ papay<a> #> [74] │ satsum<a> -

                        It’s tempting to think that $ should matches the start of a string, because that’s how we write dollar amounts, but it’s not what regular expressions want.

                        -

                        To force a regular expression to only the full string, anchor it with both ^ and $:

                        +

                        It’s tempting to think that $ should match the start of a string, because that’s how we write dollar amounts, but it’s not what regular expressions want.

                        +

                        To force a regular expression to match only the full string, anchor it with both ^ and $:

                        str_view(fruit, "apple")
                         #>  [1] │ <apple>
                        @@ -407,7 +384,7 @@ str_view(fruit, "a$")
                         str_view(fruit, "^apple$")
                         #> [1] │ <apple>
                        -

                        You can also match the boundary between words (i.e. the start or end of a word) with \b. This can be particularly when using RStudio’s find and replace tool. For example, if to find all uses of sum(), you can search for \bsum\b to avoid matching summarize, summary, rowsum and so on:

                        +

You can also match the boundary between words (i.e. the start or end of a word) with \b. This can be particularly useful when using RStudio’s find and replace tool. For example, if you want to find all uses of sum(), you can search for \bsum\b to avoid matching summarize, summary, rowsum and so on:

                        x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
                         str_view(x, "sum")
                        @@ -523,7 +500,7 @@ Operator precedence and parentheses
                         

                        Grouping and capturing

                        -

                        As well overriding operator precedence, parentheses have another important effect: they create capturing groups that allow you to use sub-components of the match.

                        +

                        As well as overriding operator precedence, parentheses have another important effect: they create capturing groups that allow you to use sub-components of the match.

The first way to use a capturing group is to refer back to it within a match with a back reference: \1 refers to the match contained in the first parenthesis, \2 in the second parenthesis, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:

                        str_view(fruit, "(..)\\1")
                        @@ -548,17 +525,13 @@ Grouping and capturing
                         
                        sentences |> 
                           str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |> 
                           str_view()
                        -#>  [1] │ The canoe birch slid on the smooth planks.
                        -#>  [2] │ Glue sheet the to the dark blue background.
                        -#>  [3] │ It's to easy tell the depth of a well.
                        -#>  [4] │ These a days chicken leg is a rare dish.
                        -#>  [5] │ Rice often is served in round bowls.
                        -#>  [6] │ The of juice lemons makes fine punch.
                        -#>  [7] │ The was box thrown beside the parked truck.
                        -#>  [8] │ The were hogs fed chopped corn and garbage.
                        -#>  [9] │ Four of hours steady work faced us.
                        -#> [10] │ A size large in stockings is hard to sell.
                        -#> ... and 710 more
                        +#> [1] │ The canoe birch slid on the smooth planks. +#> [2] │ Glue sheet the to the dark blue background. +#> [3] │ It's to easy tell the depth of a well. +#> [4] │ These a days chicken leg is a rare dish. +#> [5] │ Rice often is served in round bowls. +#> [6] │ The of juice lemons makes fine punch. +#> ... and 714 more

If you want to extract the matches for each group you can use str_match(). But str_match() returns a matrix, so it’s not particularly easy to work with (mostly because we never discuss matrices in this book!):

                        @@ -605,7 +578,7 @@ str_match(x, "gr(?:e|a)y")
                        -
                        +

                        Exercises

                        1. How would you match the literal string "'\? How about "$^$"?

                        2. @@ -645,7 +618,7 @@ Pattern control

                          Regex flags

                          -

                          There are a number of settings that can use to control the details of the regexp. These settings are often called flags in other programming languages. In stringr, you can use these by wrapping the pattern in a call to regex(). The most useful flag is probably ignore_case = TRUE because it allows characters to match either their uppercase or lowercase forms:

                          +

                          There are a number of settings that can be used to control the details of the regexp. These settings are often called flags in other programming languages. In stringr, you can use these by wrapping the pattern in a call to regex(). The most useful flag is probably ignore_case = TRUE because it allows characters to match either their uppercase or lowercase forms:

                          bananas <- c("banana", "Banana", "BANANA")
                           str_view(bananas, "banana")
                          @@ -737,7 +710,7 @@ str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
                           

                          Practice

                          To put these ideas into practice we’ll solve a few semi-authentic problems next. We’ll discuss three general techniques:

                          -
                          1. checking you work by creating simple positive and negative controls
                          2. +
                            1. checking your work by creating simple positive and negative controls
                            2. combining regular expressions with Boolean algebra
                            3. creating complex patterns using string manipulation
                            @@ -753,11 +726,7 @@ Check your work #> [7] │ <The> box was thrown beside the parked truck. #> [8] │ <The> hogs were fed chopped corn and garbage. #> [11] │ <The> boy was there when the sun rose. -#> [13] │ <The> source of the huge river is the clear spring. -#> [18] │ <The> soft cushion broke the man's fall. -#> [19] │ <The> salt breeze came across from the sea. -#> [20] │ <The> girl at the booth sold fifty bonds. -#> ... and 267 more
                          +#> ... and 271 more

Because that pattern also matches sentences starting with words like They or These. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:

                        @@ -768,26 +737,18 @@ Check your work #> [8] │ <The> hogs were fed chopped corn and garbage. #> [11] │ <The> boy was there when the sun rose. #> [13] │ <The> source of the huge river is the clear spring. -#> [18] │ <The> soft cushion broke the man's fall. -#> [19] │ <The> salt breeze came across from the sea. -#> [20] │ <The> girl at the booth sold fifty bonds. -#> [21] │ <The> small pup gnawed a hole in the sock. -#> ... and 246 more +#> ... and 250 more

                        What about finding all sentences that begin with a pronoun?

                        str_view(sentences, "^She|He|It|They\\b")
                        -#>   [3] │ <It>'s easy to tell the depth of a well.
                        -#>  [15] │ <He>lp the woman get back to her feet.
                        -#>  [27] │ <He>r purse was full of useless trash.
                        -#>  [29] │ <It> snowed, rained, and hailed the same morning.
                        -#>  [63] │ <He> ran half way to the hardware store.
                        -#>  [90] │ <He> lay prone and hardly moved a limb.
                        -#> [116] │ <He> ordered peach pie with ice cream.
                        -#> [118] │ <He>mp is a weed found in parts of the tropics.
                        -#> [127] │ <It> caught its hind paw in a rusty trap.
                        -#> [132] │ <He> said the same phrase thirty times.
                        -#> ... and 53 more
+#> [3] │ <It>'s easy to tell the depth of a well.
+#> [15] │ <He>lp the woman get back to her feet.
+#> [27] │ <He>r purse was full of useless trash.
+#> [29] │ <It> snowed, rained, and hailed the same morning.
+#> [63] │ <He> ran half way to the hardware store.
+#> [90] │ <He> lay prone and hardly moved a limb.
+#> ... and 57 more

                        A quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:

@@ -798,11 +759,7 @@ Check your work
#> [90] │ <He> lay prone and hardly moved a limb.
#> [116] │ <He> ordered peach pie with ice cream.
#> [127] │ <It> caught its hind paw in a rusty trap.
-#> [132] │ <He> said the same phrase thirty times.
-#> [153] │ <He> broke a new shoelace that day.
-#> [159] │ <She> sewed the torn coat quite neatly.
-#> [168] │ <He> knew the skill of the great young actress.
-#> ... and 47 more
+#> ... and 51 more

                        You might wonder how you might spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:

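Here is a minimal sketch of that technique (the strings and helper vectors are made up for illustration; this is not the book's elided code):

```r
library(stringr)

pattern <- "^(She|He|It|They)\\b"

# Positive controls: strings the pattern must match
pos <- c("He is here.", "She was there.", "It rained.", "They arrived.")
# Negative controls: near-misses the pattern must reject
neg <- c("Her purse.", "Help arrived.", "Item one.", "Theyre typos.")

str_detect(pos, pattern)  # expect all TRUE
str_detect(neg, pattern)  # expect all FALSE
```
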
@@ -850,11 +807,7 @@ Boolean operations
#> [62] │ <availab>le
#> [66] │ <ba>by
#> [67] │ <ba>ck
-#> [68] │ <ba>d
-#> [69] │ <ba>g
-#> [70] │ <bala>nce
-#> [71] │ <ba>ll
-#> ... and 20 more
+#> ... and 24 more

                        It’s simpler to combine the results of two calls to str_detect():

@@ -897,11 +850,7 @@ Creating a pattern with code
#> [148] │ The spot on the blotter was made by <green> ink.
#> [160] │ The sofa cushion is <red> and of light weight.
#> [174] │ The sky that morning was clear and bright <blue>.
-#> [204] │ A <blue> crane is a tall wading bird.
-#> [217] │ It is hard to erase <blue> or <red> ink.
-#> [224] │ The lamp shone with a steady <green> flame.
-#> [247] │ The box is held by a bright <red> snapper.
-#> ... and 16 more
+#> ... and 20 more

                        But as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?

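One way to get there is a sketch like the following (the color vector is illustrative; str_flatten() collapses it into a single alternation):

```r
library(stringr)

colors <- c("red", "orange", "yellow", "green", "blue", "purple")
pattern <- str_c("\\b(", str_flatten(colors, "|"), ")\\b")
pattern
#> [1] "\\b(red|orange|yellow|green|blue|purple)\\b"
```
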
                        @@ -915,34 +864,26 @@ Creating a pattern with code

                        We could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:

                        str_view(colors())
                        -#>  [1] │ white
                        -#>  [2] │ aliceblue
                        -#>  [3] │ antiquewhite
                        -#>  [4] │ antiquewhite1
                        -#>  [5] │ antiquewhite2
                        -#>  [6] │ antiquewhite3
                        -#>  [7] │ antiquewhite4
                        -#>  [8] │ aquamarine
                        -#>  [9] │ aquamarine1
                        -#> [10] │ aquamarine2
                        -#> ... and 647 more
+#> [1] │ white
+#> [2] │ aliceblue
+#> [3] │ antiquewhite
+#> [4] │ antiquewhite1
+#> [5] │ antiquewhite2
+#> [6] │ antiquewhite3
+#> ... and 651 more

But let's first eliminate the numbered variants:

                        cols <- colors()
                         cols <- cols[!str_detect(cols, "\\d")]
                         str_view(cols)
                        -#>  [1] │ white
                        -#>  [2] │ aliceblue
                        -#>  [3] │ antiquewhite
                        -#>  [4] │ aquamarine
                        -#>  [5] │ azure
                        -#>  [6] │ beige
                        -#>  [7] │ bisque
                        -#>  [8] │ black
                        -#>  [9] │ blanchedalmond
                        -#> [10] │ blue
                        -#> ... and 133 more
+#> [1] │ white
+#> [2] │ aliceblue
+#> [3] │ antiquewhite
+#> [4] │ aquamarine
+#> [5] │ azure
+#> [6] │ beige
+#> ... and 137 more

                        Then we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:

@@ -954,16 +895,12 @@ str_view(sentences, pattern)
#> [66] │ Cars and busses stalled in <snow> drifts.
#> [92] │ A wisp of cloud hung in the <blue> air.
#> [112] │ Leaves turn <brown> and <yellow> in the fall.
-#> [148] │ The spot on the blotter was made by <green> ink.
-#> [149] │ Mud was spattered on the front of his <white> shirt.
-#> [160] │ The sofa cushion is <red> and of light weight.
-#> [167] │ The office paint was a dull, sad <tan>.
-#> ... and 53 more
+#> ... and 57 more
                        -

                        In this example, cols only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create create patterns from existing strings it’s wise to run them through str_escape() to ensure they match literally.

                        +

                        In this example, cols only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through str_escape() to ensure they match literally.

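A quick sketch of the difference (illustrative strings; str_escape() is available in stringr 1.5.0 and later):

```r
library(stringr)

x <- c("abc", "a.c", "a*c")
str_detect(x, "a.c")              # "." is a metacharacter, so all three match
#> [1] TRUE TRUE TRUE
str_detect(x, str_escape("a.c"))  # escaped, only the literal "a.c" matches
#> [1] FALSE TRUE FALSE
```
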
                      -
                      +

                      Exercises

1.

@@ -988,8 +925,8 @@ Regular expressions in other places

tidyverse

There are three other particularly useful places where you might want to use a regular expression (see the sketch after this list):

                        • matches(pattern) will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use anywhere in any tidyverse function that selects variables (e.g. select(), rename_with() and across()).

-• pivot_longer()'s names_pattern argument takes a vector of regular expressions, just like separate_with_regex(). It's useful when extracting data out of variable names with a complex structure.
-• The delim argument in separate_delim_longer() and separate_delim_wider() usually matches a fixed string, but you can use regex() to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. regex(", ?").

+• pivot_longer()'s names_pattern argument takes a vector of regular expressions, just like separate_wider_regex(). It's useful when extracting data out of variable names with a complex structure.
+• The delim argument in separate_longer_delim() and separate_wider_delim() usually matches a fixed string, but you can use regex() to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. regex(", ?").

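A compact sketch of all three uses (the tiny data frames and column names are invented for illustration):

```r
library(tidyverse)

df <- tibble(x_1 = 1, x_2 = 2, y_1 = 3)

# matches(): tidyselect with a regular expression
df |> select(matches("^x_\\d$"))

# names_pattern: extract structure from column names while pivoting
df |> pivot_longer(
  cols = matches("^x_\\d$"),
  names_to = "id",
  names_pattern = "x_(\\d)"
)

# regex() as a delimiter: a comma optionally followed by a space
tibble(s = "a, b,c") |>
  separate_longer_delim(s, delim = regex(", ?"))
```
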
                      @@ -1011,7 +948,7 @@ Base R
                      -
                      +

                      Summary

                      With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.

diff --git a/oreilly/spreadsheets.html b/oreilly/spreadsheets.html
index 753aa97..9e9b3d4 100644
--- a/oreilly/spreadsheets.html
+++ b/oreilly/spreadsheets.html
@@ -1,6 +1,6 @@

                      Spreadsheets

                      -
                      +

                      Introduction

                      So far, you have learned about importing data from plain text files, e.g., .csv and .tsv files. Sometimes you need to analyze data that lives in a spreadsheet. This chapter will introduce you to tools for working with data in Excel spreadsheets and Google Sheets. This will build on much of what you’ve learned in #chp-data-import, but we will also discuss additional considerations and complexities when working with data from spreadsheets.

                      @@ -11,7 +11,7 @@ Introduction

                      Excel

                      -
                      +

                      Prerequisites

                      In this section, you’ll learn how to load data from Excel spreadsheets in R with the readxl package. This package is non-core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package.

                      @@ -190,15 +190,16 @@ Reading worksheets
                      read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
                       #> # A tibble: 52 × 8
                      -#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                      -#>   <chr>   <chr>    <chr>          <chr>         <chr>             <chr>      
                      -#> 1 Adelie  Torgers… 39.1           18.7          181               3750       
                      -#> 2 Adelie  Torgers… 39.5           17.399999999… 186               3800       
                      -#> 3 Adelie  Torgers… 40.2999999999… 18            195               3250       
                      -#> 4 Adelie  Torgers… NA             NA            NA                NA         
                      -#> 5 Adelie  Torgers… 36.7000000000… 19.3          193               3450       
                      -#> 6 Adelie  Torgers… 39.2999999999… 20.6          190               3650       
                      -#> # … with 46 more rows, and 2 more variables: sex <chr>, year <dbl>
+#>   species island    bill_length_mm     bill_depth_mm      flipper_length_mm
+#>   <chr>   <chr>     <chr>              <chr>              <chr>
+#> 1 Adelie  Torgersen 39.1               18.7               181
+#> 2 Adelie  Torgersen 39.5               17.399999999999999 186
+#> 3 Adelie  Torgersen 40.299999999999997 18                 195
+#> 4 Adelie  Torgersen NA                 NA                 NA
+#> 5 Adelie  Torgersen 36.700000000000003 19.3               193
+#> 6 Adelie  Torgersen 39.299999999999997 20.6               190
+#> # … with 46 more rows, and 3 more variables: body_mass_g <chr>, sex <chr>,
+#> #   year <dbl>

                      Some variables that appear to contain numerical data are read in as characters due to the character string "NA" not being recognized as a true NA.

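The likely fix, elided by the diff above, is a sketch along these lines: tell read_excel() which strings encode missingness via its na argument.

```r
library(readxl)

penguins_torgersen <- read_excel(
  "data/penguins.xlsx",
  sheet = "Torgersen Island",
  na = "NA"  # treat the literal string "NA" as missing
)
```
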
@@ -206,15 +207,16 @@ Reading worksheets
penguins_torgersen
#> # A tibble: 52 × 8
-#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
-#>   <chr>   <chr>             <dbl>         <dbl>             <dbl>       <dbl>
-#> 1 Adelie  Torgers…           39.1          18.7               181        3750
-#> 2 Adelie  Torgers…           39.5          17.4               186        3800
-#> 3 Adelie  Torgers…           40.3          18                 195        3250
-#> 4 Adelie  Torgers…           NA            NA                  NA          NA
-#> 5 Adelie  Torgers…           36.7          19.3               193        3450
-#> 6 Adelie  Torgers…           39.3          20.6               190        3650
-#> # … with 46 more rows, and 2 more variables: sex <chr>, year <dbl>
+#>   species island    bill_length_mm bill_depth_mm flipper_length_mm
+#>   <chr>   <chr>              <dbl>         <dbl>             <dbl>
+#> 1 Adelie  Torgersen           39.1          18.7               181
+#> 2 Adelie  Torgersen           39.5          17.4               186
+#> 3 Adelie  Torgersen           40.3          18                 195
+#> 4 Adelie  Torgersen           NA            NA                  NA
+#> 5 Adelie  Torgersen           36.7          19.3               193
+#> 6 Adelie  Torgersen           39.3          20.6               190
+#> # … with 46 more rows, and 3 more variables: body_mass_g <dbl>, sex <chr>,
+#> #   year <dbl>

                      Alternatively, you can use excel_sheets() to get information on all worksheets in an Excel spreadsheet, and then read the one(s) you’re interested in.

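For example (a sketch; the sheet names here are assumed to follow the penguins workbook used above):

```r
excel_sheets("data/penguins.xlsx")
#> [1] "Torgersen Island" "Biscoe Island"    "Dream Island"
```
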
                      @@ -240,15 +242,16 @@ dim(penguins_dream)
                      penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
                       penguins
                       #> # A tibble: 344 × 8
                      -#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                      -#>   <chr>   <chr>             <dbl>         <dbl>             <dbl>       <dbl>
                      -#> 1 Adelie  Torgers…           39.1          18.7               181        3750
                      -#> 2 Adelie  Torgers…           39.5          17.4               186        3800
                      -#> 3 Adelie  Torgers…           40.3          18                 195        3250
                      -#> 4 Adelie  Torgers…           NA            NA                  NA          NA
                      -#> 5 Adelie  Torgers…           36.7          19.3               193        3450
                      -#> 6 Adelie  Torgers…           39.3          20.6               190        3650
                      -#> # … with 338 more rows, and 2 more variables: sex <chr>, year <dbl>
+#>   species island    bill_length_mm bill_depth_mm flipper_length_mm
+#>   <chr>   <chr>              <dbl>         <dbl>             <dbl>
+#> 1 Adelie  Torgersen           39.1          18.7               181
+#> 2 Adelie  Torgersen           39.5          17.4               186
+#> 3 Adelie  Torgersen           40.3          18                 195
+#> 4 Adelie  Torgersen           NA            NA                  NA
+#> 5 Adelie  Torgersen           36.7          19.3               193
+#> 6 Adelie  Torgersen           39.3          20.6               190
+#> # … with 338 more rows, and 3 more variables: body_mass_g <dbl>, sex <chr>,
+#> #   year <dbl>

                      In #chp-iteration we’ll talk about ways of doing this sort of task without repetitive code.

@@ -277,14 +280,14 @@ deaths <- read_excel(deaths_path)
#> • `` -> `...6`
deaths
#> # A tibble: 18 × 6
-#>   `Lots of people`              ...2       ...3  ...4     ...5          ...6
-#>   <chr>                         <chr>      <chr> <chr>    <chr>         <chr>
-#> 1 simply cannot resist writing  <NA>       <NA>  <NA>     <NA>          some …
-#> 2 at the top                    <NA>       of their…
-#> 3 or merging                    <NA>       <NA>  <NA>     cells
-#> 4 Name                          Profession Age   Has kids Date of birth Date …
-#> 5 David Bowie                   musician   69    TRUE     17175         42379
-#> 6 Carrie Fisher                 actor      60    TRUE     20749         42731
+#>   `Lots of people`    ...2       ...3  ...4     ...5          ...6
+#>   <chr>               <chr>      <chr> <chr>    <chr>         <chr>
+#> 1 simply cannot resi… <NA>       <NA>  <NA>     <NA>          some notes
+#> 2 at the top          <NA>       of their spreadsh…
+#> 3 or merging          <NA>       <NA>  <NA>     cells
+#> 4 Name                Profession Age   Has kids Date of birth Date of death
+#> 5 David Bowie         musician   69    TRUE     17175         42379
+#> 6 Carrie Fisher       actor      60    TRUE     20749         42731
#> # … with 12 more rows

                      The top three rows and the bottom four rows are not part of the data frame.

                      @@ -292,29 +295,29 @@ deaths
                      read_excel(deaths_path, skip = 4)
                       #> # A tibble: 14 × 6
                      -#>   Name        Profession Age   `Has kids` `Date of birth`     `Date of death`
                      -#>   <chr>       <chr>      <chr> <chr>      <dttm>              <chr>          
                      -#> 1 David Bowie musician   69    TRUE       1947-01-08 00:00:00 42379          
                      -#> 2 Carrie Fis… actor      60    TRUE       1956-10-21 00:00:00 42731          
                      -#> 3 Chuck Berry musician   90    TRUE       1926-10-18 00:00:00 42812          
                      -#> 4 Bill Paxton actor      61    TRUE       1955-05-17 00:00:00 42791          
                      -#> 5 Prince      musician   57    TRUE       1958-06-07 00:00:00 42481          
                      -#> 6 Alan Rickm… actor      69    FALSE      1946-02-21 00:00:00 42383          
                      -#> # … with 8 more rows
+#>   Name          Profession Age   `Has kids` `Date of birth`
+#>   <chr>         <chr>      <chr> <chr>      <dttm>
+#> 1 David Bowie   musician   69    TRUE       1947-01-08 00:00:00
+#> 2 Carrie Fisher actor      60    TRUE       1956-10-21 00:00:00
+#> 3 Chuck Berry   musician   90    TRUE       1926-10-18 00:00:00
+#> 4 Bill Paxton   actor      61    TRUE       1955-05-17 00:00:00
+#> 5 Prince        musician   57    TRUE       1958-06-07 00:00:00
+#> 6 Alan Rickman  actor      69    FALSE      1946-02-21 00:00:00
+#> # … with 8 more rows, and 1 more variable: `Date of death` <chr>

                      We could also set n_max to omit the extraneous rows at the bottom.

                      read_excel(deaths_path, skip = 4, n_max = 10)
                       #> # A tibble: 10 × 6
                      -#>   Name    Profession   Age `Has kids` `Date of birth`     `Date of death`    
                      -#>   <chr>   <chr>      <dbl> <lgl>      <dttm>              <dttm>             
                      -#> 1 David … musician      69 TRUE       1947-01-08 00:00:00 2016-01-10 00:00:00
                      -#> 2 Carrie… actor         60 TRUE       1956-10-21 00:00:00 2016-12-27 00:00:00
                      -#> 3 Chuck … musician      90 TRUE       1926-10-18 00:00:00 2017-03-18 00:00:00
                      -#> 4 Bill P… actor         61 TRUE       1955-05-17 00:00:00 2017-02-25 00:00:00
                      -#> 5 Prince  musician      57 TRUE       1958-06-07 00:00:00 2016-04-21 00:00:00
                      -#> 6 Alan R… actor         69 FALSE      1946-02-21 00:00:00 2016-01-14 00:00:00
                      -#> # … with 4 more rows
+#>   Name          Profession   Age `Has kids` `Date of birth`
+#>   <chr>         <chr>      <dbl> <lgl>      <dttm>
+#> 1 David Bowie   musician      69 TRUE       1947-01-08 00:00:00
+#> 2 Carrie Fisher actor         60 TRUE       1956-10-21 00:00:00
+#> 3 Chuck Berry   musician      90 TRUE       1926-10-18 00:00:00
+#> 4 Bill Paxton   actor         61 TRUE       1955-05-17 00:00:00
+#> 5 Prince        musician      57 TRUE       1958-06-07 00:00:00
+#> 6 Alan Rickman  actor         69 FALSE      1946-02-21 00:00:00
+#> # … with 4 more rows, and 1 more variable: `Date of death` <dttm>

                      Another approach is using cell ranges. In Excel, the top left cell is A1. As you move across columns to the right, the cell label moves down the alphabet, i.e. B1, C1, etc. And as you move down a column, the number in the cell label increases, i.e. A2, A3, etc.

                      The data we want to read in starts in cell A5 and ends in cell F15. In spreadsheet notation, this is A5:F15.

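In readxl this maps directly onto the range argument (a sketch; deaths_path is defined above):

```r
read_excel(deaths_path, range = "A5:F15")
```
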
                      @@ -332,7 +335,7 @@ deaths
                      -
                      +

                      Data types

                      In CSV files, all values are strings. This is not particularly true to the data, but it is simple: everything is a string.

                      @@ -399,7 +402,7 @@ write_xlsx(bake_sale, path = "data/bake-sale.xlsx")

                      Formatted output

                      -

                      The readxl package is a light-weight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the openxlsx package. Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might need to use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.

                      +

                      The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the openxlsx package. Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might need to use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.

                      Below we show how to write a spreadsheet with three sheets, one for each species of penguins in the penguins data frame.

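The elided code probably has roughly this shape (a sketch, assuming the penguins data frame assembled earlier with bind_rows(); the output file name is hypothetical):

```r
library(openxlsx)

wb <- createWorkbook()
for (sp in unique(penguins$species)) {
  # One worksheet per species, written as a formatted Excel table
  addWorksheet(wb, sheetName = sp)
  writeDataTable(wb, sheet = sp, x = penguins[penguins$species == sp, ])
}
saveWorkbook(wb, "data/penguins-species.xlsx", overwrite = TRUE)
```
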
                      library(openxlsx)
                      @@ -466,7 +469,7 @@ writeDataTable(
                       

                      See https://ycphs.github.io/openxlsx/articles/Formatting.html for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.

                      -
                      +

                      Exercises

1.

@@ -595,8 +598,8 @@ Read sheets

                        The first argument to read_sheet() is the URL of the file to read. You can also access this file via https://pos.it/r4ds-students, however note that at the time of writing this book you can’t read a sheet directly from a short link.

                        students <- read_sheet("https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w/edit?usp=sharing")
                        -#> ✔ Reading from "students".
                        -#> ✔ Range 'Sheet1'.
+#> ✔ Reading from students.
+#> ✔ Range Sheet1.

                        read_sheet() will read the file in as a tibble.

@@ -624,8 +627,8 @@ Read sheets
    age = if_else(age == "five", "5", age),
    age = parse_number(age)
  )
-#> ✔ Reading from "students".
-#> ✔ Range '2:10000000'.
+#> ✔ Reading from students.
+#> ✔ Range 2:10000000.

students
#> # A tibble: 6 × 5

@@ -642,18 +645,19 @@ students

                        It’s also possible to read individual sheets from Google Sheets as well. Let’s read the penguins Google Sheet at https://pos.it/r4ds-penguins, and specifically the “Torgersen Island” sheet in it.

                        read_sheet("https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY/edit?usp=sharing", sheet = "Torgersen Island")
                        -#> ✔ Reading from "penguins".
                        +#> ✔ Reading from penguins.
                         #> ✔ Range ''Torgersen Island''.
                         #> # A tibble: 52 × 8
                        -#>   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
                        -#>   <chr>   <chr>    <list>         <list>        <list>            <list>     
                        -#> 1 Adelie  Torgers… <dbl [1]>      <dbl [1]>     <dbl [1]>         <dbl [1]>  
                        -#> 2 Adelie  Torgers… <dbl [1]>      <dbl [1]>     <dbl [1]>         <dbl [1]>  
                        -#> 3 Adelie  Torgers… <dbl [1]>      <dbl [1]>     <dbl [1]>         <dbl [1]>  
                        -#> 4 Adelie  Torgers… <chr [1]>      <chr [1]>     <chr [1]>         <chr [1]>  
                        -#> 5 Adelie  Torgers… <dbl [1]>      <dbl [1]>     <dbl [1]>         <dbl [1]>  
                        -#> 6 Adelie  Torgers… <dbl [1]>      <dbl [1]>     <dbl [1]>         <dbl [1]>  
                        -#> # … with 46 more rows, and 2 more variables: sex <chr>, year <dbl>
+#>   species island    bill_length_mm bill_depth_mm flipper_length_mm
+#>   <chr>   <chr>     <list>         <list>        <list>
+#> 1 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>
+#> 2 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>
+#> 3 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>
+#> 4 Adelie  Torgersen <chr [1]>      <chr [1]>     <chr [1]>
+#> 5 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>
+#> 6 Adelie  Torgersen <dbl [1]>      <dbl [1]>     <dbl [1]>
+#> # … with 46 more rows, and 3 more variables: body_mass_g <list>, sex <chr>,
+#> #   year <dbl>

                        You can obtain a list of all sheets within a Google Sheet with sheet_names():

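For example (a sketch reusing the penguins sheet URL from above; the names are assumed to mirror the Excel version of the data):

```r
library(googlesheets4)

sheet_names("https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY")
#> [1] "Torgersen Island" "Biscoe Island"    "Dream Island"
```
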
                        @@ -664,19 +668,19 @@ students
                        deaths_url <- gs4_example("deaths")
                         deaths <- read_sheet(deaths_url, range = "A5:F15")
                        -#> ✔ Reading from "deaths".
                        -#> ✔ Range 'A5:F15'.
                        +#> ✔ Reading from deaths.
                        +#> ✔ Range A5:F15.
                         deaths
                         #> # A tibble: 10 × 6
                        -#>   Name    Profession   Age `Has kids` `Date of birth`     `Date of death`    
                        -#>   <chr>   <chr>      <dbl> <lgl>      <dttm>              <dttm>             
                        -#> 1 David … musician      69 TRUE       1947-01-08 00:00:00 2016-01-10 00:00:00
                        -#> 2 Carrie… actor         60 TRUE       1956-10-21 00:00:00 2016-12-27 00:00:00
                        -#> 3 Chuck … musician      90 TRUE       1926-10-18 00:00:00 2017-03-18 00:00:00
                        -#> 4 Bill P… actor         61 TRUE       1955-05-17 00:00:00 2017-02-25 00:00:00
                        -#> 5 Prince  musician      57 TRUE       1958-06-07 00:00:00 2016-04-21 00:00:00
                        -#> 6 Alan R… actor         69 FALSE      1946-02-21 00:00:00 2016-01-14 00:00:00
                        -#> # … with 4 more rows
+#>   Name          Profession   Age `Has kids` `Date of birth`
+#>   <chr>         <chr>      <dbl> <lgl>      <dttm>
+#> 1 David Bowie   musician      69 TRUE       1947-01-08 00:00:00
+#> 2 Carrie Fisher actor         60 TRUE       1956-10-21 00:00:00
+#> 3 Chuck Berry   musician      90 TRUE       1926-10-18 00:00:00
+#> 4 Bill Paxton   actor         61 TRUE       1955-05-17 00:00:00
+#> 5 Prince        musician      57 TRUE       1958-06-07 00:00:00
+#> 6 Alan Rickman  actor         69 FALSE      1946-02-21 00:00:00
+#> # … with 4 more rows, and 1 more variable: `Date of death` <dttm>
                      @@ -700,7 +704,7 @@ Authentication

When you attempt to read in a sheet that requires authentication, googlesheets4 will direct you to a web browser with a prompt to sign in to your Google account and grant permission to operate on your behalf with Google Sheets. However, if you want to specify a specific Google account, authentication scope, etc. you can do so with gs4_auth(), e.g. gs4_auth(email = "mine@example.com"), which will force the use of a token associated with a specific email. For further authentication details, we recommend reading the googlesheets4 auth vignette: https://googlesheets4.tidyverse.org/articles/auth.html.

                      -
                      +

                      Exercises

                      1. Read the students dataset from earlier in the chapter from Excel and also from Google Sheets, with no additional arguments supplied to the read_excel() and read_sheet() functions. Are the resulting data frames in R exactly the same? If not, how are they different?

2.

@@ -728,7 +732,7 @@ Exercises
                      -
                      +

                      Summary

In this chapter, you learned how to read data into R from spreadsheets: from Microsoft Excel with read_excel() from the readxl package and from Google Sheets with read_sheet() from the googlesheets4 package. These functions work very similarly to each other and have similar arguments for specifying column names, NA strings, rows to skip at the top of the file you're reading in, etc. Additionally, both functions make it possible to read a single sheet from a spreadsheet as well.

diff --git a/oreilly/strings.html b/oreilly/strings.html
index 7c2163d..2a50f0e 100644
--- a/oreilly/strings.html
+++ b/oreilly/strings.html
@@ -1,24 +1,15 @@

                      Strings

                      -
                      +

                      Introduction

                      So far, you’ve used a bunch of strings without learning much about the details. Now it’s time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.

                      We’ll begin with the details of creating strings and character vectors. You’ll then dive into creating strings from data, then the opposite; extracting strings from data. We’ll then discuss tools that work with individual letters. The chapter finishes with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.

                      We’ll keep working with strings in the next chapter, where you’ll learn more about the power of regular expressions.

                      -
                      +

                      Prerequisites

-This chapter relies on features only found in tidyr 1.3.0, which is still in development. If you want to live on the edge, you can get the dev versions with devtools::install_github("tidyverse/tidyr").

                      In this chapter, we’ll use functions from the stringr package, which is part of the core tidyverse. We’ll also use the babynames data since it provides some fun strings to manipulate.

                      library(tidyverse)
                      @@ -113,7 +104,7 @@ str_view(x)
                       

                      Note that str_view() uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there’s a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.

                      -
                      +

                      Exercises

1.

@@ -138,7 +129,7 @@ Creating many strings from data

-str_c()
+str_c()

                        str_c() takes any number of vectors as arguments and returns a character vector:

                        @@ -151,16 +142,14 @@ str_c("Hello ", c("John", "Susan"))

                        str_c() is very similar to the base paste0(), but is designed to be used with mutate() by obeying the usual tidyverse rules for recycling and propagating missing values:

                        -
                        set.seed(1410)
                        -df <- tibble(name = c(wakefield::name(3), NA))
                        +
                        df <- tibble(name = c("Flora", "David", "Terra"))
                         df |> mutate(greeting = str_c("Hi ", name, "!"))
                        -#> # A tibble: 4 × 2
                        -#>   name       greeting      
                        -#>   <chr>      <chr>         
                        -#> 1 Ilena      Hi Ilena!     
                        -#> 2 Sacramento Hi Sacramento!
                        -#> 3 Graylon    Hi Graylon!   
                        -#> 4 <NA>       <NA>
+#> # A tibble: 3 × 2
+#>   name  greeting
+#>   <chr> <chr>
+#> 1 Flora Hi Flora!
+#> 2 David Hi David!
+#> 3 Terra Hi Terra!

                        If you want missing values to display in another way, use coalesce() to replace them. Depending on what you want, you might use it either inside or outside of str_c():

@@ -169,48 +158,45 @@ df |> mutate(greeting = str_c("Hi ", name, "!"))
    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
  )
-#> # A tibble: 4 × 3
-#>   name       greeting1      greeting2
-#>   <chr>      <chr>          <chr>
-#> 1 Ilena      Hi Ilena!      Hi Ilena!
-#> 2 Sacramento Hi Sacramento! Hi Sacramento!
-#> 3 Graylon    Hi Graylon!    Hi Graylon!
-#> 4 <NA>       Hi you!        Hi!
+#> # A tibble: 3 × 3
+#>   name  greeting1 greeting2
+#>   <chr> <chr>     <chr>
+#> 1 Flora Hi Flora! Hi Flora!
+#> 2 David Hi David! Hi David!
+#> 3 Terra Hi Terra! Hi Terra!

-str_glue()
+str_glue()

If you are mixing many fixed and variable strings with str_c(), you'll notice that you type a lot of "s, making it hard to see the overall goal of the code. An alternative approach is provided by the glue package via str_glue() (if you're not using stringr, you can also access it directly with glue::glue()). You give it a single string that has a special feature: anything inside {} will be evaluated like it's outside of the quotes:

                        df |> mutate(greeting = str_glue("Hi {name}!"))
                        -#> # A tibble: 4 × 2
                        -#>   name       greeting      
                        -#>   <chr>      <glue>        
                        -#> 1 Ilena      Hi Ilena!     
                        -#> 2 Sacramento Hi Sacramento!
                        -#> 3 Graylon    Hi Graylon!   
                        -#> 4 <NA>       Hi NA!
+#> # A tibble: 3 × 2
+#>   name  greeting
+#>   <chr> <glue>
+#> 1 Flora Hi Flora!
+#> 2 David Hi David!
+#> 3 Terra Hi Terra!

As you can see, str_glue() currently converts missing values to the string "NA", unfortunately making it inconsistent with str_c().

You also might wonder what happens if you need to include a regular { or } in your string. You're on the right track if you guess you'll need to escape it somehow. The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like \, you double up the special characters:

                        df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
                        -#> # A tibble: 4 × 2
                        -#>   name       greeting        
                        -#>   <chr>      <glue>          
                        -#> 1 Ilena      {Hi Ilena!}     
                        -#> 2 Sacramento {Hi Sacramento!}
                        -#> 3 Graylon    {Hi Graylon!}   
                        -#> 4 <NA>       {Hi NA!}
+#> # A tibble: 3 × 2
+#>   name  greeting
+#>   <chr> <glue>
+#> 1 Flora {Hi Flora!}
+#> 2 David {Hi David!}
+#> 3 Terra {Hi Terra!}

-str_flatten()
+str_flatten()

str_c() and str_glue() work well with mutate() because their output is the same length as their inputs. What if you want a function that works well with summarize(), i.e., something that always returns a single string? That's the job of str_flatten() (the base R equivalent is paste() used with the collapse argument): it takes a character vector and combines each element of the vector into a single string:

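A minimal sketch with a made-up vector (the last argument needs stringr 1.5.0 or later):

```r
library(stringr)

str_flatten(c("x", "y", "z"))
#> [1] "xyz"
str_flatten(c("x", "y", "z"), collapse = ", ", last = ", and ")
#> [1] "x, y, and z"
```
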
                        @@ -244,7 +230,7 @@ df |>
                        -
                        +

                        Exercises

1.

@@ -598,7 +584,12 @@ Long strings
                        2. str_wrap(x, 30) wraps a string introducing new lines so that each line is at most 30 characters (it doesn’t hyphenate, however, so any word longer than 30 characters will make a longer line)

                        3. The following code shows these functions in action with a made-up string:

                          -
                          x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
                          +
                          x <- paste0(
                          +  "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod ",
                          +  "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim ",
                          +  "veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea",
                          +  "commodo consequat."
                          +)
                           
                           str_view(str_trunc(x, 30))
                           #> [1] │ Lorem ipsum dolor sit amet,...
                          @@ -610,12 +601,12 @@ str_view(str_wrap(x, 30))
                           #>     │ magna aliqua. Ut enim ad
                           #>     │ minim veniam, quis nostrud
                           #>     │ exercitation ullamco laboris
#>     │ nisi ut aliquip ex ea commodo
                           #>     │ consequat.
                        -
                        +

                        Exercises

                        1. Use str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
2.

@@ -734,7 +725,7 @@ str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
                        -
                        +

                        Summary

                        In this chapter, you’ve learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it’s time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter.

diff --git a/oreilly/webscraping.html b/oreilly/webscraping.html
index 7911892..1edc6a5 100644
--- a/oreilly/webscraping.html
+++ b/oreilly/webscraping.html
@@ -1,6 +1,6 @@

                        Web scraping

                        This vignette introduces you to the basics of web scraping with rvest. Web scraping is a very useful tool for extracting data from web pages. Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from #chp-rectangling. Where possible, you should use the API, because typically it will give you more reliable data. Unfortunately, however, programming with web APIs is out of scope for this book. Instead, we are teaching scraping, a technique that works whether or not a site provides an API.

                        In this chapter, we’ll first discuss the ethics and legalities of scraping before we dive into the basics of HTML. You’ll then learn the basics of CSS selectors to locate specific elements on the page, and how to use rvest functions to get data from text and attributes out of HTML and into R. We’ll then discuss some techniques to figure out what CSS selector you need for the page you’re scraping, before finishing up with a couple of case studies, and a brief discussion of dynamic websites.

                        -
                        +

                        Prerequisites

                        In this chapter, we’ll focus on tools provided by rvest. rvest is a member of the tidyverse, but is not a core member so you’ll need to load it explicitly. We’ll also load the full tidyverse since we’ll find it generally useful working with the data we’ve scraped.

                        @@ -240,7 +240,7 @@ html |>

                        html_attr() always returns a string, so if you’re extracting numbers or dates, you’ll need to do some post-processing.

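For example (a sketch with made-up HTML; readr::parse_number() is one convenient post-processing step):

```r
library(rvest)

html <- minimal_html("<a href='https://example.com' data-count='1,234'>link</a>")
html |>
  html_element("a") |>
  html_attr("data-count") |>
  readr::parse_number()
#> [1] 1234
```
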
                        -
                        +

                        Tables

                        If you’re lucky, your data will be already stored in an HTML table, and it’ll be a matter of just reading it from that table. It’s usually straightforward to recognize a table in your browser: it’ll have a rectangular structure of rows and columns, and you can copy and paste it into a tool like Excel.

                        @@ -248,22 +248,10 @@ Tables
                        html <- minimal_html("
                           <table class='mytable'>
                        -    <tr>
                        -      <th>x</th>
                        -      <th>y</th>
                        -    </tr>
                        -    <tr>
                        -      <td>1.5</td>
                        -      <td>2.7</td>
                        -    </tr>
                        -    <tr>
                        -      <td>4.9</td>
                        -      <td>1.3</td>
                        -    </tr>
                        -    <tr>
                        -      <td>7.2</td>
                        -      <td>8.1</td>
                        -    </tr>
                        +    <tr><th>x</th>   <th>y</th></tr>
                        +    <tr><td>1.5</td> <td>2.7</td></tr>
                        +    <tr><td>4.9</td> <td>1.3</td></tr>
                        +    <tr><td>7.2</td> <td>8.1</td></tr>
                           </table>
                           ")
@@ -374,7 +362,6 @@ section |> html_element(".director") |> html_text2()

IMDB top films

                        For our next task we’ll tackle something a little trickier, extracting the top 250 movies from the internet movie database (IMDb). At the time we wrote this chapter, the page looked like #fig-scraping-imdb.

                        -
                        knitr::include_graphics("screenshots/scraping-imdb.png", dpi = 300)

                        The screenshot shows a table with columns "Rank and Title", "IMDb Rating", and "Your Rating". 9 movies out of the top 250 are shown. The top 5 are the Shawshank Redemption, The Godfather, The Dark Knight, The Godfather: Part II, and 12 Angry Men.

@@ -392,14 +379,14 @@ table <- html |>
  html_table()
table
#> # A tibble: 250 × 5
-#>   ``    `Rank & Title`                 `IMDb Rating` `Your Rating` ``
-#>   <lgl> <chr>                                  <dbl> <chr>         <lgl>
-#> 1 NA    "1.\n The Shawshank Redemptio…           9.2 "12345678910… NA
-#> 2 NA    "2.\n The Godfather\n …                  9.2 "12345678910… NA
-#> 3 NA    "3.\n The Dark Knight\n …                9   "12345678910… NA
-#> 4 NA    "4.\n The Godfather: Part II\…           9   "12345678910… NA
-#> 5 NA    "5.\n 12 Angry Men\n (…                  9   "12345678910… NA
-#> 6 NA    "6.\n Schindler's List\n …               8.9 "12345678910… NA
+#>   ``    `Rank & Title`               `IMDb Rating` `Your Rating`   ``
+#>   <lgl> <chr>                                <dbl> <chr>           <lgl>
+#> 1 NA    "1.\n The Shawshank Redempt…           9.2 "12345678910\n… NA
+#> 2 NA    "2.\n The Godfather\n …                9.2 "12345678910\n… NA
+#> 3 NA    "3.\n The Dark Knight\n …              9   "12345678910\n… NA
+#> 4 NA    "4.\n The Godfather: Part I…           9   "12345678910\n… NA
+#> 5 NA    "5.\n 12 Angry Men\n …                 9   "12345678910\n… NA
+#> 6 NA    "6.\n Schindler's List\n …             8.9 "12345678910\n… NA
#> # … with 244 more rows

                        This includes a few empty columns, but overall does a good job of capturing the information from the table. However, we need to do some more processing to make it easier to use. First, we’ll rename the columns to be easier to work with, and remove the extraneous whitespace in rank and title. We will do this with select() (instead of rename()) to do the renaming and selecting of just these two columns in one step. Then, we’ll apply separate_wider_regex() (from #sec-extract-variables) to pull out the title, year, and rank into their own variables.

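The elided cleaning step likely looks something like this sketch (the exact patterns are a plausible reconstruction, not the book's code):

```r
ratings <- table |>
  select(
    rank_title_year = `Rank & Title`,
    rating = `IMDb Rating`
  ) |>
  mutate(
    # Collapse the newline + indentation inside each cell to a single space
    rank_title_year = str_replace_all(rank_title_year, "\n +", " ")
  ) |>
  separate_wider_regex(
    rank_title_year,
    patterns = c(
      rank = "\\d+", "\\. ",
      title = ".+", " \\(",
      year = "\\d+", "\\)"
    )
  )
```
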
@@ -438,12 +425,12 @@ ratings
  html_elements("td strong") |>
  head() |>
  html_attr("title")
-#> [1] "9.2 based on 2,684,096 user ratings"
-#> [2] "9.2 based on 1,861,107 user ratings"
-#> [3] "9.0 based on 2,657,484 user ratings"
-#> [4] "9.0 based on 1,273,669 user ratings"
-#> [5] "9.0 based on 792,941 user ratings"
-#> [6] "8.9 based on 1,357,901 user ratings"
+#> [1] "9.2 based on 2,691,480 user ratings"
+#> [2] "9.2 based on 1,867,146 user ratings"
+#> [3] "9.0 based on 2,665,189 user ratings"
+#> [4] "9.0 based on 1,276,943 user ratings"
+#> [5] "9.0 based on 795,129 user ratings"
+#> [6] "8.9 based on 1,361,148 user ratings"

                        We can combine this with the tabular data and again apply separate_wider_regex() to extract out the bit of data we care about:

@@ -465,12 +452,12 @@ ratings
#> # A tibble: 250 × 5
#>   rank  title                    year  rating  number
#>   <chr> <chr>                    <chr>  <dbl>   <dbl>
-#> 1 1     The Shawshank Redemption 1994     9.2 2684096
-#> 2 2     The Godfather            1972     9.2 1861107
-#> 3 3     The Dark Knight          2008     9   2657484
-#> 4 4     The Godfather: Part II   1974     9   1273669
-#> 5 5     12 Angry Men             1957     9    792941
-#> 6 6     Schindler's List         1993     8.9 1357901
+#> 1 1     The Shawshank Redemption 1994     9.2 2691480
+#> 2 2     The Godfather            1972     9.2 1867146
+#> 3 3     The Dark Knight          2008     9   2665189
+#> 4 4     The Godfather: Part II   1974     9   1276943
+#> 5 5     12 Angry Men             1957     9    795129
+#> 6 6     Schindler's List         1993     8.9 1361148
#> # … with 244 more rows
                        @@ -483,7 +470,7 @@ Dynamic sites

                        It’s still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser including running all javascript. This functionality is not available at the time of writing, but it’s something we’re actively working on and should be available by the time you read this. It uses the chromote package which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons. Check out the rvest website for more details.

                        -
                        +

                        Summary

                        In this chapter, you’ve learned about the why, the why not, and the how of scraping data from web pages. First, you’ve learned about the basics of HTML and using CSS selectors to refer to specific elements, then you’ve learned about using the rvest package to get data out of HTML into R. We then demonstrated web scraping with two case studies: a simpler scenario on scraping data on StarWars films from the rvest package website and a more complex scenario on scraping the top 250 films from IMDB.

diff --git a/oreilly/workflow-basics.html b/oreilly/workflow-basics.html
index 42aa69e..a7192ca 100644
--- a/oreilly/workflow-basics.html
+++ b/oreilly/workflow-basics.html
@@ -119,7 +119,7 @@ Calling functions
                        -
                        +

                        Exercises

1.

@@ -153,7 +153,7 @@ ggsave(filename = "mpg-plot.png", plot = my_bar_plot)
                        -
                        +

                        Summary

                        Now that you’ve learned a little more about how R code works, and some tips to help you understand your code when you come back to it in the future. In the next chapter, we’ll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it’s selecting important variables, filtering down to rows of interest, or computing summary statistics.

diff --git a/oreilly/workflow-help.html b/oreilly/workflow-help.html
index 8c8d5d6..3e32922 100644
--- a/oreilly/workflow-help.html
+++ b/oreilly/workflow-help.html
@@ -62,7 +62,7 @@ Investing in yourself

                        If you’re an active Twitter user, you might also want to follow Hadley (@hadleywickham), Mine (@minebocek), Garrett (@statgarrett), or follow @rstudiotips to keep up with new features in the IDE. If you want the full fire hose of new developments, you can also read the (#rstats) hashtag. This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.

                        -
                        +

                        Summary

This chapter concludes the Whole Game part of the book. You've now seen the most important parts of the data science process: visualization, transformation, tidying, and importing. Now that you've got a holistic view of the whole process, we'll start to get into the details of the individual pieces.

diff --git a/oreilly/workflow-pipes.html b/oreilly/workflow-pipes.html
index bd2270f..53d2d1a 100644
--- a/oreilly/workflow-pipes.html
+++ b/oreilly/workflow-pipes.html
@@ -50,7 +50,7 @@ flights3 <- summarize(flight2,

                        -magrittr and the%>% pipe

                        +magrittr and the %>% pipe

                        If you’ve been using the tidyverse for a while, you might be familiar with the %>% pipe provided by the magrittr package. The magrittr package is included in the core tidyverse, so you can use %>% whenever you load the tidyverse:

                        library(tidyverse)
                        @@ -70,7 +70,7 @@ mtcars %>%
                         
                         

-|> vs. %>%
+|> vs. %>%

                        While |> and %>% behave identically for simple cases, there are a few crucial differences. These are most likely to affect you if you’re a long-term user of %>% who has taken advantage of some of the more advanced features. But they’re still good to know about even if you’ve never used %>% because you’re likely to encounter some of them when reading wild-caught code.

                        • By default, the pipe passes the object on its left-hand side to the first argument of the function on the right-hand side. %>% allows you to change the placement with a . placeholder. For example, x %>% f(1) is equivalent to f(x, 1) but x %>% f(1, .) is equivalent to f(1, x). R 4.2.0 added a _ placeholder to the base pipe, with one additional restriction: the argument has to be named. For example, x |> f(1, y = _) is equivalent to f(1, y = x).

•

@@ -89,7 +89,7 @@ mtcars %>%

-|> vs +
+|> vs +

                          Sometimes we’ll turn the end of a data transformation pipeline into a plot. Watch for the transition from |> to +. We wish this transition wasn’t necessary, but unfortunately, ggplot2 was created before the pipe was discovered.

                          @@ -100,7 +100,7 @@ mtcars %>%
                          -
                          +

                          Summary

                          In this chapter, you’ve learned more about the pipe: why we recommend it and some of the history that lead to |>. The pipe is important because you’ll use it again and again throughout your analysis, but hopefully, it will quickly become invisible, and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it.

diff --git a/oreilly/workflow-scripts.html b/oreilly/workflow-scripts.html
index fe42de8..cda223a 100644
--- a/oreilly/workflow-scripts.html
+++ b/oreilly/workflow-scripts.html
@@ -116,7 +116,12 @@ What is the source of truth?

                      We collectively use this pattern hundreds of times a week.

RStudio server

-If you're using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you're closing R, but the server actually keeps it running in the background. The next time you return, you'll be in exactly the same place you left. This makes it even more important to regularly restart R so that you're starting with a fresh slate.
+If you're using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you're closing R, but the server actually keeps it running in the background. The next time you return, you'll be in exactly the same place you left. This makes it even more important to regularly restart R so that you're starting with a fresh slate.
                      @@ -196,28 +201,21 @@ Relative and absolute paths
-Summary
-
-In summary, scripts and projects give you a solid workflow that will serve you well in the future:
-
-• Create one RStudio project for each data analysis project.
-• Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.
-• Only ever use relative paths, not absolute paths.
-
-Then everything you need is in one place and cleanly separated from all the other projects that you are working on.

                      Exercises

                      1. Go to the RStudio Tips Twitter account, https://twitter.com/rstudiotips and find one tip that looks interesting. Practice using it!

                      2. What other common mistakes will RStudio diagnostics report? Read https://support.posit.co/hc/en-us/articles/205753617-Code-Diagnostics to find out.

                      -
                      +

                      Summary

-In this chapter, you've learned how to organize your R code in scripts (files) and projects (directories). Much like code style, this may feel like busywork at first. But as you accumulate more code across multiple projects, you'll learn to appreciate how a little up front organisation can save you a bunch of time down the road.
-
-Next up, you'll learn about how to get help and how to ask good coding questions.
+In summary, scripts and projects give you a solid workflow that will serve you well in the future:
+
+• Create one RStudio project for each data analysis project.
+• Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.
+• Only ever use relative paths, not absolute paths.
+
+Then everything you need is in one place and cleanly separated from all the other projects that you are working on. Next up, you'll learn about how to get help and how to ask good coding questions.

diff --git a/oreilly/workflow-style.html b/oreilly/workflow-style.html
index fbdf5c9..6812140 100644
--- a/oreilly/workflow-style.html
+++ b/oreilly/workflow-style.html
@@ -153,7 +153,7 @@ ggplot2
    span = 0.5, se = FALSE,
    color = "white",
-    size = 4
+    linewidth = 4
  ) +
  geom_point()

@@ -179,7 +179,7 @@ Sectioning comments
                      -
                      +

                      Exercises

1.

@@ -192,7 +192,7 @@ flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>0900,s
                      -
                      +

                      Summary

                      In this chapter, you’ve learn the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, you’ll see how important a consistent style is. And don’t forget about the styler package: it’s a great way to quickly improve the quality of poorly styled code.

diff --git a/workflow-scripts.qmd b/workflow-scripts.qmd
index 09fe42a..f3155be 100644
--- a/workflow-scripts.qmd
+++ b/workflow-scripts.qmd
@@ -335,16 +335,6 @@ Mac and Linux uses slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslas
R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.

-## Summary
-
-In summary, scripts and projects give you a solid workflow that will serve you well in the future:
-
-- Create one RStudio project for each data analysis project.
-- Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.
-- Only ever use relative paths, not absolute paths.
-
-Then everything you need is in one place and cleanly separated from all the other projects that you are working on.
-
## Exercises

1. Go to the RStudio Tips Twitter account, and find one tip that looks interesting.

@@ -355,8 +345,11 @@ Then everything you need is in one place and cleanly separated from all the othe

## Summary

-In this chapter, you've learned how to organize your R code in scripts (files) and projects (directories).
-Much like code style, this may feel like busywork at first.
-But as you accumulate more code across multiple projects, you'll learn to appreciate how a little up front organisation can save you a bunch of time down the road.
+In summary, scripts and projects give you a solid workflow that will serve you well in the future:
+
+- Create one RStudio project for each data analysis project.
+- Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.
+- Only ever use relative paths, not absolute paths.
+
+Then everything you need is in one place and cleanly separated from all the other projects that you are working on.
Next up, you'll learn about how to get help and how to ask good coding questions.