Switch from I to we

Fixes #642
Hadley Wickham 2022-08-09 11:43:12 -05:00
parent c6b1f501c2
commit 1d0902c9bf
22 changed files with 145 additions and 152 deletions

EDA.qmd

@ -65,7 +65,7 @@ You can loosely word these questions as:
2. What type of covariation occurs between my variables?
The rest of this chapter will look at these two questions.
I'll explain what variation and covariation are, and I'll show you several ways to answer each question.
We'll explain what variation and covariation are, and we'll show you several ways to answer each question.
To make the discussion easier, let's define some terms:
- A **variable** is a quantity, quality, or property that you can measure.
@ -75,7 +75,7 @@ To make the discussion easier, let's define some terms:
- An **observation** is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object).
An observation will contain several values, each associated with a different variable.
I'll sometimes refer to an observation as a data point.
We'll sometimes refer to an observation as a data point.
- **Tabular data** is a set of values, each associated with a variable and an observation.
Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.
@ -166,7 +166,7 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
```
If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` instead of `geom_histogram()`.
If you wish to overlay multiple histograms in the same plot, we recommend using `geom_freqpoly()` instead of `geom_histogram()`.
`geom_freqpoly()` performs the same calculation as `geom_histogram()`, but instead of displaying the counts with bars, it uses lines.
It's much easier to understand overlapping lines than bars.
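For example, a minimal sketch, assuming the `smaller` subset of `diamonds` created earlier in the chapter:

```{r}
#| eval: false
# One frequency polygon per cut; overlapping lines are easier to read than bars
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)
```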
@ -190,7 +190,7 @@ There are a few challenges with this type of plot, which we will come back to in
Now that you can visualize variation, what should you look for in your plots?
And what type of follow-up questions should you ask?
I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information.
We've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information.
The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).
### Typical values
@ -354,10 +354,10 @@ If you've encountered unusual values in your dataset, and simply want to move on
filter(between(y, 3, 20))
```
I don't recommend this option because just because one measurement is invalid, doesn't mean all the measurements are.
We don't recommend this option: just because one measurement is invalid doesn't mean all the measurements are.
Additionally, if you have low quality data, by the time you've applied this approach to every variable you might find that you don't have any data left!
2. Instead, I recommend replacing the unusual values with missing values.
2. Instead, we recommend replacing the unusual values with missing values.
The easiest way to do this is to use `mutate()` to replace the variable with a modified copy.
You can use the `if_else()` function to replace unusual values with `NA`:
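A sketch of that approach, assuming the unusual `y` values identified earlier (recent versions of dplyr accept a bare `NA` in `if_else()`):

```{r}
#| eval: false
# Replace implausible widths with missing values instead of dropping whole rows
diamonds2 <- diamonds |>
  mutate(y = if_else(y < 3 | y > 20, NA, y))
```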
@ -936,7 +936,7 @@ ggplot(faithful, aes(eruptions)) +
Sometimes we'll turn the end of a pipeline of data transformation into a plot.
Watch for the transition from `|>` to `+`.
I wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered.
We wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered.
```{r}
#| eval: false
@ -955,11 +955,7 @@ diamonds |>
## Learning more
If you want to learn more about the mechanics of ggplot2, I'd highly recommend reading the [ggplot2 book](https://ggplot2-book.org).
It's been recently updated and has much more space to explore all the facets of visualization.
If you want to learn more about the mechanics of ggplot2, we highly recommend reading the [ggplot2 book](https://ggplot2-book.org).
Another useful resource is the [*R Graphics Cookbook*](https://r-graphics.org) by Winston Chang.
Another useful resource is the [*R Graphics Cookbook*](https://www.amazon.com/Graphics-Cookbook-Practical-Recipes-Visualizing/dp/1449316956) by Winston Chang.
Much of the contents are available online at <http://www.cookbook-r.com/Graphs/>.
I also recommend [*Graphical Data Analysis with R*](https://www.amazon.com/Graphical-Data-Analysis-Chapman-Hall/dp/1498715230), by Antony Unwin.
This is a book-length treatment similar to the material covered in this chapter, but has the space to go into much greater depth.
<!--# TODO: add Claus + Kieran books -->


@ -19,9 +19,9 @@ To help others quickly build up a good mental model of the data, you will need t
In this chapter, you'll learn some of the tools that ggplot2 provides to do so.
This chapter focuses on the tools you need to create good graphics.
I assume that you know what you want, and just need to know how to do it.
For that reason, I highly recommend pairing this chapter with a good general visualisation book.
I particularly like [*The Truthful Art*](https://www.amazon.com/gp/product/0321934075/), by Albert Cairo.
We assume that you know what you want, and just need to know how to do it.
For that reason, we highly recommend pairing this chapter with a good general visualisation book.
We particularly like [*The Truthful Art*](https://www.amazon.com/gp/product/0321934075/), by Albert Cairo.
It doesn't teach the mechanics of creating visualisations, but instead focuses on what you need to think about in order to create effective graphics.
### Prerequisites
@ -165,7 +165,7 @@ ggplot(mpg, aes(displ, hwy)) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
```
Note another handy technique used here: I added a second layer of large, hollow points to highlight the points that I've labelled.
Note another handy technique used here: we added a second layer of large, hollow points to highlight the labelled points.
You can sometimes use the same idea to replace the legend with labels placed directly on the plot.
It's not wonderful for this plot, but it isn't too bad.
@ -221,7 +221,7 @@ ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = label), data = label, vjust = "top", hjust = "right")
```
In these examples, I manually broke the label up into lines using `"\n"`.
In these examples, we manually broke the label up into lines using `"\n"`.
Another approach is to use `stringr::str_wrap()` to automatically add line breaks, given the number of characters you want per line:
```{r}
@ -263,7 +263,7 @@ Remember, in addition to `geom_text()`, you have many other geoms in ggplot2 ava
A few ideas:
- Use `geom_hline()` and `geom_vline()` to add reference lines.
I often make them thick (`size = 2`) and white (`colour = white`), and draw them underneath the primary data layer.
We often make them thick (`size = 2`) and white (`colour = white`), and draw them underneath the primary data layer.
That makes them easy to see, without drawing attention away from the data.
- Use `geom_rect()` to draw a rectangle around points of interest.
@ -699,7 +699,7 @@ file.remove("my-plot.pdf")
If you don't specify the `width` and `height` they will be taken from the dimensions of the current plotting device.
For reproducible code, you'll want to specify them.
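For instance, a minimal sketch with explicit dimensions (in inches; file name as used above):

```{r}
#| eval: false
ggsave("my-plot.pdf", width = 6, height = 4)
```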
Generally, however, I think you should be assembling your final reports using R Markdown, so I want to focus on the important code chunk options that you should know about for graphics.
Generally, however, we recommend that you assemble your final reports using R Markdown, so we focus on the important code chunk options that you should know about for graphics.
You can learn more about `ggsave()` in the documentation.
### Figure sizing
@ -710,18 +710,20 @@ The biggest challenge of graphics in R Markdown is getting your figures the righ
There are five main options that control figure sizing: `fig.width`, `fig.height`, `fig.asp`, `out.width` and `out.height`.
Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).
I only ever use three of the five options:
<!-- TODO: https://www.tidyverse.org/blog/2020/08/taking-control-of-plot-scaling/ -->
- I find it most aesthetically pleasing for plots to have a consistent width.
To enforce this, I set `fig.width = 6` (6") and `fig.asp = 0.618` (the golden ratio) in the defaults.
Then in individual chunks, I only adjust `fig.asp`.
We recommend using only three of the five options:
- I control the output size with `out.width` and set it to a percentage of the line width.
I default to `out.width = "70%"` and `fig.align = "center"`.
- Plots tend to be more aesthetically pleasing if they have consistent width.
To enforce this, set `fig.width = 6` (6") and `fig.asp = 0.618` (the golden ratio) in the defaults.
Then in individual chunks, only adjust `fig.asp`.
- Control the output size with `out.width` and set it to a percentage of the line width.
We suggest `out.width = "70%"` and `fig.align = "center"`.
That gives plots room to breathe, without taking up too much space.
- To put multiple plots in a single row I set the `out.width` to `50%` for two plots, `33%` for 3 plots, or `25%` to 4 plots, and set `fig.align = "default"`.
Depending on what I'm trying to illustrate (e.g. show data or show plot variations), I'll also tweak `fig.width`, as discussed below.
- To put multiple plots in a single row, set the `out.width` to `50%` for two plots, `33%` for three plots, or `25%` for four plots, and set `fig.align = "default"`.
Depending on what you're trying to illustrate (e.g. show data or show plot variations), you might also tweak `fig.width`, as discussed below.
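Putting those defaults together, a sketch of a setup chunk using knitr's option names:

```{r}
#| eval: false
knitr::opts_chunk$set(
  fig.width = 6,       # consistent plot width
  fig.asp = 0.618,     # the golden ratio
  out.width = "70%",   # room to breathe
  fig.align = "center"
)
```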
If you find that you're having to squint to read the text in your plot, you need to tweak `fig.width`.
If `fig.width` is larger than the size the figure is rendered in the final doc, the text will be too small; if `fig.width` is smaller, the text will be too big.
@ -760,7 +762,7 @@ For example, if your default `fig.width` is 6 and `out.width` is 0.7, when you s
### Other important options
When mingling code and text, like I do in this book, I recommend setting `fig.show = "hold"` so that plots are shown after the code.
When mingling code and text, like in this book, you can set `fig.show = "hold"` so that plots are shown after the code.
This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.
To add a caption to the plot, use `fig.cap`.


@ -15,7 +15,7 @@ R has several systems for making graphs, but ggplot2 is one of the most elegant
ggplot2 implements the **grammar of graphics**, a coherent system for describing and building graphs.
With ggplot2, you can do more faster by learning one system and applying it in many places.
If you'd like to learn more about the theoretical underpinnings of ggplot2, I recommend reading "The Layered Grammar of Graphics", <http://vita.had.co.nz/papers/layered-grammar.pdf>.
If you'd like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading "The Layered Grammar of Graphics", <http://vita.had.co.nz/papers/layered-grammar.pdf>, the scientific paper that describes the theory in detail.
### Prerequisites
@ -91,7 +91,7 @@ Does this confirm or refute your hypothesis about fuel efficiency and engine siz
With ggplot2, you begin a plot with the function `ggplot()`.
`ggplot()` creates a coordinate system that you can add layers to.
The first argument of `ggplot()` is the dataset to use in the graph.
So `ggplot(data = mpg)` creates an empty graph, but it's not very interesting so I'm not going to show it here.
So `ggplot(data = mpg)` creates an empty graph, but it's not very interesting so we won't show it here.
You complete your graph by adding one or more layers to `ggplot()`.
The function `geom_point()` adds a layer of points to your plot, which creates a scatterplot.
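A minimal sketch of the template this describes:

```{r}
#| eval: false
ggplot(data = mpg) +                             # empty graph
  geom_point(mapping = aes(x = displ, y = hwy))  # add a layer of points
```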
@ -364,7 +364,7 @@ ggplot(shapes, aes(x, y)) +
As you start to run R code, you're likely to run into problems.
Don't worry --- it happens to everyone.
I have been writing R code for years, and every day I still write code that doesn't work!
We have all been writing R code for years, but every day we still write code that doesn't work!
Start by carefully comparing the code that you're running to the code in the book.
R is extremely picky, and a misplaced character can make all the difference.
@ -728,7 +728,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
3. What does `show.legend = FALSE` do?
What happens if you remove it?\
Why do you think I used it earlier in the chapter?
Why do you think we used it earlier in the chapter?
4. What does the `se` argument to `geom_smooth()` do?
@ -862,7 +862,7 @@ This means that you can typically use geoms without worrying about the underlyin
However, there are three reasons why you might need to use a stat explicitly:
1. You might want to override the default stat.
In the code below, I change the stat of `geom_bar()` from count (the default) to identity.
In the code below, we change the stat of `geom_bar()` from count (the default) to identity.
This lets us map the height of the bars to the raw values of a $y$ variable.
Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows.
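A sketch of the code referred to above, using a small hand-made table of pre-counted values:

```{r}
#| eval: false
demo <- tribble(
  ~cut,         ~freq,
  "Fair",        1610,
  "Good",        4906,
  "Very Good",  12082
)

ggplot(demo) +
  geom_bar(aes(x = cut, y = freq), stat = "identity")  # bar height = raw value
```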


@ -130,7 +130,7 @@ dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)
```
If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
If you're using duckdb in a real project, we highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it in to R.
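For example, a sketch assuming a hypothetical `flights.csv` on disk:

```{r}
#| eval: false
# Load the CSV straight into duckdb as a table named "flights"
duckdb_read_csv(con, "flights", "flights.csv")
```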
## DBI basics
@ -159,7 +159,7 @@ con |>
as_tibble()
```
`dbReadTable()` returns a `data.frame` so I use `as_tibble()` to convert it into a tibble so that it prints nicely.
`dbReadTable()` returns a `data.frame` so we use `as_tibble()` to convert it into a tibble so that it prints nicely.
In real life, it's rare that you'll use `dbReadTable()` because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.
@ -255,7 +255,7 @@ Then, once you're ready to analyse the data with functions that are unique to R,
## SQL
The rest of the chapter will teach you a little SQL through the lens of dbplyr.
It's a rather non-traditional introduction to SQL but I hope it will get you quickly up to speed with the basics.
It's a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics.
Luckily, if you understand dplyr you're in a great place to quickly pick up SQL because so many of the concepts are the same.
We'll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: `flights` and `planes`.
@ -446,7 +446,7 @@ flights |>
summarise(delay = mean(arr_delay))
```
If you want to learn more about how NULLs work, I recommend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
If you want to learn more about how NULLs work, you might enjoy "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
In general, you can work with `NULL`s using the functions you'd use for `NA`s in R:
@ -674,7 +674,7 @@ dbplyr's translations are certainly not perfect, and there are many R functions
### Learning more
If you've finished this chapter and would like to learn more about SQL,
I have two recommendations:
we have two recommendations:
- [*SQL for Data Scientists*](https://sqlfordatascientists.com) by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organisations.
- [*Practical SQL*](https://www.practicalsql.com) by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.


@ -19,7 +19,7 @@ To warm up, try these three seemingly simple questions:
- Does every day have 24 hours?
- Does every minute have 60 seconds?
I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year?
We're sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year?
(It has three parts.) You might have remembered that many parts of the world use daylight savings time (DST), so that some days have 23 hours, and others have 25.
You might not have known that some minutes have 61 seconds: every now and then leap seconds are added because the Earth's rotation is gradually slowing down.
@ -53,7 +53,7 @@ There are three types of date/time data that refer to an instant in time:
- A **date-time** is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second).
Tibbles print this as `<dttm>`.
Elsewhere in R these are called POSIXct, but I don't think that's a very useful name.
Elsewhere in R these are called POSIXct, but that's not a very useful name.
In this chapter we are only going to focus on dates and date-times as R doesn't have a native class for storing times.
If you need one, you can use the **hms** package.
@ -135,7 +135,7 @@ flights |>
Let's do the same thing for each of the four time columns in `flights`.
The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components.
Once I've created the date-time variables, I focus in on the variables we'll explore in the rest of the chapter.
Once we've created the date-time variables, we focus in on the variables we'll explore in the rest of the chapter.
```{r}
make_datetime_100 <- function(year, month, day, time) {
@ -155,7 +155,7 @@ flights_dt <- flights |>
flights_dt
```
With this data, I can visualise the distribution of departure times across the year:
With this data, we can visualise the distribution of departure times across the year:
```{r}
flights_dt |>


@ -12,7 +12,7 @@ status("complete")
Factors are used for categorical variables, variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.
If you want to learn more about factors after reading this chapter, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
If you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
@ -109,7 +109,7 @@ levels(f2)
For the rest of this chapter, we're going to use `forcats::gss_cat`.
It's a sample of data from the [General Social Survey](http://gss.norc.org), a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.
The survey has thousands of questions, so in `gss_cat` Hadley selected a handful that will illustrate some common challenges you'll encounter when working with factors.
```{r}
gss_cat
@ -193,7 +193,7 @@ ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and Hinduism & Other Eastern religions watch much less.
As you start making more complicated transformations, I'd recommend moving them out of `aes()` and into a separate `mutate()` step.
As you start making more complicated transformations, we recommend moving them out of `aes()` and into a separate `mutate()` step.
For example, you could rewrite the plot above as:
```{r}
@ -425,6 +425,6 @@ In practice, `ordered()` factors behave very similarly to regular factors.
There are only two places where you might notice different behavior:
- If you map an ordered factor to color or fill in ggplot2, it will default to `scale_color_viridis()`/`scale_fill_viridis()`, a color scale that implies a ranking.
- If you use an ordered function in a linear model, it will use "polygonal contrasts". These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don't routinely interpret them. If you want to learn more, I'd recommend `vignette("contrasts", package = "faux")` by Lisa DeBruine.
- If you use an ordered factor in a linear model, it will use "polynomial contrasts". These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don't routinely interpret them. If you want to learn more, we recommend `vignette("contrasts", package = "faux")` by Lisa DeBruine.
Given the arguable utility of these differences, we don't generally recommend using ordered factors.


@ -19,7 +19,7 @@ Writing a function has three big advantages over using copy-and-paste:
3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).
Writing good functions is a lifetime journey.
Even after using R for many years I still learn new techniques and better ways of approaching old problems.
Even after using R for many years we still learn new techniques and better ways of approaching old problems.
The goal of this chapter is not to teach you every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately.
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code.
@ -58,7 +58,7 @@ df$d <- (df$d - min(df$d, na.rm = TRUE)) /
You might be able to puzzle out that this rescales each column to have a range from 0 to 1.
But did you spot the mistake?
I made an error when copying-and-pasting the code for `df$b`: I forgot to change an `a` to a `b`.
Hadley made an error when copying-and-pasting the code for `df$b`: he forgot to change an `a` to a `b`.
Extracting repeated code out into a function is a good idea because it prevents you from making this type of mistake.
To write a function you need to first analyse the code.
@ -73,7 +73,7 @@ How many inputs does it have?
This code only has one input: `df$a`.
(If you're surprised that `TRUE` is not an input, you can explore why in the exercise below.) To make the inputs more clear, it's a good idea to rewrite the code using temporary variables with general names.
Here this code only requires a single numeric vector, so I'll call it `x`:
Here this code only requires a single numeric vector, so we'll call it `x`:
```{r}
x <- df$a
@ -89,7 +89,7 @@ rng <- range(x, na.rm = TRUE)
```
Pulling out intermediate calculations into named variables is a good practice because it makes it more clear what the code is doing.
Now that I've simplified the code, and checked that it still works, I can turn it into a function:
Now that we've simplified the code, and checked that it still works, we can turn it into a function:
```{r}
rescale01 <- function(x) {
@ -102,7 +102,7 @@ rescale01(c(0, 5, 10))
There are three key steps to creating a new function:
1. You need to pick a **name** for the function.
Here I've used `rescale01` because this function rescales a vector to lie between 0 and 1.
Here we used `rescale01` because this function rescales a vector to lie between 0 and 1.
2. You list the inputs, or **arguments**, to the function inside `function`.
Here we have just one argument.
@ -110,7 +110,7 @@ There are three key steps to creating a new function:
3. You place the code you have developed in the **body** of the function, a `{` block that immediately follows `function(...)`.
Note the overall process: I only made the function after I'd figured out how to make it work with a simple input.
Note the overall process: we only made the function after we'd figured out how to make it work with a simple input.
It's easier to start with working code and turn it into a function; it's harder to create a function and then try to make it work.
At this point it's a good idea to check your function with a few different inputs:
@ -234,7 +234,7 @@ impute_missing()
collapse_years()
```
If your function name is composed of multiple words, I recommend using "snake_case", where each lowercase word is separated by an underscore.
If your function name is composed of multiple words, we recommend using "snake_case", where each lowercase word is separated by an underscore.
camelCase is a popular alternative.
It doesn't really matter which one you pick, the important thing is to be consistent: pick one or the other and stick with it.
R itself is not very consistent, but there's nothing you can do about that.
@ -471,7 +471,7 @@ y <- 10
x <- if (y < 20) "Too low" else "Too high"
```
I recommend this only for very brief `if` statements.
We recommend this only for very brief `if` statements.
Otherwise, the full form is easier to read:
```{r}
@ -517,7 +517,7 @@ if (y < 20) {
}
```
How would you change the call to `cut()` if I'd used `<` instead of `<=`?
How would you change the call to `cut()` if we'd used `<` instead of `<=`?
What is the other chief advantage of `cut()` for this problem?
(Hint: what happens if you have many values in `temp`?)
@ -661,7 +661,7 @@ wt_mean <- function(x, w) {
Be careful not to take this too far.
There's a tradeoff between how much time you spend making your function robust, versus how long you spend writing it.
For example, if you also added a `na.rm` argument, I probably wouldn't check it carefully:
For example, if you also added a `na.rm` argument, you don't need to check it carefully:
```{r}
wt_mean <- function(x, w, na.rm = FALSE) {
@ -721,7 +721,7 @@ This special argument captures any number of arguments that aren't otherwise mat
It's useful because you can then send those `...` on to another function.
This is a useful catch-all if your function primarily wraps another function.
For example, I commonly create these helper functions that wrap around `str_c()`:
For example, Hadley often creates these helper functions that wrap around `str_c()`:
```{r}
commas <- function(...) stringr::str_c(..., collapse = ", ")
@ -735,7 +735,7 @@ rule <- function(..., pad = "-") {
rule("Important output")
```
Here `...` lets me forward on any arguments that I don't want to deal with to `str_c()`.
Here `...` lets you forward on any extra arguments to `str_c()`.
It's a very convenient technique.
But it does come at a price: any misspelled arguments will not raise an error.
This makes it easy for typos to go unnoticed:
@ -782,7 +782,7 @@ There are two things you should consider when returning a value:
### Explicit return statements
The value returned by the function is usually the last statement it evaluates, but you can choose to return early by using `return()`.
I think it's best to save the use of `return()` to signal that you can return early with a simpler solution.
We think it's best to save the use of `return()` to signal that you can return early with a simpler solution.
A common reason to do this is because the inputs are empty:
```{r}


@ -215,12 +215,12 @@ for (i in seq_along(df)) {
```
Typically you'll be modifying a list or data frame with this sort of loop, so remember to use `[[`, not `[`.
You might have spotted that I used `[[` in all my for loops: I think it's better to use `[[` even for atomic vectors because it makes it clear that I want to work with a single element.
You might have spotted that we used `[[` in all our for loops: we think it's better to use `[[` even for atomic vectors because it makes it clear that you want to work with a single element.
### Looping patterns
There are three basic ways to loop over a vector.
So far I've shown you the most general: looping over the numeric indices with `for (i in seq_along(xs))`, and extracting the value with `x[[i]]`.
So far we've shown you the most general: looping over the numeric indices with `for (i in seq_along(xs))`, and extracting the value with `x[[i]]`.
There are two other forms:
1. Loop over the elements: `for (x in xs)`.
@ -281,7 +281,7 @@ str(out)
str(unlist(out))
```
Here I've used `unlist()` to flatten a list of vectors into a single vector.
Here we've used `unlist()` to flatten a list of vectors into a single vector.
A stricter option is to use `purrr::flatten_dbl()` --- it will throw an error if the input isn't a list of doubles.
This pattern occurs in other places too:
@ -348,7 +348,7 @@ while (nheads < 3) {
flips
```
I mention while loops only briefly, because I hardly ever use them.
We mention while loops only briefly, because we hardly ever use them.
They're most often used for simulation, which is outside the scope of this book.
However, it is good to know they exist so that you're prepared for problems where the number of iterations is not known in advance.
@ -370,12 +370,12 @@ However, it is good to know they exist so that you're prepared for problems wher
show_mean(mpg)
#> displ: 3.47
#> year: 2004
#> cyl: 5.89
#> cty: 16.86
```
(Extra challenge: what function did I use to make sure that the numbers lined up nicely, even though the variable names had different lengths?)
(Extra challenge: what function did we use to make sure that the numbers lined up nicely, even though the variable names had different lengths?)
4. What does this code do?
How does it work?
@ -591,7 +591,7 @@ models <- mtcars |>
map(~lm(mpg ~ wt, data = .x))
```
Here I've used `.x` as a pronoun: it refers to the current list element (in the same way that `i` referred to the current index in the for loop).
Here we've used `.x` as a pronoun: it refers to the current list element (in the same way that `i` referred to the current index in the for loop).
`.x` in a one-sided formula corresponds to an argument in an anonymous function.
When you're looking at many models, you might want to extract a summary statistic like the $R^2$.
@ -649,7 +649,7 @@ If you're familiar with the apply family of functions in base R, you might have
The only problem with `vapply()` is that it's a lot of typing: `vapply(df, is.numeric, logical(1))` is equivalent to `map_lgl(df, is.numeric)`.
One advantage of `vapply()` over purrr's map functions is that it can also produce matrices --- the map functions only ever produce vectors.
I focus on purrr functions here because they have more consistent names and arguments, helpful shortcuts, and in the future will provide easy parallelism and progress bars.
We focus on purrr functions here because they have more consistent names and arguments, helpful shortcuts, and in the future will provide easy parallelism and progress bars.
### Exercises
@ -853,7 +853,7 @@ params |>
pmap(rnorm)
```
As soon as your code gets complicated, I think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns.
As soon as your code gets complicated, we think a data frame is a good approach because it ensures that each column has a name and is the same length as all the other columns.
### Invoking different functions


@ -109,7 +109,7 @@ knitr::include_graphics("diagrams/relational-nycflights.png")
What variables would you need?
What data frames would you need to combine?
2. I forgot to draw the relationship between `weather` and `airports`.
2. We forgot to draw the relationship between `weather` and `airports`.
What is the relationship and how should it appear in the diagram?
3. `weather` only contains information for the origin (NYC) airports.
@ -227,7 +227,7 @@ flights2 |>
```
The result of joining airlines to flights2 is an additional variable: `name`.
This is why I call this type of join a mutating join.
This is why we call this type of join a mutating join.
In this case, you could get the same result using `mutate()` and a pair of base R functions, `[` and `match()`:
```{r}
@ -248,7 +248,7 @@ Finally, you'll learn how to tell dplyr which variables are the keys for a given
## Join types
To help you learn how joins work, I'm going to use a visual representation:
To help you learn how joins work, we'll use a visual representation:
```{r}
#| echo: false
@ -278,7 +278,7 @@ y <- tribble(
The coloured column represents the "key" variable: these are used to match the rows between the data frames.
The grey column represents the "value" column that is carried along for the ride.
In these examples I'll show a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.
In these examples we've shown a single key variable, but the idea generalises in a straightforward way to multiple keys and multiple values.
A join is a way of connecting each row in `x` to zero, one, or more rows in `y`.
The following diagram shows each potential match as an intersection of a pair of lines.
@ -430,7 +430,7 @@ TODO: update for new warnings
knitr::include_graphics("diagrams/join-one-to-many.png")
```
Note that I've put the key column in a slightly different position in the output.
Note that we've put the key column in a slightly different position in the output.
This reflects that the key is a primary key in `y` and a foreign key in `x`.
```{r}


@ -265,7 +265,7 @@ flights |>
This code doesn't error but it also doesn't seem to have worked.
What's going on?
Here R first evaluates `month == 11` creating a logical vector, which I'll call `nov`.
Here R first evaluates `month == 11`, creating a logical vector, which we call `nov`.
It computes `nov | 12`.
When you use a number with a logical operator it converts everything apart from 0 to TRUE, so this is equivalent to `nov | TRUE` which will always be `TRUE`, so every row will be selected:
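To make the fix concrete, a short sketch:

```{r}
#| eval: false
flights |> filter(month == 11 | 12)           # wrong: 12 is always TRUE
flights |> filter(month == 11 | month == 12)  # what was intended
flights |> filter(month %in% c(11, 12))       # a more compact alternative
```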
@ -546,13 +546,13 @@ flights |>
## Making groups {#sec-groups-from-logical}
Before we move on to the next chapter, I want to show you one last trick.
I don't know exactly how to describe it, and it feels a little magical, but it's super handy so I wanted to make sure you knew about it.
Before we move on to the next chapter, we want to show you one last trick.
We don't know exactly how to describe it, and it feels a little magical, but it's super handy so we wanted to make sure you knew about it.
Sometimes you want to divide your dataset up into groups based on the occurrence of some event.
For example, when you're looking at website data, it's common to want to break up events into sessions, where a session is defined as a gap of more than x minutes since the last activity.
Here's some made up data that illustrates the problem.
I've computed the time lag between the events, and figured out if there's a gap that's big enough to qualify.
We've computed the time lag between the events, and figured out if there's a gap that's big enough to qualify.
```{r}
events <- tibble(
@ -566,7 +566,7 @@ events <- events |>
events
```
How do I go from that logical vector to something that I can `group_by()`?
How do we go from that logical vector to something that we can `group_by()`?
You can use the cumulative sum, `cumsum()`, to turn this logical vector into a unique group identifier.
Remember that whenever you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0, so taking the cumulative sum of a logical vector creates a numeric index that increments every time it sees a `TRUE`.
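A sketch of the trick, assuming the logical flag computed above lives in a (hypothetically named) `has_gap` column of `events`:

```{r}
#| eval: false
events |>
  group_by(group = cumsum(has_gap))  # new group id each time a gap occurs
```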


@ -37,7 +37,7 @@ This function is great for quick exploration and checks during analysis:
flights |> count(dest)
```
(Despite the advice in [Chapter -@sec-workflow-style], I usually put `count()` on a single line because I'm usually using it at the console for a quick check that my calculation is working as expected.)
(Despite the advice in @sec-workflow-style, we usually put `count()` on a single line because it's typically used at the console for a quick check that a calculation is working as expected.)
If you want to see the most common values add `sort = TRUE`:
@ -119,7 +119,7 @@ There are a couple of variants of `n()` that you might find useful:
Transformation functions work well with `mutate()` because their output is the same length as the input.
The vast majority of transformation functions are already built into base R.
It's impractical to list them all so this section will show the most useful ones.
As an example, while R provides all the trigonometric functions that you might dream of, I don't list them here because they're rarely needed for data science.
As an example, while R provides all the trigonometric functions that you might dream of, we don't list them here because they're rarely needed for data science.
### Arithmetic and recycling rules
@ -274,7 +274,7 @@ This a straight line because a little algebra reveals that `log(money) = log(sta
This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.
If you're log-transforming your data with dplyr you have a choice of three logarithms provided by base R: `log()` (the natural log, base e), `log2()` (base 2), and `log10()` (base 10).
I recommend using `log2()` or `log10()`.
We recommend using `log2()` or `log10()`.
`log2()` is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g.) 3 is 10\^3 = 1000.
The inverse of `log()` is `exp()`; to compute the inverse of `log2()` or `log10()` you'll need to use `2^` or `10^`.
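A quick numeric sketch of those interpretations:

```{r}
log2(c(1, 2, 4, 8))  # 0 1 2 3: each step of +1 doubles the original value
10^3                 # back-transforming 3 on the log10 scale gives 1000
```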
@ -623,7 +623,7 @@ flights |>
Sometimes you're not so interested in where the bulk of the data lies, but how it is spread out.
Two commonly used summaries are the standard deviation, `sd(x)`, and the inter-quartile range, `IQR()`.
I won't explain `sd()` here since you're probably already familiar with it, but `IQR()` might be new --- it's `quantile(x, 0.75) - quantile(x, 0.25)` and gives you the range that contains the middle 50% of the data.
We won't explain `sd()` here since you're probably already familiar with it, but `IQR()` might be new --- it's `quantile(x, 0.75) - quantile(x, 0.25)` and gives you the range that contains the middle 50% of the data.
We can use this to reveal a small oddity in the flights data.
You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place.
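A sketch of that check, assuming we summarise `distance` by origin and destination:

```{r}
#| eval: false
flights |>
  group_by(origin, dest) |>
  summarise(distance_iqr = IQR(distance), n = n(), .groups = "drop") |>
  filter(distance_iqr > 0)
```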
@ -725,7 +725,7 @@ flights |>
(These functions currently lack an `na.rm` argument but will hopefully be fixed by the time you read this book: <https://github.com/tidyverse/dplyr/issues/6242>).
If you're familiar with `[`, you might wonder if you ever need these functions.
I think there are main reasons: the `default` argument and the `order_by` argument.
There are two main reasons: the `default` argument and the `order_by` argument.
`default` allows you to set a default value that's used if the requested position doesn't exist, e.g. you're trying to get the 3rd element from a two element group.
`order_by` lets you locally override the existing ordering of the rows, so you can get the element at that position in the new ordering.
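A toy sketch of both arguments:

```{r}
x <- c(10, 20)
nth(x, 3, default = NA)       # position 3 doesn't exist, so the default is used
first(x, order_by = c(2, 1))  # under the new ordering, 20 comes first
```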


@ -63,7 +63,7 @@ Using parsers is mostly a matter of understanding what's available and how they
There are eight particularly important parsers:
1. `parse_logical()` and `parse_integer()` parse logicals and integers respectively.
There's basically nothing that can go wrong with these parsers so I won't describe them here further.
There's basically nothing that can go wrong with these parsers so we won't describe them here further.
2. `parse_double()` is a strict numeric parser, and `parse_number()` is a flexible numeric parser.
These are more complicated than you might expect because different parts of the world write numbers in different ways.
@ -180,8 +180,8 @@ guess_encoding(charToRaw(x2))
The first argument to `guess_encoding()` can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
Encodings are a rich and complex topic, and I've only scratched the surface here.
If you'd like to learn more I'd recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
Encodings are a rich and complex topic, and we've only scratched the surface here.
If you'd like to learn more we recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Factors {#sec-readr-factors}
@ -209,7 +209,7 @@ When called without any additional arguments:
parse_datetime("20101010")
```
This is the most important date/time standard, and if you work with dates and times frequently, I recommend reading <https://en.wikipedia.org/wiki/ISO_8601>
This is the most important date/time standard, and if you work with dates and times frequently, we recommend reading <https://en.wikipedia.org/wiki/ISO_8601>.
- `parse_date()` expects a four digit year, a `-` or `/`, the month, a `-` or `/`, then the day:
@ -300,7 +300,7 @@ parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
What happens to the default value of `grouping_mark` when you set `decimal_mark` to ","?
What happens to the default value of `decimal_mark` when you set the `grouping_mark` to "."?
3. I didn't discuss the `date_format` and `time_format` options to `locale()`.
3. We didn't discuss the `date_format` and `time_format` options to `locale()`.
What do they do?
Construct an example that shows when they might be useful.
@ -424,7 +424,7 @@ tail(challenge)
Every `parse_xyz()` function has a corresponding `col_xyz()` function.
You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
I highly recommend always supplying `col_types`, building up from the print-out provided by readr.
We highly recommend always supplying `col_types`, building up from the print-out provided by readr.
This ensures that you have a consistent and reproducible data import script.
If you rely on the default guesses and your data changes, readr will continue to read it in.
If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
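For example, a sketch for the `challenge` data used above, assuming its columns should be a double and a date:

```{r}
#| eval: false
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
stop_for_problems(challenge)  # error if anything failed to parse
```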


@ -162,7 +162,7 @@ Earlier in this chapter we talked about the use of parentheses for clarifying pr
You can also use parentheses to extract parts of a complex match.
For example, imagine we want to extract nouns from the sentences.
As a heuristic, we'll look for any word that comes after "a" or "the".
Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.
Defining a "word" in a regular expression is a little tricky, so here we use a simple approximation: a sequence of at least one character that isn't a space.
```{r}
noun <- "(a|the) ([^ ]+)"


@ -26,7 +26,7 @@ That means getting better at programming also involves getting better at communi
Over time, you want your code to become not just easier to write, but easier for others to read.
Writing code is similar in many ways to writing prose.
One parallel which I find particularly useful is that in both cases rewriting is the key to clarity.
One parallel which we find particularly useful is that in both cases rewriting is the key to clarity.
The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times.
After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done.
If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did.
@ -51,7 +51,7 @@ In the following four chapters, you'll learn skills that will allow you to both
## Learning more
The goal of these chapters is to teach you the minimum about programming that you need to practice data science, which turns out to be a reasonable amount.
Once you have mastered the material in this book, I strongly believe you should invest further in your programming skills.
Once you have mastered the material in this book, we strongly believe you should invest further in your programming skills.
Learning more about programming is a long-term investment: it won't pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.
To learn more you need to study R as a programming language, not just an interactive environment for data science.


@ -314,7 +314,7 @@ Because `unnest_longer()` can't find a common type of vector, it keeps the origi
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is still a list, and each component of that list contains something different.
What happens if you find this problem in a dataset you're trying to rectangle?
I think there are two basic options.
There are two basic options.
You could use the `transform` argument to coerce all inputs to a common type.
It's not particularly useful here because there's only really one class that these five classes can be converted to: character.
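A sketch of that option, assuming the mixed-type list-column is named `x` (hypothetical):

```{r}
#| eval: false
df |>
  unnest_longer(x, transform = as.character)  # coerce every element to character
```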
@ -346,7 +346,7 @@ df4 |>
tidyr has a few other useful rectangling functions that we're not going to cover in this book:
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's a great for rapid exploration, but I think it's ultimately a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
@ -380,7 +380,7 @@ We'll start by exploring `gh_repos`.
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
I call the column `json` for reasons we'll get to later.
We call the column `json` for reasons we'll get to later.
```{r}
repos <- tibble(json = gh_repos)
@ -438,7 +438,7 @@ repos |>
```
Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
Rather than following the advice to use `names_repair` (which would also work), I'll instead use `names_sep`:
Rather than following the advice to use `names_repair` (which would also work), we'll instead use `names_sep`:
```{r}
repos |>
@ -635,7 +635,7 @@ locations |>
unnest_wider(bounds)
```
I then rename `southwest` and `northeast` (the corners of the rectangle) so I can use `names_sep` to create short but evocative names:
We then rename `southwest` and `northeast` (the corners of the rectangle) so we can use `names_sep` to create short but evocative names:
```{r}
locations |>


@ -34,7 +34,7 @@ library(tidyverse)
It's worth noting that the regular expressions used by stringr are very slightly different to those of base R.
That's because stringr is built on top of the [stringi package](https://stringi.gagolewski.com), which is in turn built on top of the [ICU engine](https://unicode-org.github.io/icu/userguide/strings/regexp.html), whereas base R functions (like `gsub()` and `grepl()`) use either the [TRE engine](https://github.com/laurikari/tre) or the [PCRE engine](https://www.pcre.org).
Fortunately, the basics of regular expressions are so well established that you'll encounter few variations when working with the patterns you'll learn in this book (and I'll point them out where important).
Fortunately, the basics of regular expressions are so well established that you'll encounter few variations when working with the patterns you'll learn in this book (and we'll point them out where important).
You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the `(?…)` syntax.
You can learn more about these advanced features in `vignette("regular-expressions", package = "stringr")`.
Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
@ -57,10 +57,10 @@ Next you'll learn about **anchors**, which allow you to match the start or end o
Then you'll learn about **character classes** and their shortcuts, which allow you to match any character from a set.
We'll finish up with **quantifiers**, which control how many times a pattern can match, and **alternation**, which allows you to match either *this* or *that.*
The terms I use here are the technical names for each component.
The terms we use here are the technical names for each component.
They're not always the most evocative of their purpose, but it's very helpful to know the correct terms if you later want to Google for more details.
I'll concentrate on showing how these patterns work with `str_view()` and `str_view_all()` but remember that you can use them with any of the functions that you learned about in [Chapter -@sec-strings], i.e.:
We'll concentrate on showing how these patterns work with `str_view()` and `str_view_all()` but remember that you can use them with any of the functions that you learned about in [Chapter -@sec-strings], i.e.:
- `str_detect(x, pattern)` returns a logical vector the same length as `x`, indicating whether each element matches (`TRUE`) or doesn't match (`FALSE`) the pattern.
- `str_count(x, pattern)` returns the number of times `pattern` matches in each element of `x`.
@ -87,7 +87,7 @@ str_view(dot)
str_view(c("abc", "a.c", "bef"), "a\\.c")
```
In this book, I'll write regular expression as `\.` and strings that represent the regular expression as `"\\."`.
In this book, we'll write regular expressions as `\.` and the strings that represent them as `"\\."`.
If `\` is used as an escape character in regular expressions, how do you match a literal `\`?
Well you need to escape it, creating the regular expression `\\`.
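To see that in action (the regex is two characters, `\\`; the string that creates it needs four):

```{r}
x <- "a\\b"
str_view(x, "\\\\")  # matches the single literal backslash in x
```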
@ -125,7 +125,7 @@ str_view(x, "^a") # match "a" at start
str_view(x, "a$") # match "a" at end
```
To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
To remember which is which, try this mnemonic which Hadley learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
It's tempting to put `$` at the start, because that's how we write sums of money, but it's not what regular expressions want.
To force a regular expression to only match the full string, anchor it with both `^` and `$`:
@ -137,9 +137,9 @@ str_view(x, "^apple$")
```
You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
I don't often use this in my R code, but I'll sometimes use it when I'm doing a search in RStudio.
This is not that useful in R code, but it can be handy when searching in RStudio.
It's useful for finding a function whose name is a component of other functions' names.
For example, if I want to find all uses of `sum()`, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:
For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:
```{r}
x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
@ -266,7 +266,7 @@ But these tend to be less likely to cause confusion because they mostly behave h
6. Write the equivalents of `?`, `+`, `*` in `{m,n}` form.
7. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
7. Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)
a. `^.*$`
b. `"\\{.+\\}"`
@ -310,7 +310,7 @@ str_view(sentences, "^She|He|It|They\\b", match = TRUE)
```
A quick inspection of the results shows that we're getting some spurious matches.
That's because I've forgotten to use parentheses:
That's because we've forgotten to use parentheses:
```{r}
str_view(sentences, "^(She|He|It|They)\\b", match = TRUE)
@ -356,7 +356,7 @@ There's no "and" operator built in to regular expressions so we have to tackle i
words[str_detect(words, "a.*b|b.*a")]
```
I think its simpler to combine the results of two calls to `str_detect()`:
It's simpler to combine the results of two calls to `str_detect()`:
```{r}
words[str_detect(words, "a") & str_detect(words, "b")]
@ -490,7 +490,7 @@ sentences |>
head(10)
```
But I think you're generally better off using `str_match()` or `tidyr::separate_groups()`, which you'll learn about next.
But you're generally better off using `str_match()` or `tidyr::separate_groups()`, which you'll learn about next.
### Extracting groups
@ -503,7 +503,7 @@ sentences |>
head()
```
Instead I recommend using tidyr's `separate_groups()` which creates a column for each capturing group.
Instead, we recommend using tidyr's `separate_groups()` which creates a column for each capturing group.
### Named groups
@ -601,7 +601,7 @@ str_view_all(x, regex("^Line", multiline = TRUE))
Finally, if you're writing a complicated regular expression and you're worried you might not understand it in the future, `comments = TRUE` can be extremely useful.
It allows you to use comments and whitespace to make complex regular expressions more understandable.
Spaces and new lines are ignored, as is everything after `#`.
(Note that I'm using a raw string here to minimize the number of escapes needed)
(Note that we use a raw string here to minimize the number of escapes needed)
```{r}
phone <- regex(r"(


@ -254,7 +254,7 @@ knitr::include_graphics("screenshots/rmarkdown-shiny.png")
You can then refer to the values with `input$name` and `input$age`, and the code that uses them will be automatically re-run whenever they change.
I can't show you a live shiny app here because shiny interactions occur on the **server-side**.
We can't show you a live shiny app here because shiny interactions occur on the **server-side**.
This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on.
This introduces a logistical issue: Shiny apps need a Shiny server to be run online.
When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public-facing Shiny server if you want to publish this sort of interactivity online.
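Here's a minimal sketch of how the pieces fit together (assuming `runtime: shiny` is set in the YAML header; the greeting itself is illustrative):

```{r}
#| eval: false
library(shiny)

textInput("name", "What is your name?")
numericInput("age", "How old are you?", NA, min = 0, max = 150)

renderText({
  paste0("Hello ", input$name, "! You are ", input$age, " years old.")
})
```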
@ -298,14 +298,14 @@ You can also create your own by following the instructions at <http://rmarkdown.
## Learning more
To learn more about effective communication in these different formats I recommend the following resources:
To learn more about effective communication in these different formats, we recommend the following resources:
- To improve your presentation skills, I recommend [*Presentation Patterns*](https://amzn.com/0321820800), by Neal Ford, Matthew McCollough, and Nathaniel Schutta.
- To improve your presentation skills, try [*Presentation Patterns*](https://amzn.com/0321820800), by Neal Ford, Matthew McCollough, and Nathaniel Schutta.
It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.
- If you give academic talks, I recommend reading the [*Leek group guide to giving talks*](https://github.com/jtleek/talkguide).
- If you give academic talks, you might like the [*Leek group guide to giving talks*](https://github.com/jtleek/talkguide).
- I haven't taken it myself, but I've heard good things about Matt McGarrity's online course on public speaking: <https://www.coursera.org/learn/public-speaking>.
- We haven't taken it ourselves, but we've heard good things about Matt McGarrity's online course on public speaking: <https://www.coursera.org/learn/public-speaking>.
- If you are creating a lot of dashboards, make sure to read Stephen Few's [*Information Dashboard Design: The Effective Visual Communication of Data*](https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167).
It will help you create dashboards that are truly useful, not just pretty to look at.
@ -29,7 +29,7 @@ It:
A lab notebook helps you share with your colleagues or lab mates not only what you've done, but also why you did it.
Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks.
I've drawn on my own experiences and Colin Purrington's advice on lab notebooks (<http://colinpurrington.com/tips/lab-notebooks>) to come up with the following tips:
We've drawn on our own experiences and Colin Purrington's advice on lab notebooks (<http://colinpurrington.com/tips/lab-notebooks>) to come up with the following tips:
- Ensure each notebook has a descriptive title, an evocative filename, and a first paragraph that briefly describes the aims of the analysis.
@ -61,4 +61,4 @@ I've drawn on my own experiences and Colin Purrington's advice on lab notebooks
- You are going to create many, many, many analysis notebooks over the course of your career.
How are you going to organise them so you can find them again in the future?
I recommend storing them in individual projects, and coming up with a good naming scheme.
We recommend storing them in individual projects, and coming up with a good naming scheme.
@ -163,10 +163,10 @@ There are three ways to do so:
3. By manually typing the chunk delimiters ```` ```{r} ```` and ```` ``` ````.
Obviously, I'd recommend you learn the keyboard shortcut.
Obviously, we'd recommend you learn the keyboard shortcut.
It will save you a lot of time in the long run!
You can continue to run the code using the keyboard shortcut that by now (I hope!) you know and love: Cmd/Ctrl + Enter.
You can continue to run the code using the keyboard shortcut that by now (we hope!) you know and love: Cmd/Ctrl + Enter.
However, chunks get a new keyboard shortcut: Cmd/Ctrl + Shift + Enter, which runs all the code in the chunk.
Think of a chunk like a function.
A chunk should be relatively self-contained, and focussed around a single task.
@ -308,14 +308,14 @@ Then you can write:
As your caching strategies get progressively more complicated, it's a good idea to regularly clear out all your caches with `knitr::clean_cache()`.
I've used the advice of [David Robinson](https://twitter.com/drob/status/738786604731490304) to name these chunks: each chunk is named after the primary object that it creates.
We've followed the advice of [David Robinson](https://twitter.com/drob/status/738786604731490304) to name these chunks: each chunk is named after the primary object that it creates.
This makes it easier to understand the `dependson` specification.
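For example, a sketch of this naming scheme in action (the file and column names here are hypothetical):

```{r}
#| label: raw-data
#| cache: true
rawdata <- readr::read_csv("a_very_large_file.csv")
```

```{r}
#| label: processed-data
#| cache: true
#| dependson: "raw-data"
processed_data <- rawdata |> 
  dplyr::filter(!is.na(import_var))
```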
### Global options
As you work more with knitr, you will discover that some of the default chunk options don't fit your needs and you want to change them.
You can do this by calling `knitr::opts_chunk$set()` in a code chunk.
For example, when writing books and tutorials I set:
For example, when writing books and tutorials we set:
```{r}
#| eval: false
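# a sketch of typical settings; the elided original may set different options
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE
)
```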
@ -342,7 +342,7 @@ You might consider setting `message = FALSE` and `warning = FALSE`, but that wou
There is one other way to embed R code into an R Markdown document: directly in the text, with `r inline()`.
This can be very useful if you mention properties of your data in the text.
For example, in the example document I used at the start of the chapter I had:
For example, the example document we used at the start of the chapter had:
> We have data about `r inline('nrow(diamonds)')` diamonds.
> Only `r inline('nrow(diamonds) - nrow(smaller)')` are larger than 2.5 carats.
@ -356,7 +356,7 @@ When the report is knit, the results of these computations are inserted into the
When inserting numbers into text, `format()` is your friend.
It allows you to set the number of `digits` so you don't print to a ridiculous degree of accuracy, and a `big.mark` to make numbers easier to read.
I'll often combine these into a helper function:
You might combine these into a helper function:
```{r}
comma <- function(x) format(x, digits = 2, big.mark = ",")
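# for example:
comma(3452345)
comma(.12358124331)
```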
@ -504,7 +504,7 @@ csl: apa.csl
```
As with the bibliography field, your csl field should contain a path to the file.
Here I assume that the csl file is in the same directory as the .Rmd file.
Here we assume that the csl file is in the same directory as the .Rmd file.
A good place to find CSL style files for common bibliography styles is <http://github.com/citation-style-language/styles>.
## Learning more
@ -522,8 +522,8 @@ We recommend two free resources that will teach you about Git:
2. The "Git and GitHub" chapter of *R Packages*, by Hadley.
You can also read it for free online: <http://r-pkgs.had.co.nz/git.html>.
I have also not touched on what you should actually write in order to clearly communicate the results of your analysis.
To improve your writing, I highly recommend reading either [*Style: Lessons in Clarity and Grace*](https://www.amazon.com/Style-Lessons-Clarity-Grace-12th/dp/0134080416) by Joseph M. Williams & Joseph Bizup, or [*The Sense of Structure: Writing from the Reader's Perspective*](https://www.amazon.com/Sense-Structure-Writing-Readers-Perspective/dp/0205296327) by George Gopen.
We have not touched on what you should actually write in order to clearly communicate the results of your analysis.
To improve your writing, we highly recommend reading either [*Style: Lessons in Clarity and Grace*](https://www.amazon.com/Style-Lessons-Clarity-Grace-12th/dp/0134080416) by Joseph M. Williams & Joseph Bizup, or [*The Sense of Structure: Writing from the Reader's Perspective*](https://www.amazon.com/Sense-Structure-Writing-Readers-Perspective/dp/0205296327) by George Gopen.
Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear.
(These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies).
George Gopen also has a number of short articles on writing at <https://www.georgegopen.com/the-litigation-articles.html>.
@ -92,7 +92,7 @@ str_view(x)
### Raw strings {#sec-raw-strings}
Creating a string with multiple quotes or backslashes gets confusing quickly.
To illustrate the problem, lets create a string that contains the contents of the chunk where I define the `double_quote` and `single_quote` variables:
To illustrate the problem, let's create a string that contains the contents of the chunk where we define the `double_quote` and `single_quote` variables:
```{r}
tricky <- "double_quote <- \"\\\"\" # or '\"'
@ -153,7 +153,7 @@ Then we'll talk about a slightly different scenario where you want to summarise
`str_c()`[^strings-3] takes any number of vectors as arguments and returns a character vector:
[^strings-3]: `str_c()` is very similar to the base `paste0()`.
There are two main reasons I recommend: it obeys the usual rules for handling `NA` and it uses the tidyverse recycling rules.
There are two main reasons we recommend it: it obeys the usual rules for handling `NA`, and it uses the tidyverse recycling rules.
```{r}
str_c("x", "y")
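str_c("x", "y", "z")
# vectors of length 1 are recycled to the length of the longest:
str_c("Hello ", c("John", "Susan"))
```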
@ -324,7 +324,7 @@ will match any character[^strings-8], so `"a."` will match any string that conta
str_detect(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
```
To get a better sense of what's happening, I'm going to switch to `str_view_all()`.
To get a better sense of what's happening, let's switch to `str_view_all()`.
This shows which characters are matched by surrounding them with `<>` and coloring them blue:
```{r}
@ -332,7 +332,7 @@ str_view_all(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
```
Regular expressions are a powerful and flexible language which we'll come back to in [Chapter -@sec-regular-expressions].
Here I'll just introduce only the most important components: quantifiers and character classes.
Here we'll introduce only the most important components: quantifiers and character classes.
**Quantifiers** control how many times an element can match: `?` makes a pattern optional (i.e. it matches 0 or 1 times), `+` lets a pattern repeat (i.e. it matches at least once), and `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
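For example, a quick sketch of all three quantifiers applied to "b" (the strings are illustrative):

```{r}
str_view_all(c("ac", "abc", "abbc"), "ab?c")
str_view_all(c("ac", "abc", "abbc"), "ab+c")
str_view_all(c("ac", "abc", "abbc"), "ab*c")
```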
@ -400,7 +400,7 @@ babynames |>
```
If you look closely, you'll notice that there's something off with our calculations: "Aaban" contains three "a"s, but our summary reports only two vowels.
That's because I've forgotten to tell you that regular expressions are case sensitive.
That's because we've forgotten to tell you that regular expressions are case sensitive.
There are three ways we could fix this:
- Add the upper case vowels to the character class: `str_count(name, "[aeiouAEIOU]")`.
@ -495,7 +495,7 @@ Waiting on: <https://github.com/tidyverse/tidyups/pull/15>
## Locale dependent operations {#sec-other-languages}
So far all of our examples have been using English.
The details of the many ways other languages are different to English are too diverse to detail here, but I wanted to give a quick outline of the functions who's behavior differs based on your **locale**, the set of settings that vary from country to country.
The details of the many ways other languages differ from English are too diverse to cover here, but we wanted to give a quick outline of the functions whose behavior differs based on your **locale**, the set of settings that vary from country to country.
A locale is specified with a lower-case language abbreviation, optionally followed by a `_` and an upper-case region identifier.
For example, "en" is English, "en_GB" is British English, and "en_US" is American English.
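For example, a minimal sketch of a locale-dependent operation: Turkish has both a dotted and a dotless "i", so it upper-cases "i" differently:

```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```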
@ -550,7 +550,7 @@ Fortunately there are three sets of functions where the locale matters:
Functions that work with the components of strings called **code points**.
Depending on the language involved, this might be a letter (like in most European languages), a syllable (like Japanese), or a logogram (like in Chinese).
It might be something more exotic like an accent, or a special symbol used to join two emoji together.
But to keep things simple, I'll call these letters.
But to keep things simple, we'll call these letters.
### Length
@ -562,7 +562,7 @@ str_length(c("a", "R for data science", NA))
You could use this with `count()` to find the distribution of lengths of US babynames, and then with `filter()` to look at the longest names[^strings-10]:
[^strings-10]: Looking at these entries, I'd say the babynames data removes spaces or hyphens from names and truncates after 15 letters.
[^strings-10]: Looking at these entries, we'd guess that the babynames data removes spaces or hyphens from names and truncates after 15 letters.
```{r}
babynames |>
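  count(length = str_length(name), wt = n) # a sketch of the elided pipeline
```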
@ -14,7 +14,7 @@ Tibbles *are* data frames, but they tweak some older behaviors to make your life
R is an old language, and some things that were useful 10 or 20 years ago now get in your way.
It's difficult to change base R without breaking existing code, so most innovation occurs in packages.
Here we will describe the **tibble** package, which provides opinionated data frames that make working in the tidyverse a little easier.
In most places, I'll use the term tibble and data frame interchangeably; when I want to draw particular attention to R's built-in data frame, I'll call them `data.frame`s.
In most places, we use the terms tibble and data frame interchangeably; when we want to draw particular attention to R's built-in data frame, we'll call them `data.frame`s.
If this chapter leaves you wanting to learn more about tibbles, you might enjoy `vignette("tibble")`.
@ -11,12 +11,7 @@ source("_common.R")
So far this book has focussed on tibbles and packages that work with them.
But as you start to write your own functions, and dig deeper into R, you need to learn about vectors, the objects that underlie tibbles.
If you've learned R in a more traditional way, you're probably already familiar with vectors, as most R resources start with vectors and work their way up to tibbles.
I think it's better to start with tibbles because they're immediately useful, and then work your way down to the underlying components.
Vectors are particularly important as most of the functions you will write will work with vectors.
It is possible to write functions that work with tibbles (like ggplot2, dplyr, and tidyr), but the tools you need to write such functions are currently idiosyncratic and immature.
I am working on a better approach, <https://github.com/hadley/lazyeval>, but it will not be ready in time for the publication of the book.
Even when complete, you'll still need to understand vectors, it'll just make it easier to write a user-friendly layer on top.
We think it's better to start with tibbles because they're immediately useful, and then work your way down to the underlying components.
### Prerequisites
@ -85,7 +80,7 @@ You'll start with atomic vectors, then build up to lists, and finish off with au
## Important types of atomic vector
The four most important types of atomic vector are logical, integer, double, and character.
Raw and complex are rarely used during a data analysis, so I won't discuss them here.
Raw and complex are rarely used during a data analysis, so we won't discuss them here.
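You can check which type you're dealing with using `typeof()`:

```{r}
typeof(TRUE)
typeof(1L)
typeof(1.5)
typeof("a")
```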
### Logical
@ -149,7 +144,7 @@ The distinction between integers and doubles is not usually important, but there
Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.
You've already learned a lot about working with strings in \[strings\].
Here I wanted to mention one important feature of the underlying string implementation: R uses a global string pool.
Here we wanted to mention one important feature of the underlying string implementation: R uses a global string pool.
This means that each unique string is only stored in memory once, and every use of the string points to that representation.
This reduces the amount of memory needed by duplicated strings.
You can see this behaviour in practice with `lobstr::obj_size()`:
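For example (a sketch; exact sizes depend on your platform):

```{r}
x <- "This is a reasonably long string."
lobstr::obj_size(x)

# 1000 references to the same string take barely any extra memory
y <- rep(x, 1000)
lobstr::obj_size(y)
```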
@ -223,7 +218,7 @@ There are two ways to convert, or coerce, one type of vector to another:
2. Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector.
For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.
Because explicit coercion is used relatively rarely, and is largely easy to understand, I'll focus on implicit coercion here.
Because explicit coercion is used relatively rarely, and is largely easy to understand, we'll focus on implicit coercion here.
You've already seen the most important type of implicit coercion: using a logical vector in a numeric context.
In this case, `TRUE` is converted to `1` and `FALSE` to `0`.
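This means, for example, that you can take the sum or mean of a logical vector:

```{r}
x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y)  # how many are greater than 10?
mean(y) # what proportion are greater than 10?
```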
@ -247,7 +242,7 @@ if (length(x)) {
```
In this case, 0 is converted to `FALSE` and everything else is converted to `TRUE`.
I think this makes it harder to understand your code, and I don't recommend it.
We think this makes it harder to understand your code, and we don't recommend it.
Instead be explicit: `length(x) > 0`.
It's also important to understand what happens when you try and create a vector containing multiple types with `c()`: the most complex type always wins.
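For example:

```{r}
typeof(c(TRUE, 1L))
typeof(c(1L, 1.5))
typeof(c(1.5, "a"))
```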
@ -286,7 +281,7 @@ As well as implicitly coercing the types of vectors to be compatible, R will als
This is called vector **recycling**, because the shorter vector is repeated, or recycled, to the same length as the longer vector.
This is generally most useful when you are mixing vectors and "scalars".
I put scalars in quotes because R doesn't actually have scalars: instead, a single number is a vector of length 1.
We put scalars in quotes because R doesn't actually have scalars: instead, a single number is a vector of length 1.
Because there are no scalars, most built-in functions are **vectorised**, meaning that they will operate on a vector of numbers.
That's why, for example, this code works:
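A minimal sketch of the kind of code meant:

```{r}
sample(10) + 100
runif(10) > 0.5
```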
@ -488,7 +483,7 @@ x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
```
I'll draw them as follows:
We'll draw them as follows:
```{r}
#| echo: false
@ -504,11 +499,11 @@ There are three principles:
2. Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.
3. The orientation of the children (i.e. rows or columns) isn't important, so I'll pick a row or column orientation to either save space or illustrate an important property in the example.
3. The orientation of the children (i.e. rows or columns) isn't important, so we'll pick a row or column orientation to either save space or illustrate an important property in the example.
### Subsetting
There are three ways to subset a list, which I'll illustrate with a list named `a`:
There are three ways to subset a list, which we'll illustrate with a list named `a`:
```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
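# a sketch of the three ways:
str(a[1:2]) # [ extracts a sub-list
str(a[[1]]) # [[ extracts a single component
str(a$a)    # $ is shorthand for [[ with a named component
```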
@ -659,7 +654,7 @@ Other important generics are the subsetting functions `[`, `[[`, and `$`.
## Augmented vectors
Atomic vectors and lists are the building blocks for other important vector types like factors and dates.
I call these **augmented vectors**, because they are vectors with additional **attributes**, including class.
We call these **augmented vectors**, because they are vectors with additional **attributes**, including class.
Because augmented vectors have a class, they behave differently to the atomic vector on which they are built.
In this book, we make use of four important augmented vectors: