Typo fixes (#1204)

* Grammatical edits

* Grammatical edits

* Grammatical edits

* Remove ref since fn not mentioned earlier

* Typo fixes, closes #1191

* Add palmerpenguins, closes #1192

* Grammatical edits

* More grammatical edits

* Omit warning, closes #1193

* Fix link, closes #1197

* Grammatical edits

* Code style + clarify labs() args, closes #1199

* Fix year, closes #1200

* Use penguins instead, closes #1201
Mine Cetinkaya-Rundel, 2023-01-03 02:06:27 -05:00 (committed via GitHub)
parent 26a20c586a
commit e68098f193
7 changed files with 304 additions and 267 deletions


@ -9,13 +9,13 @@ status("polishing")
## Introduction
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to apply what you've learned to your own data.
Working with data provided by R packages is a great way to learn data science tools, but you want to apply what you've learned to your own data at some point.
In this chapter, you'll learn the basics of reading data files into R.
Specifically, this chapter will focus on reading plain-text rectangular files.
We'll start with some practical advice for handling features like column names and types and missing data.
We'll start with practical advice for handling features like column names, types, and missing data.
You will then learn about reading data from multiple files at once and writing data from R to a file.
Finally, you'll learn how to hand craft data frames in R.
Finally, you'll learn how to handcraft data frames in R.
### Prerequisites
@ -30,9 +30,9 @@ library(tidyverse)
## Reading data from a file
To begin we'll focus on the most rectangular data file type: the CSV, short for comma-separated values.
To begin, we'll focus on the most rectangular data file type: CSV, which is short for comma-separated values.
Here is what a simple CSV file looks like.
The first row, commonly called the header row, gives the column names, and the following six rows give the data.
The first row, commonly called the header row, gives the column names, and the following six rows provide the data.
```{r}
#| echo: false
@ -62,16 +62,16 @@ The first argument is the most important: it's the path to the file.
students <- read_csv("data/students.csv")
```
When you run `read_csv()` it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).
It also prints out some information about how to retrieve the full column specification as well as how to quiet this message.
This message is an important part of readr and we'll come back to in @sec-col-types.
When you run `read_csv()`, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).
It also prints out some information about retrieving the full column specification and how to quiet this message.
This message is an integral part of readr, and we'll return to it in @sec-col-types.
### Practical advice
Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
Let's take another look at the `students` data with that in mind.
In the `favourite.food` column, there are a bunch of food items and then the character string `N/A`, which should have been an real `NA` that R will recognize as "not available".
In the `favourite.food` column, there are a bunch of food items, and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
This is something we can address using the `na` argument.
```{r}
@ -81,9 +81,9 @@ students <- read_csv("data/students.csv", na = c("N/A", ""))
students
```
You might also notice that the `Student ID` and `Full Name` columns are surrounded by back ticks.
You might also notice that the `Student ID` and `Full Name` columns are surrounded by backticks.
That's because they contain spaces, breaking R's usual rules for variable names.
To refer to them, you need to use those back ticks:
To refer to them, you need to use those backticks:
```{r}
students |>
@ -104,7 +104,7 @@ students |> janitor::clean_names()
```
Another common task after reading in data is to consider variable types.
For example, `meal_type` is a categorical variable with a known set of possible values, which in R should be represent as factor:
For example, `meal_type` is a categorical variable with a known set of possible values, which in R should be represented as a factor:
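As a hedged sketch of the conversion (assuming the tidyverse is loaded as in the Prerequisites; the book's own chunk is elided from this diff):

```{r}
#| eval: false
# One way to convert meal_type to a factor during cleaning
students |>
  mutate(meal_type = factor(meal_type))
```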
```{r}
students |>
@ -114,10 +114,11 @@ students |>
)
```
Note that the values in the `meal_type` variable has stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
Note that the values in the `meal_type` variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
You'll learn more about factors in @sec-factors.
Before you move on to analyzing these data, you'll probably want to fix the `age` column as well: currently it's a character variable because of the one observation that is typed out as `five` instead of a numeric `5`.
Before you analyze these data, you'll probably want to fix the `age` column.
Currently, it's a character variable because one of the observations is typed out as `five` instead of a numeric `5`.
We discuss the details of fixing this issue in @sec-import-spreadsheets.
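One possible fix, sketched with `if_else()` and `parse_number()` (one combination that works; the book's own approach may differ):

```{r}
#| eval: false
# Replace the spelled-out "five" with "5", then parse the column as numbers
students |>
  mutate(age = parse_number(if_else(age == "five", "5", age)))
```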
```{r}
@ -133,7 +134,7 @@ students
### Other arguments
There are a couple of other important arguments that we need to mention, and they'll be easier to demonstrate if we first show you a handy trick: `read_csv()` can read csv files that you've created in a string:
There are a couple of other important arguments that we need to mention, and they'll be easier to demonstrate if we first show you a handy trick: `read_csv()` can read CSV files that you've created in a string:
```{r}
#| message: false
@ -145,8 +146,8 @@ read_csv(
)
```
Usually `read_csv()` uses the first line of the data for the column names, which is a very common convention.
But sometime there are a few lines of metadata at the top of the file.
Usually, `read_csv()` uses the first line of the data for the column names, which is a very common convention.
But it's not uncommon for a few lines of metadata to be included at the top of the file.
You can use `skip = n` to skip the first `n` lines or use `comment = "#"` to drop all lines that start with (e.g.) `#`:
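A minimal sketch of both arguments (the metadata and comment lines are made up):

```{r}
#| eval: false
# Skip a known number of metadata lines...
read_csv(
  "The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3",
  skip = 2
)

# ...or drop every line that starts with a comment character
read_csv(
  "# A comment I want to skip
  x,y,z
  1,2,3",
  comment = "#"
)
```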
```{r}
@ -169,7 +170,7 @@ read_csv(
```
In other cases, the data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label them sequentially from `X1` to `Xn`:
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings and instead label them sequentially from `X1` to `Xn`:
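A small sketch with illustrative values:

```{r}
#| eval: false
# No header row, so readr labels the columns X1, X2, X3
read_csv(
  "1,2,3
  4,5,6",
  col_names = FALSE
)
```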
```{r}
#| message: false
@ -181,7 +182,7 @@ read_csv(
)
```
Alternatively you can pass `col_names` a character vector which will be used as the column names:
Alternatively, you can pass `col_names` a character vector which will be used as the column names:
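Again as a sketch with illustrative values:

```{r}
#| eval: false
# Supply the column names directly
read_csv(
  "1,2,3
  4,5,6",
  col_names = c("x", "y", "z")
)
```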
```{r}
#| message: false
@ -194,25 +195,25 @@ read_csv(
```
These arguments are all you need to know to read the majority of CSV files that you'll encounter in practice.
(For the rest, you'll need to carefully inspect your `.csv` file and carefully read the documentation for `read_csv()`'s many other arguments.)
(For the rest, you'll need to carefully inspect your `.csv` file and read the documentation for `read_csv()`'s many other arguments.)
### Other file types
Once you've mastered `read_csv()`, using readr's other functions is straightforward; it's just a matter of knowing which function to reach for:
- `read_csv2()` reads semicolon separated files.
These use `;` instead of `,` to separate fields, and are common in countries that use `,` as the decimal marker.
- `read_csv2()` reads semicolon-separated files.
These use `;` instead of `,` to separate fields and are common in countries that use `,` as the decimal marker.
- `read_tsv()` reads tab delimited files.
- `read_tsv()` reads tab-delimited files.
- `read_delim()` reads in files with any delimiter, attempting to automatically guess the delimited if you don't specify it.
- `read_delim()` reads in files with any delimiter, attempting to automatically guess the delimiter if you don't specify it.
- `read_fwf()` reads fixed width files.
You can specify fields either by their widths with `fwf_widths()` or their position with `fwf_positions()`.
- `read_fwf()` reads fixed-width files.
You can specify fields by their widths with `fwf_widths()` or by their positions with `fwf_positions()`.
- `read_table()` reads a common variation of fixed width files where columns are separated by white space.
- `read_table()` reads a common variation of fixed-width files where columns are separated by white space.
- `read_log()` reads Apache style log files.
- `read_log()` reads Apache-style log files.
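To make the family concrete, here is a hedged sketch (the file names are hypothetical):

```{r}
#| eval: false
read_csv2("data/students-eu.csv")  # ";" separates fields, "," marks decimals
read_tsv("data/scores.tsv")        # tab-separated
read_delim("data/notes.txt")       # delimiter guessed if not specified
```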
### Exercises
@ -223,8 +224,8 @@ Once you've mastered `read_csv()`, using readr's other functions is straightforw
3. What are the most important arguments to `read_fwf()`?
4. Sometimes strings in a CSV file contain commas.
To prevent them from causing problems they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
What argument to `read_csv()` do you need to specify to read the following text into a data frame?
To prevent them from causing problems, they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
To read the following text into a data frame, what argument to `read_csv()` do you need to specify?
```{r}
#| eval: false
@ -249,8 +250,8 @@ Once you've mastered `read_csv()`, using readr's other functions is straightforw
a. Extracting the variable called `1`.
b. Plotting a scatterplot of `1` vs. `2`.
c. Creating a new column called `3` which is `2` divided by `1`.
d. Renaming the columns to `one`, `two` and `three`.
c. Creating a new column called `3`, which is `2` divided by `1`.
d. Renaming the columns to `one`, `two`, and `three`.
```{r}
annoying <- tibble(
@ -261,21 +262,21 @@ Once you've mastered `read_csv()`, using readr's other functions is straightforw
## Controlling column types {#sec-col-types}
A CSV file doesn't contain any information about the type of each variable (i.e. whether it's a logical, number, string, etc.), so readr will try to guess the type.
This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and if needed, how to supply the column types yourself.
Finally, we'll mention a couple of general strategies that are a useful if readr is failing catastrophically and you need to get more insight in to the structure of your file.
A CSV file doesn't contain any information about the type of each variable (i.e., whether it's a logical, number, string, etc.), so readr will try to guess the type.
This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself.
Finally, we'll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.
### Guessing types
readr uses a heuristic to figure out the column types.
For each column, it pulls the values of 1,000[^data-import-2] rows spaced evenly from the first row to the last, ignoring an missing values.
For each column, it pulls the values of 1,000[^data-import-2] rows spaced evenly from the first row to the last, ignoring missing values.
It then works through the following questions:
[^data-import-2]: You can override the default of 1000 with the `guess_max` argument.
- Does it contain only `F`, `T`, `FALSE`, or `TRUE` (ignoring case)? If so, it's a logical.
- Does it contain only numbers (e.g. `1`, `-4.5`, `5e6`, `Inf`)? If so, it's a number.
- Does it match match the ISO8601 standard? If so, it's a date or date-time. (We'll come back to date/times in more detail in @sec-creating-datetimes).
- Does it contain only numbers (e.g., `1`, `-4.5`, `5e6`, `Inf`)? If so, it's a number.
- Does it match the ISO8601 standard? If so, it's a date or date-time. (We'll return to date-times in more detail in @sec-creating-datetimes).
- Otherwise, it must be a string.
You can see that behavior in action in this simple example:
@ -289,12 +290,12 @@ read_csv("
)
```
This heuristic works well if you have a clean dataset, but in real life you'll encounter a selection of weird and wonderful failures.
This heuristic works well if you have a clean dataset, but in real life, you'll encounter a selection of weird and beautiful failures.
### Missing values, column types, and problems
The most common way column detection fails is that a column contains unexpected values and you get a character column instead of a more specific type.
One of the most common causes for this a missing value, recorded using something other than the `NA` that stringr expects.
The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type.
One of the most common causes for this is a missing value, recorded using something other than the `NA` that readr expects.
Take this simple 1 column CSV file as an example:
@ -315,7 +316,7 @@ df <- read_csv(csv)
```
In this very small case, you can easily see the missing value `.`.
But what happens if you have thousands of rows with only a few missing values represented by `.`s speckled amongst them?
But what happens if you have thousands of rows with only a few missing values represented by `.`s speckled among them?
One approach is to tell readr that `x` is a numeric column, and then see where it fails.
You can do that with the `col_types` argument, which takes a named list:
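A sketch of that approach, reusing the `csv` string from above:

```{r}
#| eval: false
# Declare x as a double, then ask readr where parsing failed
df <- read_csv(csv, col_types = list(x = col_double()))
problems(df)
```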
@ -342,9 +343,9 @@ df <- read_csv(csv, na = ".")
readr provides a total of nine column types for you to use:
- `col_logical()` and `col_double()` read logicals and real numbers. They're relatively rarely needed (except as above), since readr will usually guess them for you.
- `col_integer()` reads integers. We distinguish because integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
- `col_integer()` reads integers. We distinguish integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
- `col_character()` reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. long series of digits that identifies some object, but it doesn't make sense to (e.g.) divide it in half.
- `col_factor()`, `col_date()` and `col_datetime()` create factors, dates and date-time respectively; you'll learn more about those when we get to those data types in @sec-factors and @sec-dates-and-times.
- `col_factor()`, `col_date()`, and `col_datetime()` create factors, dates, and date-times respectively; you'll learn more about those when we get to those data types in @sec-factors and @sec-dates-and-times.
- `col_number()` is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You'll learn more about it in @sec-numbers.
- `col_skip()` skips a column so it's not included in the result.
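These types can also be combined into a full specification with `cols()`; a minimal sketch:

```{r}
#| eval: false
# Read x as a double and everything else as character
read_csv(csv, col_types = cols(
  x = col_double(),
  .default = col_character()
))
```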
@ -429,7 +430,7 @@ There are two main alternative:
```
2. The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages.
We'll come back to arrow in more depth in @sec-arrow.
We'll return to arrow in more depth in @sec-arrow.
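A sketch of the round trip (assuming the arrow package is installed):

```{r}
#| eval: false
library(arrow)
write_parquet(students, "students.parquet")
read_parquet("students.parquet")
```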
```{r}
#| eval: false


@ -233,7 +233,7 @@ Note that if you want to find the number of duplicates, or rows that weren't dup
3. Sort `flights` to find the fastest flights (Hint: try sorting by a calculation).
4. Was there a flight on every day of 2017?
4. Was there a flight on every day of 2013?
5. Which flights traveled the farthest distance?
Which traveled the least distance?


@ -3,6 +3,7 @@
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
@ -45,7 +46,7 @@ library(tidyverse)
You only need to install a package once, but you need to reload it every time you start a new session.
In addition to tidyverse, we will also use the **palmerpenguins** package, which includes the `penguins` dataset containing body measurements for penguins in three islands in the Palmer Archipelago.
In addition to tidyverse, we will also use the **palmerpenguins** package, which includes the `penguins` dataset containing body measurements for penguins on three islands in the Palmer Archipelago.
```{r}
library(palmerpenguins)
@ -158,10 +159,14 @@ The following plots show the result of adding these mappings, one at a time.
#| also shows body mass on the y-axis. The values range from 3000 to
#| 6000.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm))
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g))
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm)
)
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
```
Our empty canvas now has more structure -- it's clear where flipper lengths will be displayed (on the x-axis) and where body masses will be displayed (on the y-axis).
@ -184,8 +189,10 @@ You'll learn a whole bunch of geoms throughout the book, particularly in @sec-la
#| displays a positive, linear, relatively strong relationship between
#| these two variables.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point()
```
@ -229,9 +236,10 @@ Throughout the book you will make many more ggplots and have many more opportuni
#| between these two variables. Species (Adelie, Chinstrap, and Gentoo)
#| are represented with different colors.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g,
color = species)) +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point()
```
@ -252,9 +260,10 @@ Since this is a new geometric object representing our data, we will add a new ge
#| Chinstrap, and Gentoo). Different penguin species are plotted in
#| different colors for the points and the smooth curves.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g,
color = species)) +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
geom_point() +
geom_smooth()
```
@ -274,8 +283,10 @@ Since we want points to be colored based on species but don't want the smooth cu
#| Chinstrap, and Gentoo). Different penguin species are plotted in
#| different colors for the points only.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(mapping = aes(color = species)) +
geom_smooth()
```
@ -296,8 +307,10 @@ Therefore, in addition to color, we can also map `species` to the `shape` aesthe
#| Chinstrap, and Gentoo). Different penguin species are plotted in
#| different colors and shapes for the points only.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth()
```
@ -305,6 +318,8 @@ ggplot(data = penguins,
Note that the legend is automatically updated to reflect the different shapes of the points as well.
And finally, we can improve the labels of our plot using the `labs()` function in a new layer.
Some of the arguments to `labs()` might be self explanatory: `title` adds a title and `subtitle` adds a subtitle to the plot.
Other arguments match the aesthetic mappings, `x` is the x-axis label, `y` is the y-axis label, and `color` and `shape` define the label for the legend.
```{r}
#| warning: false
@ -318,16 +333,18 @@ And finally, we can improve the labels of our plot using the `labs()` function i
#| roughly the same for these three species, and Gentoo penguins are
#| larger than penguins from the other two species.
ggplot(penguins,
aes(x = flipper_length_mm, y = body_mass_g)) +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(aes(color = species, shape = species)) +
geom_smooth() +
labs(
title = "Body mass and flipper length",
subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
x = "Flipper length (mm)",
x = "Flipper length (mm)",
y = "Body mass (g)",
color = "Species",
color = "Species",
shape = "Species"
)
```
@ -342,7 +359,7 @@ We finally have a plot that perfectly matches our "ultimate goal"!
2. What does the `bill_depth_mm` variable in the `penguins` data frame describe?
Read the help for `?penguins` to find out.
3. Make a scatterplot of `bill_depth_mm` vs `bill_length_mm`.
3. Make a scatterplot of `bill_depth_mm` vs. `bill_length_mm`.
Describe the relationship between these two variables.
4. What happens if you make a scatterplot of `species` vs `bill_depth_mm`?
@ -369,27 +386,32 @@ We finally have a plot that perfectly matches our "ultimate goal"!
```{r}
#| echo: false
#| warning: false
#| fig-alt: >
#| A scatterplot of body mass vs. flipper length of penguins, colored
#| by bill depth. A smooth curve of the relationship between body mass
#| and flipper length is overlaid. The relationship is positive,
#| fairly linear, and moderately strong.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point(aes(color = bill_depth_mm)) +
geom_smooth()
```
9 .
Run this code in your head and predict what the output will look like.
Then, run the code in R and check your predictions.
9. Run this code in your head and predict what the output will look like.
Then, run the code in R and check your predictions.
```{r}
#| eval: false
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
geom_point() +
geom_smooth(se = FALSE)
```
@ -399,13 +421,22 @@ Then, run the code in R and check your predictions.
```{r}
#| eval: false
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
ggplot() +
geom_point(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_smooth(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
```
## ggplot2 calls
@ -416,8 +447,10 @@ So far we've been very explicit, which is helpful when you are learning:
```{r}
#| eval: false
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point()
```
@ -759,9 +792,13 @@ You will learn about many other geoms for visualizing distributions of variables
#| one labelled "species" which shows the shape scale and the other
#| that shows the color scale.
ggplot(data = penguins,
mapping = aes(x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species)) +
ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species
)
) +
geom_point() +
labs(color = "Species")
```
@ -785,7 +822,7 @@ ggsave(filename = "my-plot.png")
file.remove("my-plot.png")
```
This will save your plot to your working directory, a concept you'll learn more about in @sec-workflow-scripts.
This will save your plot to your working directory, a concept you'll learn more about in @sec-workflow-scripts-projects.
If you don't specify the `width` and `height` they will be taken from the dimensions of the current plotting device.
For reproducible code, you'll want to specify them.
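For example (dimensions are in inches by default):

```{r}
#| eval: false
ggsave(filename = "my-plot.png", width = 6, height = 4)
```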

intro.qmd

@ -8,20 +8,20 @@ source("_common.R")
Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.
The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly.
After reading this book, you'll have the tools to tackle a wide variety of data science challenges, using the best parts of R.
After reading this book, you'll have the tools to tackle a wide variety of data science challenges using the best parts of R.
## What you will learn
Data science is a huge field, and there's no way you can master it all by reading a single book.
The goal of this book is to give you a solid foundation in the most important tools, and enough knowledge to find the resources to learn more when necessary.
Data science is a vast field, and there's no way you can master it all by reading a single book.
This book aims to give you a solid foundation in the most important tools and enough knowledge to find the resources to learn more when necessary.
Our model of the tools needed in a typical data science project looks something like @fig-ds-diagram.
```{r}
#| label: fig-ds-diagram
#| echo: false
#| fig-cap: >
#| In our model of the data science process you start with data import
#| and tidying. Next you understand your data with an iterative cycle of
#| In our model of the data science process, you start with data import
#| and tidying. Next, you understand your data with an iterative cycle of
#| transforming, visualizing, and modeling. You finish the process
#| by communicating your results to other humans.
#| fig-alt: >
@ -33,81 +33,79 @@ Our model of the tools needed in a typical data science project looks something
knitr::include_graphics("diagrams/data-science/base.png", dpi = 270)
```
First you must **import** your data into R.
This typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R.
First, you must **import** your data into R.
This typically means that you take data stored in a file, database, or web application programming interface (API) and load it into a data frame in R.
If you can't get your data into R, you can't do data science on it!
Once you've imported your data, it is a good idea to **tidy** it.
Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored.
Tidying your data means storing it in a consistent form that matches the semantics of the dataset with how it is stored.
In brief, when your data is tidy, each column is a variable, and each row is an observation.
Tidy data is important because the consistent structure lets you focus your efforts on answering questions about the data, not fighting to get the data into the right form for different functions.
Once you have tidy data, a common next step is to **transform** it.
Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
Together, tidying and transforming are called **wrangling**, because getting your data in a form that's natural to work with often feels like a fight!
Transformation includes narrowing in on observations of interest (like all people in one city or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
Together, tidying and transforming are called **wrangling** because getting your data in a form that's natural to work with often feels like a fight!
Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualization and modelling.
These have complementary strengths and weaknesses so any real analysis will iterate between them many times.
Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualization and modeling.
These have complementary strengths and weaknesses, so any real analysis will iterate between them many times.
**Visualization** is a fundamentally human activity.
A good visualization will show you things that you did not expect, or raise new questions about the data.
A good visualization might also hint that you're asking the wrong question, or that you need to collect different data.
Visualizations can surprise you and they don't scale particularly well because they require a human to interpret them.
A good visualization will show you things you did not expect or raise new questions about the data.
A good visualization might also hint that you're asking the wrong question or that you need to collect different data.
Visualizations can surprise you, and they don't scale particularly well because they require a human to interpret them.
The last step of data science is **communication**, an absolutely critical part of any data analysis project.
It doesn't matter how well your models and visualization have led you to understand the data unless you can also communicate your results to others.
Surrounding all these tools is **programming**.
Programming is a cross-cutting tool that you use in nearly every part of a data science project.
You don't need to be an expert programmer to be a successful data scientist, but learning more about programming pays off, because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
You don't need to be an expert programmer to be a successful data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks and solve new problems with greater ease.
You'll use these tools in every data science project, but for most projects they're not enough.
There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%.
You'll use these tools in every data science project, but they're not enough for most projects.
There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools you'll learn in this book, but you'll need other tools to tackle the remaining 20%.
Throughout this book, we'll point you to resources where you can learn more.
## How this book is organised
## How this book is organized
The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times).
In our experience, however, learning data ingest and tidying first is sub-optimal, because 80% of the time it's routine and boring, and the other 20% of the time it's weird and frustrating.
The previous description of the tools of data science is organized roughly according to the order in which you use them in an analysis (although, of course, you'll iterate through them multiple times).
In our experience, however, learning data ingesting and tidying first is sub-optimal because 80% of the time, it's routine and boring, and the other 20% of the time, it's weird and frustrating.
That's a bad place to start learning a new subject!
Instead, we'll start with visualization and transformation of data that's already been imported and tidied.
That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth the effort.
Within each chapter, we try and adhere to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
Within each chapter, we try and adhere to a similar pattern: start with some motivating examples so you can see the bigger picture and then dive into the details.
Each section of the book is paired with exercises to help you practice what you've learned.
Although it can be tempting to skip the exercises, there's no better way to learn than practicing on real problems.
## What you won't learn
There are a number of important topics that this book doesn't cover.
There are several important topics that this book doesn't cover.
We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible.
That means this book can't cover every important topic.
### Modeling
<!--# TO DO: Say a few sentences about modelling. -->
To learn more about modeling, we highly recommend [Tidy Modeling with R](https://www.tmwr.org), by our colleagues Max Kuhn and Julia Silge.
To learn more about modeling, we highly recommend [Tidy Modeling with R](https://www.tmwr.org) by our colleagues Max Kuhn and Julia Silge.
This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.
### Big data
This book proudly focuses on small, in-memory datasets.
This is the right place to start because you can't tackle big data unless you have experience with small data.
The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care, you can typically use them to work with 1-2 Gb of data.
The tools you learn in this book will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work with 1-2 Gb of data.
If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table).
This book doesn't teach data.table because it has a very concise interface that offers fewer linguistic cues, which makes it harder to learn.
However, if you're working with large data, the performance payoff is well worth the effort required to learn it.
However, the performance payoff is well worth the effort required to learn it if you're working with large data.
If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise.
While the complete data set might be big, often the data needed to answer a specific question is small.
While the complete data set might be big, often, the data needed to answer a specific question is small.
You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in.
The challenge here is finding the right small data, which often requires a lot of iteration.
Another possibility is that your big data problem is actually a large number of small data problems in disguise.
Each individual problem might fit in memory, but you have millions of them.
For example, you might want to fit a model to each person in your dataset.
This would be trivial if you had just 10 or 100 people, but instead you have a million.
This would be trivial if you had just 10 or 100 people; instead, you have a million.
Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like [Hadoop](https://hadoop.apache.org/) or [Spark](https://spark.apache.org/)) that allows you to send different datasets to different computers for processing.
Once you've figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like **sparklyr** to solve it for the full dataset.
@ -119,35 +117,35 @@ They're not!
And in practice, most data science teams use a mix of languages, often at least R and Python.
However, we strongly believe that it's best to master one tool at a time.
You will get better faster if you dive deep, rather than spreading yourself thinly over many topics.
You will get better faster if you dive deep rather than spreading yourself thinly over many topics.
This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time.
You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next interesting thing.
You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next exciting thing.
We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science.
R is not just a programming language, it is also an interactive environment for doing data science.
R is not just a programming language; it is also an interactive environment for doing data science.
To support interaction, R is a much more flexible language than many of its peers.
This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process.
These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.
This flexibility has its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process.
These mini languages help you think about problems as a data scientist while supporting fluent interaction between your brain and the computer.
## Prerequisites
We've made a few assumptions about what you already know in order to get the most out of this book.
We've made a few assumptions about what you already know to get the most out of this book.
You should be generally numerically literate, and it's helpful if you have some programming experience already.
If you've never programmed before, you might find [Hands on Programming with R](https://rstudio-education.github.io/hopr/) by Garrett to be a useful adjunct to this book.
If you've never programmed before, you might find [Hands-On Programming with R](https://rstudio-education.github.io/hopr/) by Garrett to be a valuable adjunct to this book.
There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the **tidyverse**, and a handful of other packages.
You need four things to run the code in this book: R, RStudio, a collection of R packages called the **tidyverse**, and a handful of other packages.
Packages are the fundamental units of reproducible R code.
They include reusable functions, the documentation that describes how to use them, and sample data.
They include reusable functions, documentation that describes how to use them, and sample data.
### R
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork.
CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages.
Don't try and pick a mirror that's close to you: instead use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
Don't pick a mirror close to you; instead, use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
A new major version of R comes out once a year, and there are 2-3 minor releases each year.
It's a good idea to update regularly.
Upgrading can be a bit of a hassle, especially for major versions, which require you to re-install all your packages, but putting it off only makes it worse.
Upgrading can be a bit of a hassle, especially for major versions requiring you to re-install all your packages, but putting it off only makes it worse.
You'll need at least R 4.1.0 for this book.
### RStudio
@ -156,11 +154,11 @@ RStudio is an integrated development environment, or IDE, for R programming.
Download and install it from <https://posit.co/download/rstudio-desktop/>.
RStudio is updated a couple of times a year.
When a new version is available, RStudio will let you know.
It's a good idea to upgrade regularly so you can take advantage of the latest and greatest features.
It's a good idea to upgrade regularly to take advantage of the latest and greatest features.
For this book, make sure you have at least RStudio 2022.02.0.
When you start RStudio, @fig-rstudio-console, you'll see two key regions in the interface: the console pane, and the output pane.
For now, all you need to know is that you type R code in the console pane, and press enter to run it.
When you start RStudio, @fig-rstudio-console, you'll see two key regions in the interface: the console pane and the output pane.
For now, all you need to know is that you type the R code in the console pane and press enter to run it.
You'll learn more as we go along!
```{r}
@ -181,7 +179,7 @@ You'll also need to install some R packages.
An R **package** is a collection of functions, data, and documentation that extends the capabilities of base R.
Using packages is key to the successful use of R.
The majority of the packages that you will learn in this book are part of the so-called tidyverse.
All packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.
All packages in the tidyverse share a common philosophy of data and R programming and are designed to work together naturally.
You can install the complete tidyverse with a single line of code:
@ -191,9 +189,9 @@ You can install the complete tidyverse with a single line of code:
install.packages("tidyverse")
```
On your own computer, type that line of code in the console, and then press enter to run it.
R will download the packages from CRAN and install them on to your computer.
If you have problems installing, make sure that you are connected to the internet, and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
On your computer, type that line of code in the console, and then press enter to run it.
R will download the packages from CRAN and install them on your computer.
If you have problems installing, make sure that you are connected to the internet and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
You will not be able to use the functions, objects, or help files in a package until you load it with `library()`.
Once you have installed a package, you can load it using the `library()` function:
@ -202,33 +200,33 @@ Once you have installed a package, you can load it using the `library()` functio
library(tidyverse)
```
This tells you that tidyverse is loading eight packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats.
These are considered to be the **core** of the tidyverse because you'll use them in almost every analysis.
This tells you that tidyverse loads eight packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats.
These are considered the **core** of the tidyverse because you'll use them in almost every analysis.
Packages in the tidyverse change fairly frequently.
You can check whether updates are available, and optionally install them, by running `tidyverse_update()`.
You can check whether updates are available and optionally install them by running `tidyverse_update()`.
### Other packages
There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles.
There are many other excellent packages that are not part of the tidyverse because they solve problems in a different domain or are designed with a different set of underlying principles.
This doesn't make them better or worse, just different.
In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages.
In other words, the complement to the tidyverse is not the messyverse but many other universes of interrelated packages.
As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
In this book we'll use three data packages from outside the tidyverse:
In this book, we'll use three data packages from outside the tidyverse:
```{r}
#| eval: false
install.packages(c("nycflights13", "gapminder", "Lahman"))
install.packages(c("gapminder", "Lahman", "nycflights13", "palmerpenguins"))
```
These packages provide data on airline flights, world development, and baseball that we'll use to illustrate key data science ideas.
These packages provide data on world development, baseball, airline flights, and body measurements of penguins that we'll use to illustrate key data science ideas.
## Running R code
The previous section showed you several examples of running R code.
Code in the book looks like this:
The code in the book looks like this:
```{r}
#| eval: true
@ -242,36 +240,36 @@ If you run the same code in your local console, it will look like this:
There are two main differences.
In your console, you type after the `>`, called the **prompt**; we don't show the prompt in the book.
In the book, output is commented out with `#>`; in your console it appears directly after your code.
In the book, the output is commented out with `#>`; in your console, it appears directly after your code.
These two differences mean that if you're working with an electronic version of the book, you can easily copy code out of the book and into the console.
Throughout the book, we use a consistent set of conventions to refer to code:
- Functions are displayed in a code font and followed by parentheses, like `sum()`, or `mean()`.
- Functions are displayed in a code font and followed by parentheses, like `sum()` or `mean()`.
- Other R objects (such as data or function arguments) are in a code font, without parentheses, like `flights` or `x`.
- Sometimes, to make it clear which package an object comes from, we'll use the package name followed by two colons, like `dplyr::mutate()`, or\
- Sometimes, to make it clear which package an object comes from, we'll use the package name followed by two colons, like `dplyr::mutate()` or\
`nycflights13::flights`.
This is also valid R code.
## Acknowledgements
## Acknowledgments
This book isn't just the product of Hadley, Mine, and Garrett, but is the result of many conversations (in person and online) that we've had with many people in the R community.
There are a few people we'd like to thank in particular, because they have spent many hours answering our questions and helping us to better think about data science:
This book isn't just the product of Hadley, Mine, and Garrett but is the result of many conversations (in person and online) that we've had with many people in the R community.
There are a few people we'd like to thank in particular because they have spent many hours answering our questions and helping us to better think about data science:
- Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.
- The three chapters on workflow were adapted (with permission), from <https://stat545.com/block002_hello-r-workspace-wd-project.html> by Jenny Bryan.
- The three chapters on workflow were adapted (with permission) from <https://stat545.com/block002_hello-r-workspace-wd-project.html> by Jenny Bryan.
- Yihui Xie for his work on the [bookdown](https://github.com/rstudio/bookdown) package, and for tirelessly responding to my feature requests.
- Yihui Xie for his work on the [bookdown](https://github.com/rstudio/bookdown) package and for tirelessly responding to my feature requests.
- Bill Behrman for his thoughtful reading of the entire book, and for trying it out with his data science class at Stanford.
- Bill Behrman for his thoughtful reading of the entire book and for trying it out with his data science class at Stanford.
- The #rstats Twitter community who reviewed all of the draft chapters and provided tons of useful feedback.
- The #rstats Twitter community who reviewed all of the draft chapters and provided tons of helpful feedback.
This book was written in the open, and many people contributed pull requests to fix minor problems.
Special thanks goes to everyone who contributed via GitHub:
Special thanks go to everyone who contributed via GitHub:
```{r}
#| eval: false
@ -335,7 +333,7 @@ cat(".\n")
An online version of this book is available at <https://r4ds.hadley.nz>.
It will continue to evolve in between reprints of the physical book.
The source of the book is available at <https://github.com/hadley/r4ds>.
The book is powered by [Quarto](https://quarto.org) which makes it easy to write books that combine text and executable code.
The book is powered by [Quarto](https://quarto.org), which makes it easy to write books that combine text and executable code.
This book was built with:


@ -9,7 +9,7 @@ status("polishing")
## Introduction
So far you've seen Quarto used to produce HTML documents.
So far, you've seen Quarto used to produce HTML documents.
This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.
There are two ways to set the output of a document:
@ -41,9 +41,9 @@ There are two ways to set the output of a document:
Quarto offers a wide range of output formats.
You can find the complete list at <https://quarto.org/docs/output-formats/all-formats.html>.
Many formats share some output options (e.g. `toc: true` for including a table of contents), but others have options that are format specific (e.g. `code-fold: true` collapses code chunks into a `<details>` tag for HTML output so the user can display it on demand, it's not applicable in a PDF or Word document).
Many formats share some output options (e.g., `toc: true` for including a table of contents), but others have options that are format-specific (e.g., `code-fold: true` collapses code chunks into a `<details>` tag for HTML output so the user can display it on demand; it's not applicable in a PDF or Word document).
To override the default voptions, you need to use an expanded `format` field.
To override the default options, you need to use an expanded `format` field.
For example, if you wanted to render an `html` with a floating table of contents, you'd use:
``` yaml
@ -64,7 +64,7 @@ format:
docx: default
```
Note the special syntax (`pdf: default`) if you don't want to override any of the default options.
Note the special syntax (`pdf: default`) if you don't want to override any default options.
To render to all formats specified in the YAML of a document, you can use `output_format = "all"`.
@ -77,10 +77,10 @@ quarto::quarto_render("diamond-sizes.qmd", output_format = "all")
## Documents
The previous chapter focused on the default `html` output.
There are a number of basic variations on that theme, generating different types of documents.
There are several basic variations on that theme, generating different types of documents.
For example:
- `pdf` makes a PDF with LaTeX (an open source document layout system), which you'll need to install.
- `pdf` makes a PDF with LaTeX (an open-source document layout system), which you'll need to install.
RStudio will prompt you if you don't already have it.
- `docx` for Microsoft Word (`.docx`) documents.
@ -93,7 +93,7 @@ For example:
- `ipynb` for Jupyter Notebooks (`.ipynb`).
Remember, when generating a document to share with decision makers, you can turn off the default display of code by setting global options in document YAML:
Remember, when generating a document to share with decision-makers, you can turn off the default display of code by setting global options in document YAML:
``` yaml
execute:
@ -113,7 +113,7 @@ format:
You can also use Quarto to produce presentations.
You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time.
Presentations work by dividing your content into slides, with a new slide beginning at each second (`##`) level header.
Additionally, first (`#`) level headers can be used to indicate the beginning of a new section with a section title slide that is by default centered in the middle.
Additionally, first (`#`) level headers indicate the beginning of a new section with a section title slide that is, by default, centered in the middle.
Quarto supports a variety of presentation formats, including:
@ -127,7 +127,7 @@ You can read more about creating presentations with Quarto at [https://quarto.or
## Dashboards
Dashboards are a useful way to communicate large amounts of information visually and quickly.
Dashboards are a useful way to communicate information visually and quickly.
A dashboard-like look can be achieved with Quarto using document layout options like sidebars, tabsets, multi-column layouts, etc.
For example, you can produce this dashboard:
@ -153,7 +153,7 @@ To learn more about Quarto component layouts, visit <https://quarto.org/docs/int
## Interactivity
Any HTML documents can contain interactive components.
Any HTML document can contain interactive components.
### htmlwidgets
@ -175,7 +175,7 @@ All the details are wrapped inside the package, so you don't need to worry about
There are many packages that provide htmlwidgets, including:
- **dygraphs**, [https://rstudio.github.io/dygraphs](https://rstudio.github.io/dygraphs/){.uri}, for interactive time series visualisations.
- **dygraphs**, [https://rstudio.github.io/dygraphs](https://rstudio.github.io/dygraphs/){.uri}, for interactive time series visualizations.
- **DT**, [https://rstudio.github.io/DT/](https://rstudio.github.io/DT){.uri}, for interactive tables.
@ -183,16 +183,16 @@ There are many packages that provide htmlwidgets, including:
- **DiagrammeR**, <https://rich-iannone.github.io/DiagrammeR> for diagrams (like flow charts and simple node-link diagrams).
To learn more about htmlwidgets and see a more complete list of packages that provide them visit <https://www.htmlwidgets.org>.
To learn more about htmlwidgets and see a complete list of packages that provide them, visit <https://www.htmlwidgets.org>.
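As a quick sketch, an htmlwidget drops into a chunk like ordinary R code; for example, with DT (assuming the package is installed):

```{r}
#| eval: false
# Renders an interactive, searchable table in HTML output
DT::datatable(mtcars)
```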
### Shiny
htmlwidgets provide **client-side** interactivity --- all the interactivity happens in the browser, independently of R.
On one hand, that's great because you can distribute the HTML file without any connection to R.
On the one hand, that's great because you can distribute the HTML file without any connection to R.
However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript.
An alternative approach is to use **shiny**, a package that allows you to create interactivity using R code, not JavaScript.
To call Shiny code from an Quarto document, add `server: shiny` to the YAML header:
To call Shiny code from a Quarto document, add `server: shiny` to the YAML header:
``` yaml
title: "Shiny Web App"
@ -217,8 +217,8 @@ And you also need a code chunk with chunk option `context: server` which contain
#| echo: false
#| out-width: null
#| fig-alt: |
#| Two input boxes on top of each other. Top one says "What is your
#| name?", the bottom one "How old are you?".
#| Two input boxes on top of each other. Top one says, "What is your
#| name?", the bottom, "How old are you?".
knitr::include_graphics("quarto/quarto-shiny.png")
```
@ -228,14 +228,14 @@ You can then refer to the values with `input$name` and `input$age`, and the code
We can't show you a live shiny app here because shiny interactions occur on the **server-side**.
This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on.
This introduces a logistical issue: Shiny apps need a Shiny server to be run online.
When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public facing Shiny server if you want to publish this sort of interactivity online.
When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public-facing Shiny server if you want to publish this sort of interactivity online.
That's the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.
To learn more about Shiny, we recommend reading *Mastering Shiny* by Hadley Wickham, [https://mastering-shiny.org](https://mastering-shiny.org/).
## Websites and books
With a little additional infrastructure you can use Quarto to generate a complete website or book:
With a bit of additional infrastructure, you can use Quarto to generate a complete website or book:
- Put your `.qmd` files in a single directory.
`index.qmd` will become the home page.
@ -286,16 +286,16 @@ See <https://quarto.org/docs/output-formats/all-formats.html> for a list of even
## Learning more
To learn more about effective communication in these different formats we recommend the following resources:
To learn more about effective communication in these different formats, we recommend the following resources:
- To improve your presentation skills, try [*Presentation Patterns*](https://amzn.com/0321820800), by Neal Ford, Matthew McCollough, and Nathaniel Schutta.
- To improve your presentation skills, try [*Presentation Patterns*](https://presentationpatterns.com/) by Neal Ford, Matthew McCullough, and Nathaniel Schutta.
It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.
- If you give academic talks, you might like the [*Leek group guide to giving talks*](https://github.com/jtleek/talkguide).
- We haven't taken it ourselves, but we've heard good things about Matt McGarrity's online course on public speaking: <https://www.coursera.org/learn/public-speaking>.
- If you are creating a lot of dashboards, make sure to read Stephen Few's [*Information Dashboard Design: The Effective Visual Communication of Data*](https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167).
- If you are creating many dashboards, make sure to read Stephen Few's [*Information Dashboard Design: The Effective Visual Communication of Data*](https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167).
It will help you create dashboards that are truly useful, not just pretty to look at.
- Effectively communicating your ideas often benefits from some knowledge of graphic design.
@ -10,21 +10,21 @@ status("polishing")
## Introduction
In @sec-strings, you learned a whole bunch of useful functions for working with strings.
In this chapter we'll focusing on functions that use **regular expressions**, a concise and powerful language for describing patterns within strings.
This chapter will focus on functions that use **regular expressions**, a concise and powerful language for describing patterns within strings.
The term "regular expression" is a bit of a mouthful, so most people abbreviate it to "regex"[^regexps-1] or "regexp".
[^regexps-1]: You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).
The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis.
We'll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping).
Next, we'll talk about some of the other types of patterns that stringr functions can work with, and the various "flags" that allow you to tweak the operation of regular expressions.
We'll finish up with a survey of other places in the tidyverse and base R where you might use regexes.
Next, we'll talk about some of the other types of patterns that stringr functions can work with and the various "flags" that allow you to tweak the operation of regular expressions.
We'll finish with a survey of other places in the tidyverse and base R where you might use regexes.
### Prerequisites
::: callout-important
This chapter relies on features only found in tidyr 1.3.0 which are still in development.
If you want to live life on the edge, you can get the dev versions with `devtools::install_github("tidyverse/tidyr")`.
This chapter relies on features only found in tidyr 1.3.0, which is still in development.
If you want to live on the edge, you can get the dev version with `devtools::install_github("tidyverse/tidyr")`.
:::
In this chapter, we'll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.
@ -37,7 +37,7 @@ library(tidyverse)
library(babynames)
```
Through this chapter we'll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:
Through this chapter, we'll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:
- `fruit` contains the names of 80 fruits.
- `words` contains 980 common English words.
@ -78,9 +78,9 @@ str_view(fruit, "a...e")
**Quantifiers** control how many times a pattern can match:
- `?` makes a pattern optional (i.e. it matches 0 or 1 times)
- `+` lets a pattern repeat (i.e. it matches at least once)
- `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
- `?` makes a pattern optional (i.e., it matches 0 or 1 times)
- `+` lets a pattern repeat (i.e., it matches at least once)
- `*` lets a pattern be optional or repeat (i.e., it matches any number of times, including 0).
```{r}
# ab? matches an "a", optionally followed by a "b".
@ -93,7 +93,7 @@ str_view(c("a", "ab", "abb"), "ab+")
str_view(c("a", "ab", "abb"), "ab*")
```
**Character classes** are defined by `[]` and let you match a set set of characters, e.g. `[abcd]` matches "a", "b", "c", or "d".
**Character classes** are defined by `[]` and let you match a set of characters, e.g. `[abcd]` matches "a", "b", "c", or "d".
You can also invert the match by starting with `^`: `[^abcd]` matches anything **except** "a", "b", "c", or "d".
We can use this idea to find the words with three vowels or four consonants in a row:
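As a sketch of what that code might look like (our own illustration, since the diff elides the original chunk; the exact vowel pattern is an assumption):

```{r}
# Three vowels in a row, then four consonants in a row:
str_view(words, "[aeiou][aeiou][aeiou]")
str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
```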
@ -103,13 +103,13 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
```
You can combine character classes and quantifiers.
For example, the following regexp looks for two vowel followed by two or more consonants:
For example, the following regexp looks for two vowels followed by two or more consonants:
```{r}
str_view(words, "[aeiou][aeiou][^aeiou][^aeiou]+")
```
(We'll learn some more elegant ways to express these ideas in @sec-quantifiers.)
(We'll learn more elegant ways to express these ideas in @sec-quantifiers.)
You can use **alternation**, `|`, to pick between one or more alternative patterns.
For example, the following patterns look for fruits containing "apple", "pear", or "banana", or a repeated vowel.
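As a minimal sketch of such patterns (our own illustration, since the diff elides the original chunk):

```{r}
# "apple", "pear", or "banana" anywhere in the fruit name:
str_view(fruit, "apple|pear|banana")
# A repeated vowel, spelled out as five alternatives:
str_view(fruit, "aa|ee|ii|oo|uu")
```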
@ -128,11 +128,11 @@ Let's kick off that process by practicing with some useful stringr functions.
## Key functions {#sec-stringr-regex-funs}
Now that you've got the basics of regular expressions under your belt, let's use them with some stringr and tidyr functions.
In the following section, you'll learn about how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.
In the following section, you'll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.
### Detect matches
`str_detect()` returns a logical vector that is `TRUE` if the pattern matched an element of the character vector and `FALSE` otherwise:
`str_detect()` returns a logical vector that is `TRUE` if the pattern matches an element of the character vector and `FALSE` otherwise:
```{r}
str_detect(c("a", "b", "c"), "[aeiou]")
@ -159,7 +159,7 @@ It looks like they've radically increased in popularity lately!
#| A time series showing the proportion of baby names that contain a
#| lower case "x".
#| fig-alt: >
#| A timeseries showing the proportion of baby names that contain the letter x.
#| A time series showing the proportion of baby names that contain the letter x.
#| The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in
#| 1980, then increases rapidly to 16 per 1000 in 2019.
@ -213,7 +213,7 @@ There are three ways we could fix this:
- Add the upper case vowels to the character class: `str_count(name, "[aeiouAEIOU]")`.
- Tell the regular expression to ignore case: `str_count(name, regex("[aeiou]", ignore_case = TRUE))`. We'll talk more about this in @sec-flags.
- Use `str_to_lower()` to convert the names to lower case: `str_count(str_to_lower(name), "[aeiou]")`. You learned about this function in @sec-other-languages.
- Use `str_to_lower()` to convert the names to lower case: `str_count(str_to_lower(name), "[aeiou]")`. All three are sketched below.
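Here's a quick sketch comparing the three fixes on a small hypothetical vector (all three should agree):

```{r}
x <- c("Hadley", "MINE", "Garrett")
str_count(x, "[aeiouAEIOU]")
str_count(x, regex("[aeiou]", ignore_case = TRUE))
str_count(str_to_lower(x), "[aeiou]")
```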
This variety of approaches is pretty typical when working with strings --- there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string.
If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.
@ -10,23 +10,23 @@ status("polishing")
## Introduction
So far, you've used a bunch of strings without learning much about the details.
Now it's time to dive into them, learning what makes strings tick, and mastering some of the powerful string manipulation tool you have at your disposal.
Now it's time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.
We'll begin with the details of creating strings and character vectors.
You'll then dive into creating strings from data, then the opposite: extracting strings from data.
We'll then discuss tools that work with individual letters.
The chapter finishes off with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.
The chapter finishes with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.
We'll keep working with strings in the next chapter, where you'll learn more about the power of regular expressions.
### Prerequisites
::: callout-important
This chapter relies on features only found in tidyr 1.3.0 which are still in development.
If you want to live life on the edge, you can get the dev versions with `devtools::install_github("tidyverse/tidyr")`.
This chapter relies on features only found in tidyr 1.3.0, which is still in development.
If you want to live on the edge, you can get the dev version with `devtools::install_github("tidyverse/tidyr")`.
:::
In this chapter, we'll use functions from the stringr package which is part of the core tidyverse.
In this chapter, we'll use functions from the stringr package, which is part of the core tidyverse.
We'll also use the babynames data since it provides some fun strings to manipulate.
```{r}
@ -37,8 +37,8 @@ library(tidyverse)
library(babynames)
```
You can easily tell when you're using a stringr function because all stringr functions start with `str_`.
This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you jog your memory of which functions are available.
You can quickly tell when you're using a stringr function because all stringr functions start with `str_`.
This is particularly useful if you use RStudio because typing `str_` will trigger autocomplete, allowing you to jog your memory of the available functions.
```{r}
#| echo: false
@ -48,9 +48,9 @@ knitr::include_graphics("screenshots/stringr-autocomplete.png")
## Creating a string
We've created strings in passing earlier in the book, but didn't discuss the details.
We've created strings in passing earlier in the book but didn't discuss the details.
Firstly, you can create a string using either single quotes (`'`) or double quotes (`"`).
There's no difference in behavior between the two so in the interests of consistency the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"`, unless the string contains multiple `"`.
There's no difference in behavior between the two, so in the interests of consistency, the [tidyverse style guide](https://style.tidyverse.org/syntax.html#character-vectors) recommends using `"`, unless the string contains multiple `"`.
```{r}
string1 <- "This is a string"
@ -64,11 +64,11 @@ If you forget to close a quote, you'll see `+`, the continuation character:
+
+ HELP I'M STUCK IN A STRING
If this happens to you and you can't figure out which quote you need to close, press Escape to cancel, and try again.
If this happens to you and you can't figure out which quote to close, press Escape to cancel and try again.
### Escapes
To include a literal single or double quote in a string you can use `\` to "escape" it:
To include a literal single or double quote in a string, you can use `\` to "escape" it:
```{r}
double_quote <- "\"" # or '"'
@ -81,7 +81,7 @@ So if you want to include a literal backslash in your string, you'll need to esc
backslash <- "\\"
```
Beware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string).
Beware that the printed representation of a string is not the same as the string itself because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string).
To see the raw contents of the string, use `str_view()`[^strings-1]:
[^strings-1]: Or use the base R function `writeLines()`.
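A minimal sketch of the difference (our own example):

```{r}
x <- c("one\"two", "back\\slash")
x           # the printed representation shows the escapes
str_view(x) # the raw contents do not
```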
@ -96,7 +96,7 @@ str_view(x)
### Raw strings {#sec-raw-strings}
Creating a string with multiple quotes or backslashes gets confusing quickly.
To illustrate the problem, lets create a string that contains the contents of the code block where we define the `double_quote` and `single_quote` variables:
To illustrate the problem, let's create a string that contains the contents of the code block where we define the `double_quote` and `single_quote` variables:
```{r}
tricky <- "double_quote <- \"\\\"\" # or '\"'
@ -105,7 +105,7 @@ str_view(tricky)
```
That's a lot of backslashes!
(This is sometimes called [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).) To eliminate the escaping you can instead use a **raw string**[^strings-2]:
(This is sometimes called [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).) To eliminate the escaping, you can instead use a **raw string**[^strings-2]:
[^strings-2]: Available in R 4.0.0 and above.
@ -116,11 +116,11 @@ str_view(tricky)
```
A raw string usually starts with `r"(` and finishes with `)"`.
But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. `` `r"--()--" ``, `` `r"---()---" ``, etc. Raw strings are flexible enough to handle any text.
But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g., `` `r"--()--" ``, `` `r"---()---" ``, etc. Raw strings are flexible enough to handle any text.
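Here's a small sketch of raw strings in action (our own examples):

```{r}
# No escaping needed for quotes or backslashes inside r"(...)":
str_view(r"(a "quote" and a \ backslash)")
# Dashes make the delimiters unique when the string contains )":
str_view(r"--(tricky: )" appears inside)--")
```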
### Other special characters
As well as `\"`, `\'`, and `\\` there are a handful of other special characters that may come in handy. The most common are `\n`, newline, and `\t`, tab. You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in `?'"'`.
As well as `\"`, `\'`, and `\\`, there are a handful of other special characters that may come in handy. The most common are `\n`, newline, and `\t`, tab. You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`. This is a way of writing non-English characters that work on all systems. You can see the complete list of other special characters in `?'"'`.
```{r}
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
@ -129,7 +129,7 @@ str_view(x)
```
Note that `str_view()` uses a blue background for tabs to make them easier to spot.
One of the challenges of working with text is that there's a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.
One of the challenges of working with text is that there's a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.
### Exercises
@ -153,10 +153,10 @@ One of the challenges of working with text is that there's a variety of ways tha
## Creating many strings from data
Now that you've learned the basics of creating a string or two by "hand", we'll go into the details of creating strings from other strings.
This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame.
For example, to create a greeting you might combine "Hello" with a `name` variable.
We'll show you how to do this with `str_c()` and `str_glue()` and how you can you use them with `mutate()`.
That naturally raises the question of what string functions you might use with `summarize()`, so we'll finish this section with a discussion of `str_flatten()` which is a summary function for strings.
This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame.
For example, you might combine "Hello" with a `name` variable to create a greeting.
We'll show you how to do this with `str_c()` and `str_glue()` and how you can use them with `mutate()`.
That naturally raises the question of what string functions you might use with `summarize()`, so we'll finish this section with a discussion of `str_flatten()`, which is a summary function for strings.
### `str_c()`
@ -171,7 +171,7 @@ str_c("x", "y", "z")
str_c("Hello ", c("John", "Susan"))
```
`str_c()` is designed to be used with `mutate()` so it obeys the usual rules for recycling and missing values:
`str_c()` is designed to be used with `mutate()`, so it obeys the usual rules for recycling and missing values:
```{r}
set.seed(1410)
@ -179,7 +179,7 @@ df <- tibble(name = c(wakefield::name(3), NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))
```
If you want missing values to display in some other way, use `coalesce()`.
If you want missing values to display in another way, use `coalesce()`.
Depending on what you want, you might use it either inside or outside of `str_c()`:
```{r}
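# A sketch of both options (our own illustration; `df` is the data frame
# with a `name` column created above):
df |> mutate(
  greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),  # fix the NA input
  greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")   # fix the NA output
)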
@ -203,7 +203,7 @@ df |> mutate(greeting = str_glue("Hi {name}!"))
As you can see, `str_glue()` currently converts missing values to the string `"NA"`, unfortunately making it inconsistent with `str_c()`.
You also might wonder what happens if you need to include a regular `{` or `}` in your string.
If you guess that you'll need to somehow escape it, you're on the right track.
You're on the right track if you guess you'll need to escape it somehow.
The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like `\`, you double up the special characters:
```{r}
@ -213,7 +213,7 @@ df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
### `str_flatten()`
`str_c()` and `str_glue()` work well with `mutate()` because their output is the same length as their inputs.
What if you want a function that works well with `summarize()`, i.e. something that always returns a single string?
What if you want a function that works well with `summarize()`, i.e., something that always returns a single string?
That's the job of `str_flatten()`[^strings-5]: it takes a character vector and combines each element of the vector into a single string:
[^strings-5]: The base R equivalent is `paste()` used with the `collapse` argument.
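A minimal sketch of `str_flatten()` (our own example, since the diff elides the original chunk):

```{r}
str_flatten(c("x", "y", "z"))
str_flatten(c("x", "y", "z"), ", ")
str_flatten(c("x", "y", "z"), ", ", last = ", and ")
```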
@ -263,24 +263,24 @@ df |>
## Extracting data from strings
It's very common for multiple variables to be crammed together into a single string.
In this section you'll learn how to use four tidyr functions to extract them:
In this section, you'll learn how to use four tidyr functions to extract them:
- `df |> separate_longer_delim(col, delim)`
- `df |> separate_longer_position(col, width)`
- `df |> separate_wider_delim(col, delim, names)`
- `df |> separate_wider_position(col, widths)`
If you look closely you can see there's a common pattern here: `separate_`, then `longer` or `wider`, then `_`, then by `delim` or `position`.
That's because these four functions are composed from two simpler primitives:
If you look closely, you can see there's a common pattern here: `separate_`, then `longer` or `wider`, then `_`, then `delim` or `position`.
That's because these four functions are composed of two simpler primitives:
- `longer` makes input data frame longer, creating new rows; `wider` makes the input data frame wider, generating new columns.
- `longer` makes the input data frame longer, creating new rows; `wider` makes the input data frame wider, generating new columns.
- `delim` splits up a string with a delimiter like `", "` or `" "`; `position` splits at specified widths, like `c(3, 5, 2)`.
We'll come back the last member of this family, `separate_regex_wider()`, in @sec-regular-expressions.
It's the most flexible of the `wider` functions but you need to know something about regular expression before you can use it.
We'll return to the last member of this family, `separate_wider_regex()`, in @sec-regular-expressions.
It's the most flexible of the `wider` functions, but you need to know something about regular expressions before you can use it.
The next two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating in to columns.
We'll finish off my discussing the tools that the `wider` functions give you to diagnose problems.
The following two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns.
We'll finish off by discussing the tools that the `wider` functions give you to diagnose problems.
### Separating into rows
@ -293,7 +293,7 @@ df1 |>
separate_longer_delim(x, delim = ",")
```
It's rarer to see `separate_longer_position()` in the wild, but some older datasets do use very compact format where each character is used to record a value:
It's rarer to see `separate_longer_position()` in the wild, but some older datasets do use a very compact format where each character is used to record a value:
```{r}
df2 <- tibble(x = c("1211", "131", "21"))
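# (sketch) width = 1 puts each character in its own row:
df2 |>
  separate_longer_position(x, width = 1)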
@ -305,8 +305,8 @@ df2 |>
Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns.
They are slightly more complicated than their `longer` equivalents because you need to name the columns.
For example, in this following dataset `x` is made up of a code, an edition number, and a year, separated by `"."`.
To use `separate_wider_delim()` we supply the delimiter and the names in two arguments:
For example, in the following dataset, `x` is made up of a code, an edition number, and a year, separated by `"."`.
To use `separate_wider_delim()`, we supply the delimiter and the names in two arguments:
```{r}
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
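# (sketch) split on "." and name the three resulting columns:
df3 |>
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )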
@ -329,8 +329,8 @@ df3 |>
)
```
`separate_wider_position()` works a little differently, because you typically want to specify the width of each column.
So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies.
`separate_wider_position()` works a little differently because you typically want to specify the width of each column.
So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies.
You can omit values from the output by not naming them:
```{r}
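# (sketch, hypothetical data) named widths become columns; the unnamed
# width of 2 is matched but dropped from the output:
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 |>
  separate_wider_position(
    x,
    widths = c(year = 4, 2, state = 2)
  )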
@ -362,7 +362,7 @@ df |>
)
```
You'll notice that we get an error, but the error gives us some suggestions as to how you might proceed.
You'll notice that we get an error, but the error gives us some suggestions on how you might proceed.
Let's start by debugging the problem:
```{r}
@ -376,7 +376,7 @@ debug <- df |>
debug
```
When you use the debug mode you get three extra columns add to the output: `x_ok`, `x_pieces`, and `x_remainder` (if you separate variable with a different name, you'll get a different prefix).
When you use the debug mode, you get three extra columns added to the output: `x_ok`, `x_pieces`, and `x_remainder` (if you separate a variable with a different name, you'll get a different prefix).
Here, `x_ok` lets you quickly find the inputs that failed:
```{r}
@ -387,9 +387,9 @@ debug |> filter(!x_ok)
`x_remainder` isn't useful when there are too few pieces, but we'll see it again shortly.
Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating.
In that case, fix the problem upstream and make sure to remove `too_few = "debug"` to ensure that new problem become errors.
In that case, fix the problem upstream and make sure to remove `too_few = "debug"` to ensure that new problems become errors.
In other cases you may just want to fill in the missing pieces with `NA`s and move on.
In other cases, you may want to fill in the missing pieces with `NA`s and move on.
That's the job of `too_few = "align_start"` and `too_few = "align_end"`, which allow you to control where the `NA`s should go:
```{r}
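# (sketch, hypothetical data) pad short inputs with NAs on the right:
df <- tibble(x = c("1-1-1", "1-3", "1"))
df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start"
  )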
@ -416,7 +416,7 @@ df |>
)
```
But now when we debug the result, you can see the purpose of `x_remainder`:
But now, when we debug the result, you can see the purpose of `x_remainder`:
```{r}
debug <- df |>
@ -463,7 +463,7 @@ You'll learn how to find the length of a string, extract substrings, and handle
str_length(c("a", "R for data science", NA))
```
You could use this with `count()` to find the distribution of lengths of US babynames, and then with `filter()` to look at the longest names[^strings-7]:
You could use this with `count()` to find the distribution of lengths of US babynames and then with `filter()` to look at the longest names[^strings-7]:
[^strings-7]: Looking at these entries, we'd guess that the babynames data drops spaces or hyphens and truncates after 15 letters.
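A sketch of that workflow (our own guess at the elided code, assuming babynames is loaded):

```{r}
babynames |>
  count(length = str_length(name), wt = n)
babynames |>
  filter(str_length(name) == 15) |>
  count(name, wt = n, sort = TRUE)
```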
@ -510,14 +510,14 @@ babynames |>
### Long strings
Sometimes the reason you care about the length of a string is because you're trying to fit it into a label on a plot or in a table.
Sometimes you care about the length of a string because you're trying to fit it into a label on a plot or table.
stringr provides two useful tools for cases where your string is too long:
- `str_trunc(x, 30)` ensures that no string is longer than 30 characters, replacing any letters after 30 with `…`.
- `str_wrap(x, 30)` wraps a string, introducing new lines so that each line is at most 30 characters (it doesn't hyphenate, however, so any word longer than 30 characters will make a longer line).
The following code shows these functions in action with a made up string:
The following code shows these functions in action with a made-up string:
```{r}
x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
@ -534,14 +534,15 @@ str_view(str_wrap(x, 30))
## Non-English text {#sec-other-languages}
So far, we've focused on English-language text, which is particularly easy to work with for two reasons.
Firstly, the English alphabet is fairly simple: there are just 26 letters.
Firstly, the English alphabet is relatively simple: there are just 26 letters.
Secondly (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers.
Unfortunately we don't have room for a full treatment of non-English languages, but I wanted to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale dependent functions.
Unfortunately, we don't have room for a full treatment of non-English languages.
Still, we wanted to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.
### Encoding
When working with non-English text the first challenge is often the **encoding**.
To understand what's going on, we need to dive into the details of how computers represent strings.
When working with non-English text, the first challenge is often the **encoding**.
To understand what's going on, we need to dive into how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:
```{r}
@ -549,20 +550,20 @@ charToRaw("Hadley")
```
Each of these six hexadecimal numbers represents one letter: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the **American** Standard Code for Information Interchange.
The mapping from hexadecimal number to character is called the encoding, and in this case, the encoding is called ASCII.
ASCII does a great job of representing English characters because it's the **American** Standard Code for Information Interchange.
Things aren't so easy for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters.
For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages and Latin2 (aka ISO-8859-2) was used for Central European languages.
In the early days of computing, there were many competing standards for encoding non-English characters.
For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages, and Latin2 (aka ISO-8859-2) was used for Central European languages.
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols like emojis.
UTF-8 can encode just about every character used by humans today and many extra symbols like emojis.
readr uses UTF-8 everywhere.
This is a good default but will fail for data produced by older systems that don't use UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
If this happens, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times, you'll get complete gibberish.
For example, here are two inline CSVs with unusual encodings[^strings-8]:
[^strings-8]: Here I'm using the special `\x` to encode binary data directly into a string.
@ -576,7 +577,7 @@ x2 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
read_csv(x2)
```
To read these correctly you specify the encoding via the `locale` argument:
To read these correctly, you specify the encoding via the `locale` argument:
```{r}
#| message: false
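# (sketch) x1's encoding is our assumption; Latin1 is a common Western
# European legacy encoding:
read_csv(x1, locale = locale(encoding = "Latin1"))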
@ -588,7 +589,7 @@ read_csv(x2, locale = locale(encoding = "Shift-JIS"))
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
It's not foolproof and works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
```{r}
@ -596,12 +597,12 @@ guess_encoding(x1)
guess_encoding(x2)
```
Encodings are a rich and complex topic, and we've only scratched the surface here.
If you'd like to learn more we recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
Encodings are a rich and complex topic; we've only scratched the surface here.
If you'd like to learn more, we recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Letter variations
If you're working with individual letters (e.g. with `str_length()` and `str_sub()`) there's an important challenge if you're working with an language that has accents because letters might be represented as an individual character or by combing an unaccented letter (e.g. ü) with a diacritic mark (e.g. ¨).
If you're working with individual letters (e.g., with `str_length()` and `str_sub()`), there's an important challenge if you're working with a language that has accents because letters might be represented as an individual character or by combining an unaccented letter (e.g., ü) with a diacritic mark (e.g., ¨).
For example, this code shows two ways of representing ü that look identical:
```{r}
@ -609,14 +610,14 @@ u <- c("\u00fc", "u\u0308")
str_view(u)
```
But they have different lengths and the first characters are different:
But they have different lengths, and the first characters are different:
```{r}
str_length(u)
str_sub(u, 1, 1)
```
Finally note that these strings look differently when you compare them with `==`, for which is stringr provides the handy `str_equal()` function:
Finally, note that these strings compare as different with `==`, for which stringr provides the handy `str_equal()` function:
```{r}
u[[1]] == u[[2]]
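# (sketch) str_equal() compares after Unicode normalization, so it
# returns TRUE where == returns FALSE:
str_equal(u[[1]], u[[2]])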
@ -627,17 +628,17 @@ str_equal(u[[1]], u[[2]])
### Locale-dependent functions
Finally, there are a handful of stringr functions whose behavior depends on your **locale**.
A locale is similar to a language, but includes an optional region specifier to handle regional variations within a language.
A locale is specified by lower-case language abbreviation, optionally followed by a `_` and a upper-case region identifier.
A locale is similar to a language but includes an optional region specifier to handle regional variations within a language.
A locale is specified by a lower-case language abbreviation, optionally followed by a `_` and an upper-case region identifier.
For example, "en" is English, "en_GB" is British English, and "en_US" is American English.
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported in stringr by looking at `stringi::stri_locale_list()`.
Base R string functions automatically use the locale set by your operating system.
This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in different country.
To avoid this problem, stringr defaults to using English rules, by using the "en" locale, and requires you to specify the `locale` argument to override it.
Fortunately there are two sets of functions where the locale really matters: changing case and sorting.
This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country.
To avoid this problem, stringr defaults to using English rules by using the "en" locale and requires you to specify the `locale` argument to override it.
Fortunately, there are two sets of functions where the locale really matters: changing case and sorting.
**T**he rules for changing case are not the same in every language.
The rules for changing case are not the same in every language.
For example, Turkish has two i's: with and without a dot, and it capitalizes them in a different way to English:
```{r}
@ -645,7 +646,7 @@ str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
Sorting strings depends on the order of the alphabet, and order of the alphabet is not the same in every language[^strings-9]!
Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language[^strings-9]!
Here's an example: in Czech, "ch" is a compound letter that appears after `h` in the alphabet.
[^strings-9]: Sorting in languages that don't have an alphabet, like Chinese, is more complicated still.
@ -655,10 +656,10 @@ str_sort(c("a", "c", "ch", "h", "z"))
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
```
This also comes up when sorting strings with `dplyr::arrange()` which is why it also has a `locale` argument.
This also comes up when sorting strings with `dplyr::arrange()`, which is why it also has a `locale` argument.
## Summary
In this chapter you've learned about some of the power of the stringr package: you learned how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings.
Now it's time to learn one of the most important and powerful tools for working withr strings: regular expressions.
Regular expressions are very concise, but very expressive, language for describing patterns within strings, and are the topic of the next chapter.
In this chapter, you've learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings.
Now it's time to learn one of the most important and powerful tools for working with strings: regular expressions.
Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter.