Eliminate parsing chapter (#1128)

Originally the plan was to have two chapters about reading text files, a brief introduction in the whole game, and then a more detailed exploration later in the book. This organisation didn't seem to work very well because the second chapter didn't have much content, so I've removed it, integrating its content elsewhere in the book:

* Column parsing types moved back into data-import
* Specifics of parsing various data types (e.g. `col_number()`, `col_date()`, and `col_factor()`) moved into the corresponding data type chapters.
* String encoding has moved to the strings chapter

While I was in here I also removed the unused `import-other.qmd`; we had planned to survey other options but I no longer think this is worth it.
Hadley Wickham 2022-11-17 09:56:08 -06:00 committed by GitHub
parent 7ff2b15021
commit bfa06daab5
11 changed files with 469 additions and 693 deletions

View File

@@ -49,7 +49,6 @@ book:
- part: wrangle.qmd
chapters:
- parsing.qmd
- spreadsheets.qmd
- databases.qmd
- rectangling.qmd

View File

@@ -23,26 +23,11 @@ In this chapter, you'll learn how to load flat files in R with the **readr** pac
library(tidyverse)
```
## Getting started
Most of readr's functions are concerned with turning flat files into data frames:
- `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon separated files (common in countries where `,` is used as the decimal place), `read_tsv()` reads tab delimited files, and `read_delim()` reads in files with any delimiter.
- `read_fwf()` reads fixed width files.
You can specify fields either by their widths with `fwf_widths()` or their position with `fwf_positions()`.
`read_table()` reads a common variation of fixed width files where columns are separated by white space.
- `read_log()` reads Apache style log files.
(But also check out [webreadr](https://github.com/Ironholds/webreadr) which is built on top of `read_log()` and provides many more helpful tools.)
These functions all have similar syntax: once you've mastered one, you can use the others with ease.
For the rest of this chapter we'll focus on `read_csv()`.
Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
## Reading data from a file
Here is what a simple CSV file with a row for column names (also commonly referred to as the header row) and six rows of data looks like.
To begin, we'll focus on the most common rectangular data file type: the CSV, which is short for comma-separated values.
Here is what a simple CSV file looks like.
The first row, commonly called the header row, gives the column names, and the following six rows give the data.
```{r}
#| echo: false
@@ -51,7 +36,6 @@ Here is what a simple CSV file with a row for column names (also commonly referr
read_lines("data/students.csv") |> cat(sep = "\n")
```
Note that the `,`s separate the columns.
@tbl-students-table shows a representation of the same data as a table.
```{r}
@@ -64,7 +48,8 @@ read_csv("data/students.csv") |>
knitr::kable()
```
The first argument to `read_csv()` is the most important: it's the path to the file to read.
We can read this file into R using `read_csv()`.
The first argument is the most important: it's the path to the file.
```{r}
#| message: true
@@ -72,158 +57,158 @@ The first argument to `read_csv()` is the most important: it's the path to the f
students <- read_csv("data/students.csv")
```
When you run `read_csv()` it prints out a message that tells you how many rows (excluding the header row) and columns the data has along with the delimiter used, and the column specifications (names of columns organized by the type of data the column contains).
When you run `read_csv()` it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).
It also prints out some information about how to retrieve the full column specification as well as how to quiet this message.
This message is an important part of readr, which we'll come back to in @sec-parsing-a-file on parsing a file.
This message is an important part of readr, and we'll come back to it in @sec-col-types.
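For example, you can pull out that full specification with readr's `spec()` (a minimal sketch; it assumes `students` was read with `read_csv()` as above):
```{r}
#| eval: false
# Retrieve the full column specification that read_csv() guessed
spec(students)
```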
You can also supply an inline csv file.
This is useful for experimenting with readr and for creating reproducible examples to share with others:
### Practical advice
```{r}
#| message: false
Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
Let's take another look at the `students` data with that in mind.
read_csv("a,b,c
1,2,3
4,5,6")
```
In both cases `read_csv()` uses the first line of the data for the column names, which is a very common convention.
There are two cases where you might want to tweak this behavior:
1. Sometimes there are a few lines of metadata at the top of the file.
You can use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop all lines that start with (e.g.) `#`.
```{r}
#| message: false
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
```
2. The data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label them sequentially from `X1` to `Xn`:
```{r}
#| message: false
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in @sec-strings.)
Alternatively you can pass `col_names` a character vector which will be used as the column names:
```{r}
#| message: false
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```
Another option that commonly needs tweaking is `na`: this specifies the value (or values) that are used to represent missing values in your file:
```{r}
#| message: false
read_csv("a,b,c\n1,2,.", na = ".")
```
This is all you need to know to read \~75% of CSV files that you'll encounter in practice.
You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`.
To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
### First steps
Let's take another look at the `students` data.
In the `favourite.food` column, there are a bunch of food items and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
This is something we can address using the `na` argument.
```{r}
#| message: false
students <- read_csv("data/students.csv", na = c("N/A", ""))
students
```
Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
For example, the column names in the `students` file we read in are formatted in non-standard ways.
You might consider renaming them one by one with `dplyr::rename()` or you might use the `janitor::clean_names()` function to turn them all into snake case at once.[^data-import-1]
This function takes in a data frame and returns a data frame with variable names converted to snake case.
You might also notice that the `Student ID` and `Full Name` columns are surrounded by back ticks.
That's because they contain spaces, breaking R's usual rules for variable names.
To refer to them, you need to use those back ticks:
```{r}
students |>
rename(
student_id = `Student ID`,
full_name = `Full Name`
)
```
An alternative approach is to use `janitor::clean_names()`, which uses some heuristics to turn them all into snake case at once[^data-import-1].
[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use `|>`.
```{r}
#| message: false
library(janitor)
students |>
clean_names()
students |> janitor::clean_names()
```
Another common task after reading in data is to consider variable types.
For example, `meal_plan` is a categorical variable with a known set of possible values.
In R, factors can be used to work with categorical variables.
We can convert this variable to a factor using the `factor()` function.
For example, `meal_plan` is a categorical variable with a known set of possible values, which in R should be represented as a factor:
```{r}
students |>
janitor::clean_names() |>
mutate(
meal_plan = factor(meal_plan)
)
```
Note that the values in the `meal_plan` variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
You'll learn more about factors in @sec-factors.
Before you move on to analyzing these data, you'll probably want to fix the `age` column as well: currently it's a character variable because of the one observation that is typed out as `five` instead of a numeric `5`.
We discuss the details of fixing this issue in @sec-import-spreadsheets.
```{r}
students <- students |>
clean_names() |>
mutate(meal_plan = factor(meal_plan))
janitor::clean_names() |>
mutate(
meal_plan = factor(meal_plan),
age = parse_number(if_else(age == "five", "5", age))
)
students
```
Note that the values in the `meal_plan` variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
### Other arguments
Before you move on to analyzing these data, you'll probably want to fix the `age` column as well: currently it's a character variable because of the one observation that is typed out as `five` instead of a numeric `5`.
We discuss fixing this issue in further detail in @sec-import-spreadsheets.
### Compared to base R
If you've used R before, you might wonder why we're not using `read.csv()`.
There are a few good reasons to favor readr functions over the base equivalents:
- They are typically much faster (\~10x) than their base equivalents.
Long running jobs have a progress bar, so you can see what's happening.
If you're looking for raw speed, try `data.table::fread()`.
It doesn't fit quite so well into the tidyverse, but it can be quite a bit faster.
- They produce tibbles, and they don't use row names or munge the column names.
These are common sources of frustration with the base R functions (see the sketch after this list).
- They are more reproducible.
Base R functions inherit some behavior from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
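To see a couple of these differences side by side, here's a minimal sketch (both calls are standard; the output is easiest to compare interactively):
```{r}
#| eval: false
# Base R: returns a data.frame and converts non-syntactic names
# (e.g. `Student ID` becomes `Student.ID`)
read.csv("data/students.csv")

# readr: returns a tibble and leaves the column names untouched
read_csv("data/students.csv")
```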
### Non-syntactic names
It's possible for a CSV file to have column names that are not valid R variable names; we refer to these as **non-syntactic** names.
For example, the variables might not start with a letter or they might contain unusual characters like a space:
There are a couple of other important arguments that we need to mention, and they'll be easier to demonstrate if we first show you a handy trick: `read_csv()` can read csv files that you've created in a string:
```{r}
df <- read_csv("data/non-syntactic.csv", col_types = list())
df
#| message: false
read_csv(
"a,b,c
1,2,3
4,5,6"
)
```
You'll notice that they print surrounded by backticks, which you'll need to use when referring to them in other functions:
Usually `read_csv()` uses the first line of the data for the column names, which is a very common convention.
But sometimes there are a few lines of metadata at the top of the file.
You can use `skip = n` to skip the first `n` lines or use `comment = "#"` to drop all lines that start with (e.g.) `#`:
```{r}
df |> relocate(`2000`, .after = `:)`)
#| message: false
read_csv(
"The first line of metadata
The second line of metadata
x,y,z
1,2,3",
skip = 2
)
read_csv(
"# A comment I want to skip
x,y,z
1,2,3",
comment = "#"
)
```
These values only need special handling when they appear in column names.
If you turn them into data (e.g. with `pivot_longer()`) they are just regular strings:
In other cases, the data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label them sequentially from `X1` to `Xn`:
```{r}
df |> pivot_longer(everything())
#| message: false
read_csv(
"1,2,3
4,5,6",
col_names = FALSE
)
```
Alternatively you can pass `col_names` a character vector which will be used as the column names:
```{r}
#| message: false
read_csv(
"1,2,3
4,5,6",
col_names = c("x", "y", "z")
)
```
These arguments are all you need to know to read the majority of CSV files that you'll encounter in practice.
(For the rest, you'll need to inspect your `.csv` file carefully and read the documentation for `read_csv()`'s many other arguments.)
### Other file types
Once you've mastered `read_csv()`, using readr's other functions is straightforward; it's just a matter of knowing which function to reach for:
- `read_csv2()` reads semicolon separated files.
These use `;` instead of `,` to separate fields, and are common in countries that use `,` as the decimal marker (see the sketch after this list).
- `read_tsv()` reads tab delimited files.
- `read_delim()` reads in files with any delimiter, attempting to automatically guess the delimiter if you don't specify it.
- `read_fwf()` reads fixed width files.
You can specify fields either by their widths with `fwf_widths()` or their position with `fwf_positions()`.
- `read_table()` reads a common variation of fixed width files where columns are separated by white space.
- `read_log()` reads Apache style log files.
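Here's the promised sketch: `read_csv2()` applied to a small inline file, with `;` between fields and `,` as the decimal mark (the data is invented for illustration):
```{r}
#| message: false
# Semicolon-delimited data with a decimal comma, as is common in Europe
read_csv2("x;y\n1,5;2,5")
```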
### Exercises
1. What function would you use to read a file where fields were separated with "\|"?
@@ -269,6 +254,115 @@ df |> pivot_longer(everything())
)
```
## Controlling column types {#sec-col-types}
A CSV file doesn't contain any information about the type of each variable (i.e. whether it's a logical, number, string, etc), so readr will try to guess the type.
This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and if needed, how to supply the column types yourself.
Finally, we'll mention a couple of general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.
### Guessing types
readr uses a heuristic to figure out the column types.
For each column, it pulls the values of 1,000[^data-import-2] rows spaced evenly from the first row to the last, ignoring any missing values.
It then works through the following questions:
[^data-import-2]: You can override the default of 1000 with the `guess_max` argument.
- Does it contain only `F`, `T`, `FALSE`, or `TRUE` (ignoring case)? If so, it's a logical.
- Does it contain only numbers (e.g. `1`, `-4.5`, `5e6`, `Inf`)? If so, it's a number.
- Does it match the ISO8601 standard? If so, it's a date or date-time. (We'll come back to date/times in more detail in @sec-creating-datetimes).
- Otherwise, it must be a string.
You can see that behavior in action in this simple example:
```{r}
read_csv("
logical,numeric,date,string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
T,Inf,2021-02-16,ghi"
)
```
This heuristic works well if you have a clean dataset, but in real life you'll encounter a selection of weird and wonderful failures.
### Missing values, column types, and problems
The most common way column detection fails is that a column contains unexpected values and you get a character column instead of a more specific type.
One of the most common causes for this is a missing value recorded using something other than the `NA` that readr expects.
Take this simple 1 column CSV file as an example:
```{r}
csv <- "
x
10
.
20
30"
```
If we read it without any additional arguments, `x` becomes a character column:
```{r}
df <- read_csv(csv)
```
In this very small case, you can easily see the missing value `.`.
But what happens if you have thousands of rows with only a few missing values represented by `.`s speckled amongst them?
One approach is to tell readr that `x` is a numeric column, and then see where it fails.
You can do that with the `col_types` argument, which takes a named list:
```{r}
df <- read_csv(csv, col_types = list(x = col_double()))
```
Now `read_csv()` reports that there was a problem, and tells us we can find out more with `problems()`:
```{r}
problems(df)
```
This tells us that there was a problem in row 3, col 1 where readr expected a double but got a `.`.
That suggests this dataset uses `.` for missing values.
If we then set `na = "."`, the automatic guessing succeeds, giving us the numeric column that we want:
```{r}
df <- read_csv(csv, na = ".")
```
### Column types
readr provides a total of nine column types for you to use:
- `col_logical()` and `col_double()` read logicals and real numbers. They're relatively rarely needed (except as above), since readr will usually guess them for you.
- `col_integer()` reads integers. We rarely distinguish integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
- `col_character()` reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. a long series of digits that identifies an object but that it doesn't make sense to (e.g.) divide in half.
- `col_factor()`, `col_date()`, and `col_datetime()` create factors, dates, and date-times respectively; you'll learn more about those when we get to those data types in @sec-factors and @sec-date-and-times (both functions appear in the sketch after this list).
- `col_number()` is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You'll learn more about it in @sec-numbers.
- `col_skip()` skips a column so it's not included in the result.
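Here's the promised sketch of supplying several of these at once via a named list (the data is invented for illustration; columns you don't name are still guessed):
```{r}
csv <- "
name,meal,date
Amy,breakfast,2022-01-02
Bob,lunch,2022-01-03"

read_csv(csv, col_types = list(
  meal = col_factor(levels = c("breakfast", "lunch", "dinner")),
  date = col_date()
))
```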
It's also possible to override the default column type by switching from `list()` to `cols()`:
```{r}
csv <- "
x,y,z
1,2,3"
read_csv(csv, col_types = cols(.default = col_character()))
```
Another useful helper is `cols_only()` which will read in only the columns you specify:
```{r}
read_csv(
"x,y,z
1,2,3",
col_types = cols_only(x = col_character())
)
```
## Reading data from multiple files {#sec-readr-directory}
Sometimes your data is split across multiple files instead of being contained in a single file.
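The basic pattern looks something like this (a sketch; the file names are placeholders, and `read_csv()`'s `id` argument records which file each row came from):
```{r}
#| eval: false
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
```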
@@ -285,7 +379,7 @@ This is especially helpful in circumstances where the files you're reading in do
If you have many files you want to read in, it can get cumbersome to write out their names as a list.
Instead, you can use the base `list.files()` function to find the files for you by matching a pattern in the file names.
You'll learn more about these patterns in @sec-strings.
You'll learn more about these patterns in @sec-regular-expressions.
```{r}
sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
@@ -295,13 +389,7 @@ sales_files
## Writing to a file {#sec-writing-to-a-file}
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
Both functions increase the chances of the output file being read back in correctly by:
- Always encoding strings in UTF-8.
- Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.
If you want to export a csv file to Excel, use `write_excel_csv()` --- this writes a special character (a "byte order mark") at the start of the file which tells Excel that you're using the UTF-8 encoding.
Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.
The most important arguments are `x` (the data frame to save), and `file` (the location to save it).
You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
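A minimal sketch of those arguments in use (the path is illustrative):
```{r}
#| eval: false
# Write missing values as empty strings instead of "NA"
write_csv(students, "students.csv", na = "")

# Add rows to the end of an existing file instead of overwriting it
write_csv(students, "students.csv", append = TRUE)
```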
@@ -325,7 +413,7 @@ read_csv("students-2.csv")
```
This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load them in.
There are two alternatives:
There are two main options:
1. `write_rds()` and `read_rds()` are uniform wrappers around the base functions `readRDS()` and `saveRDS()`.
These store data in R's custom binary format called RDS:
@@ -335,14 +423,14 @@ There are two alternatives:
read_rds("students.rds")
```
2. The feather package implements a fast binary file format that can be shared across programming languages:
2. The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages:
```{r}
#| eval: false
library(feather)
write_feather(students, "students.feather")
read_feather("students.feather")
library(arrow)
write_parquet(students, "students.parquet")
read_parquet("students.parquet")
#> # A tibble: 6 × 5
#> student_id full_name favourite_food meal_plan age
#> <dbl> <chr> <chr> <fct> <dbl>
@@ -354,8 +442,7 @@ There are two alternatives:
#> 6 6 Güvenç Attila Ice cream Lunch only 6
```
Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in @sec-rectangling); feather currently does not.
Parquet tends to be much faster than RDS and is usable outside of R, but does require you to install the arrow package.
```{r}
#| include: false

View File

@@ -217,8 +217,8 @@ billboard |>
You might also wonder what happens if a song is in the top 100 for more than 76 weeks?
We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.
This data is now tidy, but we could make future computation a bit easier by converting `week` into a number using `mutate()` and `parse_number()`.
You'll learn more about `parse_number()` and friends in @sec-data-import.
This data is now tidy, but we could make future computation a bit easier by converting `week` into a number using `mutate()` and `readr::parse_number()`.
`parse_number()` is a handy function that will extract the first number from a string, ignoring all other text.
```{r}
billboard_tidy <- billboard |>

View File

@@ -1,2 +0,0 @@
:),x y,2000
smile,space,number

View File

@@ -46,7 +46,7 @@ library(lubridate)
library(nycflights13)
```
## Creating date/times
## Creating date/times {#sec-creating-datetimes}
There are three types of date/time data that refer to an instant in time:
@@ -74,20 +74,84 @@ today()
now()
```
Otherwise, there are three ways you're likely to create a date/time:
Otherwise, the following sections describe the four ways you're likely to create a date/time:
- While reading a file with readr.
- From a string.
- From individual date-time components.
- From an existing date/time object.
They work as follows.
### During import
If your CSV contains an ISO8601 date or date-time, you don't need to do anything; readr will automatically recognize it:
```{r}
#| message: false
csv <- "
date,datetime
2022-01-02,2022-01-02 05:12
"
read_csv(csv)
```
If you haven't heard of **ISO8601** before, it's an international standard[^datetimes-2] for writing dates where the components of a date are organised from biggest to smallest separated by `-`. For example, in ISO8601 May 3 2022 is `2022-05-03`. ISO8601 dates can also include times, where hour, minute, and second are separated by `:`, and the date and time components are separated by either a `T` or a space.
For example, you could write 4:26pm on May 3 2022 as either `2022-05-03 16:26` or `2022-05-03T16:26`.
[^datetimes-2]: <https://xkcd.com/1179/>
For other date-time formats, you'll need to use `col_types` plus `col_date()` or `col_datetime()` along with a date-time format.
The date-time format used by readr is a standard used across many programming languages, describing a date component with a `%` followed by a single character.
For example, `%Y-%m-%d` specifies a date that's a year, `-`, month (as a number), `-`, day.
@tbl-date-formats lists all the options.
| Type | Code | Meaning | Example |
|-------|-------|--------------------------------|-----------------|
| Year | `%Y` | 4 digit year | 2021 |
| | `%y` | 2 digit year | 21 |
| Month | `%m` | Number | 2 |
| | `%b` | Abbreviated name | Feb |
| | `%B` | Full name | February |
| Day | `%d` | Two digits | 02 |
| | `%e` | One or two digits | 2 |
| Time | `%H` | 24-hour hour | 13 |
| | `%I` | 12-hour hour | 1 |
| | `%p` | AM/PM | pm |
| | `%M` | Minutes | 35 |
| | `%S` | Seconds | 45 |
| | `%OS` | Seconds with decimal component | 45.35 |
| | `%Z` | Time zone name | America/Chicago |
| | `%z` | Offset from UTC | +0800 |
| Other | `%.` | Skip one non-digit | : |
| | `%*` | Skip any number of non-digits | |
: All date formats understood by readr {#tbl-date-formats}
And this code shows a few options applied to a very ambiguous date:
```{r}
#| message: false
csv <- "
date
01/02/15
"
read_csv(csv, col_types = cols(date = col_date("%m/%d/%y")))
read_csv(csv, col_types = cols(date = col_date("%d/%m/%y")))
read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))
```
Note that no matter how you specify the date format, it's always displayed the same way once you get it into R.
If you're using `%b` or `%B` and working with non-English dates, you'll also need to provide a `locale()`.
See the list of built-in languages in `date_names_langs()`, or create your own with `date_names()`.
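For example, here's a sketch of reading a French date with the built-in "fr" date names:
```{r}
read_csv(
  "date
1 janvier 2015",
  col_types = cols(date = col_date("%d %B %Y")),
  locale = locale("fr")
)
```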
### From strings
Date/time data often comes as strings.
You've seen one approach to parsing strings into date-times in [date-times](#readr-datetimes).
Another approach is to use the helpers provided by lubridate.
They automatically work out the format once you specify the order of the components.
The date-time specification language is powerful, but requires careful analysis of the date format.
An alternative approach is to use lubridate's helpers, which attempt to automatically determine the format once you specify the order of the components.
To use them, identify the order in which year, month, and day appear in your dates, then arrange "y", "m", and "d" in the same order.
That gives you the name of the lubridate function that will parse your date.
For example:
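(A sketch with representative calls; the specific dates are just for illustration.)
```{r}
library(lubridate)

ymd("2017-01-31")
mdy("January 31st, 2017")
dmy("31-Jan-2017")
```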
@@ -217,7 +281,7 @@ as_date(365 * 10 + 2)
2. What does the `tzone` argument to `today()` do?
Why is it important?
3. Use the appropriate lubridate function to parse each of the following dates:
3. For each of the following date-times show how you'd parse it using a readr column-specification and a lubridate function.
```{r}
d1 <- "January 1, 2010"
@@ -225,6 +289,8 @@ as_date(365 * 10 + 2)
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
```
## Date-time components

View File

@@ -77,10 +77,11 @@ y2 <- factor(x2, levels = month_levels)
y2
```
If you want a warning, you can use `readr::parse_factor()`:
This seems risky, so you might want to use `fct()` instead:
```{r}
y2 <- parse_factor(x2, levels = month_levels)
#| error: true
y2 <- fct(x2, levels = month_levels)
```
If you omit the levels, they'll be taken from the data in alphabetical order:
@@ -106,6 +107,19 @@ If you ever need to access the set of valid levels directly, you can do so with
levels(f2)
```
You can also create a factor when reading your data with readr by using `col_factor()`:
```{r}
csv <- "
month,value
Jan,12
Feb,56
Mar,12"
df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
df$month
```
## General Social Survey
For the rest of this chapter, we're going to use `forcats::gss_cat`.

View File

@@ -1,10 +0,0 @@
# Other types of data {#sec-import-other}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
<!--# TO DO: Write chapter. -->

View File

@@ -12,8 +12,8 @@ status("polishing")
Numeric vectors are the backbone of data science, and you've already used them a bunch of times earlier in the book.
Now it's time to systematically survey what you can do with them in R, ensuring that you're well situated to tackle any future problem involving numeric vectors.
We'll start by going into a little more detail of `count()` before diving into various numeric transformations that pair well with `mutate()`.
You'll then learn about more general transformations that can be applied to other types of vector, but are often used with numeric vectors.
We'll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of `count()`.
Then we'll dive into various numeric transformations that pair well with `mutate()`, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors.
We'll finish off by covering the summary functions that pair well with `summarise()` and show you how they can also be used with `mutate()`.
### Prerequisites
@@ -30,6 +30,27 @@ library(tidyverse)
library(nycflights13)
```
## Making numbers
In most cases, you'll get numbers already recorded in one of R's numeric types: integer or double.
In some cases, however, you'll encounter them as strings, possibly because you've created them by pivoting from column headers or something has gone wrong in your data import process.
readr provides two useful functions for parsing strings into numbers: `parse_double()` and `parse_number()`.
Use `parse_double()` when you have numbers that have been written as strings:
```{r}
x <- c("1.2", "5.6", "1e3")
parse_double(x)
```
Use `parse_number()` when the string contains non-numeric text that you want to ignore.
This is particularly useful for currency data and percentages:
```{r}
x <- c("$1,234", "USD 3,513", "59%")
parse_number(x)
```
## Counts
It's surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with `count()`.
@@ -753,9 +774,9 @@ For example:
## Summary
You're likely already familiar with many tools for working with numbers, and in this chapter you'll have learned how they're realized in R.
You also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors like ranks and offsets.
Finally, we worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.
You're already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R.
You've also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors like ranks and offsets.
Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.
Over the next two chapters, we'll dive into working with strings with the stringr package.
Strings get two chapters because there really are two topics to cover: strings and regular expressions.
Strings are a big topic so they get two chapters, one on the fundamentals of strings and one on regular expressions.

View File

@@ -1,472 +0,0 @@
# Parsing {#sec-import-rectangular}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
Things that should be mentioned in this chapter:
- `rename_with`, use with janitor example: "Alternatively, we can also read the data in first and then rename the columns to follow the `snake_case` format with the `make_clean_names()` function from the **janitor** package. This is a handy approach if you have too many columns and don't want to write out the names of each, though it might not always result in the exact names you want for your columns, e.g. it won't shorten column names, it will only convert them to snake case."
- ...
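A sketch of that `rename_with()` pattern from the first item (assuming the janitor package is installed; `students` stands in for any data frame with messy names):
```{r}
#| eval: false
students |> rename_with(janitor::make_clean_names)
```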
<!--# Moved from original import chapter -->
```{r}
#| message: false
library(tidyverse)
```
## Parsing a vector
Before we get into the details of how readr reads files from disk, we need to take a little detour to talk about the `parse_*()` functions.
These functions take a character vector and return a more specialised vector like a logical, integer, or date:
```{r}
str(parse_logical(c("TRUE", "FALSE", "NA")))
str(parse_integer(c("1", "2", "3")))
str(parse_date(c("2010-01-01", "1979-10-14")))
```
These functions are useful in their own right, but are also an important building block for readr.
Once you've learned how the individual parsers work in this section, we'll circle back and see how they fit together to parse a complete file in the next section.
Like all functions in the tidyverse, the `parse_*()` functions are uniform: the first argument is a character vector to parse, and the `na` argument specifies which strings should be treated as missing:
```{r}
parse_integer(c("1", "231", ".", "456"), na = ".")
```
If parsing fails, you'll get a warning:
```{r}
x <- parse_integer(c("123", "345", "abc", "123.45"))
```
And the failures will be missing in the output:
```{r}
x
```
If there are many parsing failures, you'll need to use `problems()` to get the complete set.
This returns a tibble, which you can then manipulate with dplyr.
```{r}
problems(x)
```
Using parsers is mostly a matter of understanding what's available and how they deal with different types of input.
There are nine particularly important parsers:
1. `parse_logical()` and `parse_integer()` parse logicals and integers respectively.
There's basically nothing that can go wrong with these parsers so we won't describe them here further.
2. `parse_double()` is a strict numeric parser, and `parse_number()` is a flexible numeric parser.
These are more complicated than you might expect because different parts of the world write numbers in different ways.
3. `parse_character()` seems so simple that it shouldn't be necessary.
But one complication makes it quite important: character encodings.
4. `parse_factor()` creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
5. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to parse various date & time specifications.
These are the most complicated because there are so many different ways of writing dates.
The following sections describe these parsers in more detail.
### Numbers
It seems like it should be straightforward to parse a number, but three problems make it tricky:
1. People write numbers differently in different parts of the world.
For example, some countries use `.` in between the integer and fractional parts of a real number, while others use `,`.
2. Numbers are often surrounded by other characters that provide some context, like "\$1000" or "10%".
3. Numbers often contain "grouping" characters to make them easier to read, like "1,000,000", and these grouping characters vary around the world.
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place.
When parsing numbers, the most important option is the character you use for the decimal mark.
You can override the default value of `.` by creating a new locale and setting the `decimal_mark` argument:
```{r}
parse_double("1.23")
parse_double("1,23", locale = locale(decimal_mark = ","))
```
readr's default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English).
An alternative approach would be to try and guess the defaults from your operating system.
This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
`parse_number()` addresses the second problem: it ignores non-numeric characters before and after the number.
This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
```{r}
parse_number("$100")
parse_number("20%")
parse_number("It cost $123.45")
```
The final problem is addressed by the combination of `parse_number()` and the locale as `parse_number()` will ignore the "grouping mark":
```{r}
# Used in America
parse_number("$123,456,789")
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
```
### Strings {#sec-readr-strings}
It seems like `parse_character()` should be really simple --- it could just return its input.
Unfortunately life isn't so simple, as there are multiple ways to represent the same string.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:
```{r}
charToRaw("Hadley")
```
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the **American** Standard Code for Information Interchange.
Things get more complicated for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding.
For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages).
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing.
This is a good default, but will fail for data produced by older systems that don't understand UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example:
```{r}
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
x1
x2
```
To fix the problem you need to specify the encoding in `parse_character()`:
```{r}
parse_character(x1, locale = locale(encoding = "Latin1"))
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
```
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
```{r}
guess_encoding(charToRaw(x1))
guess_encoding(charToRaw(x2))
```
The first argument to `guess_encoding()` can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
Encodings are a rich and complex topic, and we've only scratched the surface here.
If you'd like to learn more we recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Factors {#sec-readr-factors}
R uses factors to represent categorical variables that have a known set of possible values.
Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
```{r}
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
```
But if you have many problematic entries, it's often easier to leave them as character vectors and then use the tools you'll learn about in [strings](#readr-strings) and [factors](#readr-factors) to clean them up.
### Dates, date-times, and times {#sec-readr-datetimes}
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight).
When called without any additional arguments:
- `parse_datetime()` expects an ISO8601 date-time.
ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
```{r}
parse_datetime("2010-10-01T2010")
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
```
This is the most important date/time standard, and if you work with dates and times frequently, we recommend reading <https://en.wikipedia.org/wiki/ISO_8601>.
- `parse_date()` expects a four digit year, a `-` or `/`, the month, a `-` or `/`, then the day:
```{r}
parse_date("2010-10-01")
```
- `parse_time()` expects the hour, `:`, minutes, optionally `:` and seconds, and an optional am/pm specifier:
```{r}
library(hms)
parse_time("01:10 am")
parse_time("20:10:01")
```
Base R doesn't have a great built in class for time data, so we use the one provided in the hms package.
If these defaults don't work for your data you can supply your own date-time `format`, built up of the following pieces:
Year
: `%Y` (4 digits).
: `%y` (2 digits); 00-69 -\> 2000-2069, 70-99 -\> 1970-1999.
Month
: `%m` (2 digits).
: `%b` (abbreviated name, like "Jan").
: `%B` (full name, "January").
Day
: `%d` (2 digits).
: `%e` (optional leading space).
Time
: `%H` 0-23 hour.
: `%I` 0-12, must be used with `%p`.
: `%p` AM/PM indicator.
: `%M` minutes.
: `%S` integer seconds.
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`).
Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time.
It is *not* Eastern Standard Time!
We'll come back to this \[time zones\].
: `%z` (as offset from UTC, e.g. `+0800`).
Non-digits
: `%.` skips one non-digit character.
: `%*` skips any number of non-digits.
The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions.
For example:
```{r}
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```
If you're using `%b` or `%B` with non-English month names, you'll need to set the `date_names` argument of `locale()`.
See the list of built-in languages in `date_names_langs()`, or if your language is not already included, create your own with `date_names()`.
```{r}
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
```
### Exercises
1. What are the most important arguments to `locale()`?
2. What happens if you try and set `decimal_mark` and `grouping_mark` to the same character?
What happens to the default value of `grouping_mark` when you set `decimal_mark` to ","?
What happens to the default value of `decimal_mark` when you set the `grouping_mark` to "."?
3. We didn't discuss the `date_format` and `time_format` options to `locale()`.
What do they do?
Construct an example that shows when they might be useful.
4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
5. What's the difference between `read_csv()` and `read_csv2()`?
6. What are the most common encodings used in Europe?
What are the most common encodings used in Asia?
Do some googling to find out.
7. Generate the correct format string to parse each of the following dates and times:
```{r}
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
```
## Parsing a file {#sec-parsing-a-file}
Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file.
There are two new things that you'll learn about in this section:
1. How readr automatically guesses the type of each column.
2. How to override the default specification.
### Strategy
To figure out the type of each column, readr reads the first 1000 rows and applies some (moderately conservative) heuristics.
You can emulate this process with a character vector using `guess_parser()`, which returns readr's best guess, and `parse_guess()` which uses that guess to parse the column:
```{r}
guess_parser("2010-10-01")
guess_parser("15:01")
guess_parser(c("TRUE", "FALSE"))
guess_parser(c("1", "5", "9"))
guess_parser(c("12,352,561"))
str(parse_guess("2010-10-10"))
```
The heuristic tries each of the following types, stopping when it finds a match:
- logical: contains only "F", "T", "FALSE", or "TRUE".
- integer: contains only numeric characters (and `-`).
- double: contains only valid doubles (including numbers like `4.5e-5`).
- number: contains valid doubles with the grouping mark inside.
- time: matches the default `time_format`.
- date: matches the default `date_format`.
- date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
### Problems
These defaults don't always work for larger files.
There are two basic problems:
1. The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general.
For example, you might have a column of doubles that only contains integers in the first 1000 rows.
2. The column might contain a lot of missing values.
If the first 1000 rows contain only `NA`s, readr will guess that it's a logical vector, whereas you probably want to parse it as something more specific.
readr contains a challenging CSV that illustrates both of these problems:
```{r}
challenge <- read_csv(readr_example("challenge.csv"))
```
(Note the use of `readr_example()` which finds the path to one of the files included with the package.)
There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures.
It's always a good idea to explicitly pull out the `problems()`, so you can explore them in more depth:
```{r}
problems(challenge)
```
A good strategy is to work column by column until there are no problems remaining.
Here we can see that there are a lot of parsing problems with the `y` column.
If we look at the last few rows, you'll see that they're dates stored in a character vector:
```{r}
tail(challenge)
```
That suggests we need to use a date parser instead.
To fix the call, start by copying and pasting the column specification into your original call:
```{r}
#| eval: false
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_logical()
)
)
```
Then you can fix the type of the `y` column by specifying that `y` is a date column:
```{r}
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
tail(challenge)
```
Every `parse_xyz()` function has a corresponding `col_xyz()` function.
You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
We highly recommend always supplying `col_types`, building up from the print-out provided by readr.
This ensures that you have a consistent and reproducible data import script.
If you rely on the default guesses and your data changes, readr will continue to read it in.
If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
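For example (a minimal sketch, reusing the challenge dataset from above):
```{r}
#| eval: false
challenge <- read_csv(readr_example("challenge.csv"))
stop_for_problems(challenge)  # throws an error if any parsing problems were recorded
```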
### Other strategies
There are a few other general strategies to help you parse files:
- In the previous example, we just got unlucky: if we look at just one more row than the default, we can correctly parse in one shot:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
challenge2
```
- Sometimes it's easier to diagnose problems if you just read in all the columns as character vectors:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"),
col_types = cols(.default = col_character())
)
```
This is particularly useful in conjunction with `type_convert()`, which applies the parsing heuristics to the character columns in a data frame.
```{r}
df <- tribble(
~x, ~y,
"1", "1.21",
"2", "2.32",
"3", "4.56"
)
df
# Note the column types
type_convert(df)
```
- If you're reading a very large file, you might want to set `n_max` to a smallish number like 10,000 or 100,000.
That will accelerate your iterations while you eliminate common problems (see the sketch after this list).
- If you're having major parsing problems, sometimes it's easier to just read into a character vector of lines with `read_lines()`, or even a character vector of length 1 with `read_file()`.
Then you can use the string parsing skills you'll learn later to parse more exotic formats.
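Here's the promised sketch of those last two shortcuts (using `readr_example()` paths as above):
```{r}
#| eval: false
# Only read the first 10,000 rows while you iterate on parsing problems
read_csv(readr_example("challenge.csv"), n_max = 10000)

# Drop down to raw lines when the structure itself is in doubt
read_lines(readr_example("challenge.csv"), n_max = 10)
```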
<!--# TO DO: Write chapter. -->

View File

@@ -12,7 +12,7 @@ status("drafting")
So far you have learned about importing data from plain text files, e.g. `.csv` and `.tsv` files.
Sometimes you need to analyze data that lives in a spreadsheet.
In this chapter we will introduce you to tools for working with data in Excel spreadsheets and Google Sheets.
This will build on much of what you've learned in @sec-data-import and @sec-import-rectangular, but we will also discuss additional considerations and complexities when working with data from spreadsheets.
This will build on much of what you've learned in @sec-data-import but we will also discuss additional considerations and complexities when working with data from spreadsheets.
If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper "Data Organization in Spreadsheets" by Karl Broman and Kara Woo: <https://doi.org/10.1080/00031305.2017.1375989>.
The best practices presented in this paper will save you much headache down the line when you import the data from a spreadsheet into R to analyse and visualise.

View File

@@ -14,7 +14,7 @@ Now it's time to dive into them, learning what makes strings tick, and mastering
We'll begin with the details of creating strings and character vectors.
You'll then dive into creating strings from data, then the opposite; extracting strings from data.
The chapter finishes up with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.
We'll then discuss tools that work with individual letters, and the chapter finishes off with a brief discussion of where your expectations from English might steer you wrong when working with other languages.
We'll keep working with strings in the next chapter, where you'll learn more about the power of regular expressions.
@@ -451,12 +451,8 @@ df |>
## Letters
This section discusses stringr functions that work with individual letters.
This is straightforward for English because it uses an alphabet with 26 letters, but things rapidly get complicated when you move beyond English.
Even languages that use the same alphabet but add additional accents (e.g. å, é, ï, ô, ū) are non-trivial because those letters might be represented as an individual character or by combining an unaccented letter (e.g. e) with a diacritic mark (e.g. ´).
And in other languages, "letters" look quite different: in Japanese each "letter" is a syllable, in Chinese each "letter" is a complex logogram, and in Arabic letters look radically different depending on their location in the word.
In this section, we'll assume that you're working with English text as we introduce you to functions for finding the length of a string, extracting substrings, and handling long strings in plots and tables.
In this section, we'll introduce you to functions that allow you to work with the individual letters within a string.
You'll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.
### Length
@@ -534,58 +530,135 @@ str_view(str_wrap(x, 30))
1. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
2. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
## Locale dependent {#sec-other-languages}
## Non-English text {#sec-other-languages}
There are a handful of stringr functions whose behavior depends on your **locale**.
Locale is similar to language, but includes an optional region specifier to handle the fact that (e.g.) many countries speak Spanish, but with regional variations.
So far, we've focused on English language text which is particularly easy to work with for two reasons.
Firstly, the English alphabet is fairly simple: there are just 26 letters.
Secondly (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers.
Unfortunately we don't have room for a full treatment of non-English languages, but I wanted to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale dependent functions.
### Encoding
When working with non-English text the first challenge is often the **encoding**.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:
```{r}
charToRaw("Hadley")
```
Each of these six hexadecimal numbers represents one letter: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the **American** Standard Code for Information Interchange.
Things aren't so easy for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters.
For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages and Latin2 (aka ISO-8859-2) was used for Central European languages.
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols like emojis.
readr uses UTF-8 everywhere.
This is a good default but will fail for data produced by older systems that don't use UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example here are two inline CSVs with unusual encodings[^strings-8]:
[^strings-8]: Here I'm using the special `\x` to encode binary data directly into a string.
```{r}
#| message: false
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)
x2 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
read_csv(x2)
```
To read these correctly you specify the encoding via the `locale` argument:
```{r}
#| message: false
read_csv(x1, locale = locale(encoding = "Latin1"))
read_csv(x2, locale = locale(encoding = "Shift-JIS"))
```
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
```{r}
guess_encoding(x1)
guess_encoding(x2)
```
Encodings are a rich and complex topic, and we've only scratched the surface here.
If you'd like to learn more we recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Letter variations
If you're working with individual letters (e.g. with `str_length()` and `str_sub()`), there's an important challenge if you're working with a language that has accents, because letters might be represented as an individual character or by combining an unaccented letter (e.g. u) with a diacritic mark (e.g. ¨).
For example, this code shows two ways of representing ü that look identical:
```{r}
u <- c("\u00fc", "u\u0308")
str_view(u)
```
But they have different lengths and the first characters are different:
```{r}
str_length(u)
str_sub(u, 1, 1)
```
Finally, note that these strings compare as different with `==`, so stringr provides the handy `str_equal()` function:
```{r}
u[[1]] == u[[2]]
str_equal(u[[1]], u[[2]])
```
### Locale-dependent functions
Finally, there are a handful of stringr functions whose behavior depends on your **locale**.
A locale is similar to a language, but includes an optional region specifier to handle regional variations within a language.
A locale is specified by a lower-case language abbreviation, optionally followed by a `_` and an upper-case region identifier.
For example, "en" is English, "en_GB" is British English, and "en_US" is American English.
If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported in stringr by looking at `stringi::stri_locale_list()`.
Base R string functions automatically use the locale set by your operating system.
This means that base R string functions usually use the rules associated with your native language, but your code might work differently when you share it with someone who lives in a different country.
To avoid this problem, stringr defaults to the "en" locale, and requires you to specify the `locale` argument to override it.
This also makes it easy to tell if a function might behave differently in different locales.
This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country.
To avoid this problem, stringr defaults to using English rules, by using the "en" locale, and requires you to specify the `locale` argument to override it.
Fortunately there are two sets of functions where the locale really matters: changing case and sorting.
Fortunately there are two sets of functions where the locale matters:
The rules for changing case are not the same in every language.
For example, Turkish has two i's: with and without a dot, and it capitalizes them in a different way to English:
- **Changing case**: the rules for changing case are not the same in every language.
For example, Turkish has two i's: with and without a dot, and it has a different rule to English for capitalizing them:
```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language[^strings-9]!
Here's an example: in Czech, "ch" is a compound letter that appears after `h` in the alphabet.
This also affects `str_equal()`, which can optionally ignore case:
[^strings-9]: Sorting in languages that don't have an alphabet, like Chinese, is more complicated still.
```{r}
str_equal("i", "I", ignore_case = TRUE)
str_equal("i", "I", ignore_case = TRUE, locale = "tr")
```
```{r}
str_sort(c("a", "c", "ch", "h", "z"))
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
```
- **Sorting strings**: `str_sort()` and `str_order()` sort vectors alphabetically, but the alphabet is not the same in every language[^strings-8]!
Here's an example: in Czech, "ch" is a compound letter that appears after `h` in the alphabet.
```{r}
str_sort(c("a", "c", "ch", "h", "z"))
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
```
A similar situation arises in Danish.
Normally, characters with diacritics (e.g. à, á, â) sort after the plain character (e.g. a).
But in Danish ø and å are their own letters that come at the end of the alphabet:
```{r}
str_sort(c("a", "å", "o", "ø", "z"))
str_sort(c("a", "å", "o", "ø", "z"), locale = "da")
```
This also comes up when sorting strings with `dplyr::arrange()` which is why it also has a `locale` argument.
[^strings-8]: Sorting in languages that don't have an alphabet (like Chinese) is more complicated still.
This also comes up when sorting strings with `dplyr::arrange()` which is why it also has a `locale` argument.
## Summary
In this chapter you've learned a wide range of tools for working with strings, but you haven't learned one of the most important and powerful tools: regular expressions.
In this chapter you've learned about some of the power of the stringr package: you learned how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings.
Now it's time to learn one of the most important and powerful tools for working with strings: regular expressions.
Regular expressions are a very concise but expressive language for describing patterns within strings, and are the topic of the next chapter.