diff --git a/strings.Rmd b/strings.Rmd
index fd884ad..2861eb8 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -6,6 +6,7 @@ output: bookdown::html_chapter
 
 ```{r setup, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
+library(stringr)
 ```
 
 # String manipulation
@@ -14,6 +15,10 @@ When working with text data, one of the most powerful tools at your disposal is
 
 In this chapter, you'll learn the basics of regular expressions using the stringr package.
 
+```{r}
+library(stringr)
+```
+
 The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
 
 ## String basics
@@ -28,56 +33,153 @@ x
 writeLines(x)
 ```
 
+### String length
+
 Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent, and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`)
 
 ```{r}
 # (Will be fixed in R 3.3.0)
 nchar(NA)
-
-stringr::str_length(NA)
+str_length(NA)
 ```
 
-## Introduction to stringr
+### Combining strings
+
+To combine two or more strings, use `str_c()`:
 
 ```{r}
-library(stringr)
+str_c("x", "y")
+str_c("x", "y", "z")
 ```
 
-The stringr package contains functions for working with strings and patterns. We'll focus on three:
+Use the `sep` argument to control how they're separated:
 
-* `str_detect(string, pattern)`: does string match a pattern?
-* `str_extract(string, pattern)`: extact matching pattern from string
-* `str_replace(string, pattern, replacement)`: replace pattern with replacement
-* `str_split(string, pattern)`.
+```{r}
+str_c("x", "y", sep = ", ")
+```
 
-## Extracting patterns
+Like most other functions in R, missing values are infectious.
If you want them to print as `NA`, use `str_replace_na()`:
 
-## Introduction to regular expressions
+```{r}
+str_c("x", NA, "y")
+str_c("x", str_replace_na(NA), "y")
+```
 
-Goal is not to be exhaustive.
+`str_c()` is vectorised, and it automatically recycles the shortest vectors to the same length as the longest:
 
-### Character classes and alternative
+```{r}
+str_c("prefix-", c("a", "b", "c"), "-suffix")
+```
 
-* `.`: any character
-* `\d`: a digit
-* `\s`: whitespace
+To collapse vectors into a single string, use `collapse`:
 
-* `x|y`: match x or y
+```{r}
+str_c(c("x", "y", "z"), collapse = ", ")
+```
 
+When creating strings you might also find `str_pad()` and `str_dup()` useful:
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_pad(x, 10)
+
+str_c("Na ", str_dup("na ", 4), "batman!")
+```
+
+### Subsetting strings
+
+You can extract parts of a string using `str_sub()`:
+
+```{r}
+x <- c("apple", "banana", "pear")
+str_sub(x, 1, 3)
+# negative numbers count backwards from end
+str_sub(x, -3, -1)
+```
+
+You can also use `str_sub()` to modify strings:
+
+```{r}
+str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
+x
+```
+
+### Exercises
+
+1. In your own words, describe the difference between `sep` and `collapse`.
+
+## Regular expressions
+
+The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
+
+* What matches the pattern?
+* Does a string match a pattern?
+* How can you replace a pattern with text?
+* How can you split a string into pieces?
+
+Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
+
+```{r}
+
+```
+
+The goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems.
We'll point you to more resources where you can learn more about regular expressions.
+
+### Matching anything and escaping
+
+Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
+
+```{r}
+str_subset(c("abc", "adc", "bef"), "a.c")
+```
+
+But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use the string `"\\."`.
+
+```{r}
+# To create the regular expression, we need \\
+dot <- "\\."
+
+# But the expression itself only contains one:
+cat(dot, "\n")
+
+# And this tells R to look for an explicit .
+str_subset(c("abc", "a.c", "bef"), "a\\.c")
+```
+
+If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"` - that's right, you need four backslashes to match one!
+
+### Character classes and alternatives
+
+As well as `.` there are a number of other special patterns that match more than one character:
+
+* `\d`: any digit
+* `\s`: any whitespace (space, tab, newline)
 * `[abc]`: match a, b, or c
 * `[a-e]`: match any character between a and e
 * `[^abc]`: match anything except a, b, or c
 
-### Escaping
+Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
 
-You may have noticed that since `.` is a special regular expression character, you'll need to escape `.`
+A similar idea is alternation: `x|y` matches either x or y.
Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
+
+```{r}
+str_detect(c("abc", "xyz"), "abc|xyz")
+```
+
+As with mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
+
+```{r}
+str_detect(c("grey", "gray"), "gr(e|a)y")
+str_detect(c("grey", "gray"), "gr(?:e|a)y")
+```
+
+Unfortunately, parentheses have some other side-effects in regular expressions, which we'll learn about later. Technically, the parentheses you should use are `(?:)`, which are called non-capturing parentheses. Most of the time this won't make any difference so it's easier to use `()`, but it is sometimes helpful to be aware of `(?:)`.
 
 ### Repetition
 
 * `?`: 0 or 1
 * `+`: 1 or more
 * `*`: 0 or more
-
 * `{n}`: exactly n
 * `{n,}`: n or more
 * `{,m}`: at most m
@@ -85,17 +187,34 @@ You may have noticed that since `.` is a special regular expression character, y
 
 (By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
 
+Note that the precedence of these operators is high, so in `colou?r` only the `u` is optional. That means you'll need to use parentheses for many uses: `bana(na)+` or `ba(na){2,}`.
+
 ### Anchors
 
 * `^` match the start of the line
 * `$` match the end of the line
-* `\b` match boundary between words
 
 My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
 
+To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
+
+```{r}
+str_detect(c("abcdef", "bcd"), "^bcd$")
+```
+
+You can also match the boundary between words with `\b`. I don't often use this in R, but I sometimes use it when doing a find all in RStudio to find the name of a function that's a component of other function names.
For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
+
+### Exercises
+
+1. Replace all `/` in a string with `\`.
 
 ## Detecting matches
 
+`str_detect()`, `str_subset()`, `str_count()`
+
+## Extracting matches
+
+`str_extract()`, `str_extract_all()`
 
 ### Groups
 
@@ -103,8 +222,54 @@ My favourite mneomic for rememember which is which (from [Evan Misshula](https:/
 
 ## Replacing patterns
 
+`str_replace()`, `str_replace_all()`
+
+## Splitting
+
+`str_split()`, `str_split_fixed()`.
+
 ## Other types of pattern
 
-* `fixed()`
-* `coll()`
-* `boundary()`
+When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the details of the match. Other functions create other types of pattern:
+
+* `fixed()`: matches exactly that sequence of characters (i.e. it ignores
+  all special regular expression patterns).
+
+* `coll()`: compares strings using standard **coll**ation rules. This is
+  useful for doing case-insensitive matching. Note that `coll()` takes a
+  `locale` parameter that controls which rules are used for comparing
+  characters. Unfortunately, different parts of the world use different rules!
+
+```{r}
+# Turkish has two i's: with and without a dot, and it
+# has a different rule for capitalising them:
+str_to_upper(c("i", "ı"))
+str_to_upper(c("i", "ı"), locale = "tr")
+
+# That means you also need to be aware of the difference
+# when doing case-insensitive matches:
+i <- c("I", "İ", "i", "ı")
+i
+
+str_subset(i, fixed("i", ignore_case = TRUE))
+str_subset(i, coll("i", ignore_case = TRUE))
+str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
+```
+
+## Other uses of regular expressions
+
+There are a few other functions in base R that accept regular expressions:
+
+* `apropos()` searches all objects available from the global environment. This
+  is useful if you can't quite remember the name of the function.
+
+* `ls()` is similar to `apropos()` but only works in the current
+  environment.
However, if you have so many objects in your environment
+  that you have to use a regular expression to filter them all, you
+  need to think about what you're doing! (And probably use a list instead).
+
+* `dir()` lists all the files in a directory. The `pattern` argument takes
+  a regular expression and only returns file names that match the pattern.
+  For example, you can find all CSV files with `dir(pattern = "\\.csv$")`.
+  (If you're more comfortable with "globs" like `*.csv`, you can convert
+  them to regular expressions with `glob2rx()`.)
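+
+As a rough sketch of how `dir()`, `glob2rx()` and `apropos()` behave, the chunk below creates a few throwaway files in a temporary directory and filters them with a regular expression. The directory name `regex-demo` and the file names are made up for illustration; only base R functions are used.
+
+```{r}
+# Hypothetical scratch directory and file names, used only for this sketch
+tmp <- file.path(tempdir(), "regex-demo")
+dir.create(tmp, showWarnings = FALSE)
+invisible(file.create(file.path(tmp, c("a.csv", "b.csv", "notes.txt", "csv-readme.md"))))
+
+# Keep only the names that end in a literal ".csv"
+dir(tmp, pattern = "\\.csv$")
+
+# glob2rx() translates a glob into a regular expression,
+# which can then be passed to the pattern argument
+glob2rx("*.csv")
+dir(tmp, pattern = glob2rx("*.csv"))
+
+# apropos() matches a regular expression against object names
+apropos("^nchar$")
+```
+
+Note that `csv-readme.md` is not returned by either `dir()` call: the escaped `\\.` and the `$` anchor together require the name to end in `.csv`.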