diff --git a/strings.Rmd b/strings.Rmd
index 2861eb8..793e97a 100644
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -21,9 +21,17 @@ library(stringr)
 
 The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
 
+For example, if you have
+
+```{r}
+
+```
+
+The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
+
 ## String basics
 
-In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no different in behaviour.
+In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.
 
 To include a literal single or double quote in a string you can use `\` to "escape". Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
 
@@ -33,6 +41,17 @@ x
 writeLines(x)
 ```
 
+There are a handful of other special characters. The most commonly used are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`.
+
+You'll also sometimes see strings like `"\u00b5"`; this is a way of writing special characters that works on all platforms:
+
+```R
+x <- "\u00b5"
+x
+```
+
+Remember that the representation of a string is different from the string itself.
+
 ### String length
 
 Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent, and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`)
@@ -61,11 +80,12 @@ str_c("x", "y", sep = ", ")
 Like most other functions in R, missing values are infectious. If you want them to print as `NA`, use `str_replace_na()`:
 
 ```{r}
-str_c("x", NA, "y")
-str_c("x", str_replace_na(NA), "y")
+x <- c("abc", NA)
+str_c("|-", x, "-|")
+str_c("|-", str_replace_na(x), "-|")
 ```
 
-`str_c()` is vectorised, and it automatically recycles the shortest vectors to the same length as the longest:
+As shown above, `str_c()` is vectorised, automatically recycling shorter vectors to the same length as the longest:
 
 ```{r}
 str_c("prefix-", c("a", "b", "c"), "-suffix")
@@ -108,25 +128,10 @@ x
 
 1. In your own words, describe the difference between `sep` and `collapse`.
 
-## Regular expressions
-
-The stringr package contains functions for working with strings and patterns. We'll focus on four main categories
-
-* What matches the pattern?
-* Does a string match a pattern?
-* How can you replace a pattern with text?
-* How can you split a string into pieces?
+## Regular expression basics
 
 Key to all of these functions are regular expressions. Regular expressions are a very terse language that allow to describe patterns in string. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
 
-```{r}
-
-```
-
-Goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to more resources where you can learn more about regular expresssions.
-
-### Matching anything and escaping
-
 Regular expression are not limited to matching fixed string. You can also use special characters that match patterns.
For example, `.` allows you to match any character:
 
 ```{r}
@@ -148,6 +153,14 @@ str_subset(c("abc", "a.c", "bef"), "a\\.c")
 
 If `\` is used an escape character, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"` - that's right, you need four backslashes to match one!
 
+```{r}
+x <- "a\\b"
+cat(x, "\n")
+
+y <- str_replace(x, "\\\\", "-slash-")
+cat(y, "\n")
+```
+
 ### Character classes and alternatives
 
 As well as `.` there are a number of other special patterns that match more than one character:
@@ -166,7 +179,7 @@ A similar idea is alternation: `x|y` matches either x or y. Note that the preced
 str_detect(c("abc", "xyz"), "abc|xyz")
 ```
 
-Like with mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
+Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
 
 ```{r}
 str_detect(c("grey", "gray"), "gr(e|a)y")
@@ -191,42 +204,57 @@ Note that the precedence of these operators are high, so you write: `colou?r`. T
 
 ### Anchors
 
-* `^` match the start of the line
-* `*` match the end of the line
+Regular expressions can also match things that are not characters. The most important non-character matches are:
 
-My favourite mneomic for rememember which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): begin with power (`^`), end with money (`$`).
+* `^`: the start of the string.
+* `$`: the end of the string.
 
-To force a regular expression to only match a complete string:
+To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
 
 ```{r}
 str_detect(c("abcdef", "bcd"), "^bcd$")
 ```
 
+My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
+
 You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
 
 ### Exercises
 
 1. Replace all `/` in a string with `\`.
 
-## Detecting matches
+## Regular expression operations
+
+The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
+
+* What matches the pattern?
+* Does a string match a pattern?
+* How can you replace a pattern with text?
+* How can you split a string into pieces?
+
+### Detecting matches
 
 `str_detect()`, `str_subset()`, `str_count()`
 
-## Extracting matches
+### Extracting matches
 
 `str_extract()`, `str_extract_all()`
 
-### Groups
+### Extracting grouped matches
 
 `str_match()`, `str_match_all()`
 
-## Replacing patterns
+### Replacing patterns
 
 `str_replace()`, `str_replace_all()`
 
-## Splitting
+### Splitting
 
-`str_split()`, `str_split_fixed()`.
+`str_split()`, `str_split_fixed()`.
+
+### Finding locations
+
+`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern, and `str_sub()` to extract and/or modify it.
 
 ## Other types of pattern
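
The operation subsections in the last hunk list function names without examples yet. As a rough sketch of what each family does, here is one call per category. It assumes stringr is installed, and the vector `x` and its contents are made up for illustration rather than taken from the chapter:

```r
library(stringr)

# Hypothetical example data, not from the chapter
x <- c("apple pie", "banana split", "cherry tart")

# Detecting matches: does each string contain "an"?
detected <- str_detect(x, "an")

# Extracting matches: the first vowel in each string
first_vowel <- str_extract(x, "[aeiou]")

# Replacing patterns: swap the first space for an underscore
replaced <- str_replace(x, " ", "_")

# Splitting: break each string into words (returns a list)
words <- str_split(x, " ")

# Finding locations: start/end of the first "an", then pull it out
# with str_sub(); strings with no match give NA
loc <- str_locate(x, "an")
located <- str_sub(x, loc[, "start"], loc[, "end"])
```

Only the second string contains `"an"`, so `detected` is `FALSE TRUE FALSE` and `located` is `NA "an" NA`, which shows how missing matches propagate as `NA` through `str_locate()` and `str_sub()`.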