More about strings

2015-10-22 13:17:00 -05:00 · 2015-10-22 13:17:00 -05:00 · ec529ef1fa
parent 88626be626
commit ec529ef1fa
1 changed files with 189 additions and 24 deletions
--- a/strings.Rmd
+++ b/strings.Rmd
@@ -6,6 +6,7 @@ output: bookdown::html_chapter
 ```{r setup, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
 library(stringr)
 ```
 # String manipulation
@@ -14,6 +15,10 @@ When working with text data, one of the most powerful tools at your disposal is
 In this chapter, you'll learn the basics of regular expressions using the stringr package. 
 ```{r}
 library(stringr)
 ```
 The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
 ## String basics
@@ -28,56 +33,153 @@ x
 writeLines(x)
 ```
 ### String length
 Base R contains many functions to work with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`):
 ```{r}
 # (Will be fixed in R 3.3.0)
 nchar(NA)
-
+str_length(NA)
 stringr::str_length(NA)
 ```
-## Introduction to stringr
+### Combining strings
 To combine two or more strings, use `str_c()`:
 ```{r}
-library(stringr)
+str_c("x", "y")
 str_c("x", "y", "z")
 ```
-The stringr package contains functions for working with strings and patterns. We'll focus on three:
+Use the `sep` argument to control how they're separated:
-* `str_detect(string, pattern)`: does string match a pattern?
+```{r}
-* `str_extract(string, pattern)`: extact matching pattern from string
+str_c("x", "y", sep = ", ")
-* `str_replace(string, pattern, replacement)`: replace pattern with replacement
+```
 * `str_split(string, pattern)`.
-## Extracting patterns
+As with most other functions in R, missing values are infectious: combining a string with `NA` gives `NA`. If you want them to print as `"NA"`, use `str_replace_na()`:
-## Introduction to regular expressions
+```{r}
 str_c("x", NA, "y")
 str_c("x", str_replace_na(NA), "y")
 ```
-Goal is not to be exhaustive.
+`str_c()` is vectorised, and it automatically recycles the shortest vectors to the same length as the longest:
-### Character classes and alternative
+```{r}
 str_c("prefix-", c("a", "b", "c"), "-suffix")
 ```
-* `.`: any character
+To collapse vectors into a single string, use `collapse`:
 * `\d`: a digit
 * `\s`: whitespace
-* `x|y`: match x or y
+```{r}
 str_c(c("x", "y", "z"), collapse = ", ")
 ```
 When creating strings you might also find `str_pad()` and `str_dup()` useful:
 ```{r}
 x <- c("apple", "banana", "pear")
 str_pad(x, 10)
 str_c("Na ", str_dup("na ", 4), "batman!") 
 ```
 ### Subsetting strings
 You can extract parts of a string using `str_sub()`:
 ```{r}
 x <- c("apple", "banana", "pear")
 str_sub(x, 1, 3)
 # negative numbers count backwards from end
 str_sub(x, -3, -1)
 ```
 You can also use `str_sub()` to modify strings:
 ```{r}
 str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
 x
 ```
 ### Exercises
 1.  In your own words, describe the difference between `sep` and `collapse`.
 ## Regular expressions
 The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
 * What matches the pattern?
 * Does a string match a pattern? 
 * How can you replace a pattern with text?
 * How can you split a string into pieces?
 Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
 ```{r}
 ```
 The goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to more resources where you can learn more about regular expressions.
 ### Matching anything and escaping
 Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character (except a newline):
 ```{r}
 str_subset(c("abc", "adc", "bef"), "a.c")
 ```
 But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
 ```{r}
 # To create the regular expression, we need \\
 dot <- "\\."
 # But the expression itself only contains one:
 cat(dot, "\n")
 # And this tells R to look for explicit .
 str_subset(c("abc", "a.c", "bef"), "a\\.c")
 ```
 If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"`. That's right, you need four backslashes to match one!
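 As a small sketch of the four-backslash rule (the string `x` is a made-up example, not data from this chapter):
 ```{r}
 library(stringr)
 # x holds three characters: a, a single backslash, and b
 x <- "a\\b"
 writeLines(x)
 # The regular expression \\ is written "\\\\" as an R string
 has_backslash <- str_detect(x, "\\\\")
 n_backslash <- str_count(x, "\\\\")
 has_backslash  # TRUE
 n_backslash    # 1
 ```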
 ### Character classes and alternatives
 As well as `.`, there are a number of other special patterns that match any one of a set of characters:
 * `\d`: any digit
 * `\s`: any whitespace (space, tab, newline)
 * `[abc]`: match a, b, or c
 * `[a-e]`: match any character between a and e
 * `[^abc]`: match anything except a, b, or c
-### Escaping
+Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
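 A short sketch of these classes in use, assuming a made-up vector `x`. Note that a negated class is written with `^` inside the brackets:
 ```{r}
 library(stringr)
 x <- c("a1", "b 2", "cab", "xyz")
 # Does each string contain a digit, or any whitespace?
 has_digit <- str_detect(x, "\\d")
 has_space <- str_detect(x, "\\s")
 # First character that is a, b, or c (NA when there is none)
 first_abc <- str_extract(x, "[abc]")
 # First character that is NOT a, b, or c
 first_other <- str_extract(x, "[^abc]")
 has_digit    # TRUE TRUE FALSE FALSE
 has_space    # FALSE TRUE FALSE FALSE
 first_abc    # "a" "b" "c" NA
 first_other  # "1" " " NA "x"
 ```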
-You may have noticed that since `.` is a special regular expression character, you'll need to escape `.`
+A similar idea is alternation: `x|y` matches either x or y. Note that the precedence for `|` is low, so `abc|xyz` matches either `abc` or `xyz`, not `abcyz` or `abxyz`:
 ```{r}
 str_detect(c("abc", "xyz"), "abc|xyz")
 ```
 As in mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
 ```{r}
 str_detect(c("grey", "gray"), "gr(e|a)y")
 str_detect(c("grey", "gray"), "gr(?:e|a)y")
 ```
 Unfortunately, parentheses have some other side-effects in regular expressions, which we'll learn about later. Technically, the parentheses you should use are `(?:)`, which are called non-capturing parentheses. Most of the time this won't make any difference, so it's easier to use `()`, but it's sometimes helpful to be aware of `(?:)`.
 ### Repetition
 * `?`: 0 or 1
 * `+`: 1 or more
 * `*`: 0 or more
 * `{n}`: exactly n
 * `{n,}`: n or more
 * `{0,m}`: at most m
 * `{n,m}`: between n and m
@@ -85,17 +187,34 @@ You may have noticed that since `.` is a special regular expression character, y
 (By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them.)
 Note that the precedence of these operators is high, so you can write `colou?r` to match either the American or the British spelling. That means you'll need to use parentheses for many uses: `bana(na)+` or `ba(na){2,}`.
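 The following sketch illustrates the repetition operators and the difference between greedy and lazy matching. The input strings are invented for illustration:
 ```{r}
 library(stringr)
 # ? applies only to the u, so both spellings match
 spelling <- str_detect(c("color", "colour", "colr"), "colou?r")
 # Parentheses make + apply to the whole group "na"
 repeated <- str_extract("bananana", "bana(na)+")
 # Greedy: take as much as possible; lazy (trailing ?): as little as possible
 greedy <- str_extract("<a><b>", "<.+>")
 lazy <- str_extract("<a><b>", "<.+?>")
 spelling  # TRUE TRUE FALSE
 repeated  # "bananana"
 greedy    # "<a><b>"
 lazy      # "<a>"
 ```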
 ### Anchors
 * `^` match the start of the line
 * `$` match the end of the line
 * `\b` match boundary between words
 My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
 To force a regular expression to only match a complete string:
 ```{r}
 str_detect(c("abcdef", "bcd"), "^bcd$")
 ```
 You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
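 The same `\bsum\b` search can be sketched in R. The backslashes are doubled because the pattern lives in a string, and the candidate names are just examples:
 ```{r}
 library(stringr)
 candidates <- c("sum", "summarise", "rowsum", "sum(x)")
 # \b only matches where a word character meets a non-word character
 # (or the start or end of the string)
 whole_word <- str_detect(candidates, "\\bsum\\b")
 whole_word  # TRUE FALSE FALSE TRUE
 ```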
 ### Exercises
 1.   Replace all `/` in a string with `\`.
 ## Detecting matches
 `str_detect()`, `str_subset()`, `str_count()`
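 A minimal sketch of how these three functions differ, using a made-up vector `x`:
 ```{r}
 library(stringr)
 x <- c("apple", "banana", "pear")
 # str_detect() returns one TRUE/FALSE per string
 detected <- str_detect(x, "an")
 # str_subset() keeps only the strings that match
 kept <- str_subset(x, "an")
 # str_count() counts the matches within each string
 n_a <- str_count(x, "a")
 detected  # FALSE TRUE FALSE
 kept      # "banana"
 n_a       # 1 3 1
 ```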
 ## Extracting matches
 `str_extract()`, `str_extract_all()`
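 As a rough illustration (the vector `x` is invented), `str_extract()` returns the first match in each string, while `str_extract_all()` returns a list holding every match:
 ```{r}
 library(stringr)
 x <- c("apple", "banana", "pear")
 # First vowel in each string
 first_vowel <- str_extract(x, "[aeiou]")
 # All vowels in each string, as a list with one element per input
 all_vowels <- str_extract_all(x, "[aeiou]")
 first_vowel  # "a" "a" "e"
 all_vowels
 ```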
 ### Groups
@@ -103,8 +222,54 @@ My favourite mneomic for rememember which is which (from [Evan Misshula](https:/
 ## Replacing patterns
 `str_replace()`, `str_replace_all()`
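 A small sketch of the difference between the two, using an example string chosen for illustration: `str_replace()` changes only the first match, `str_replace_all()` changes every match.
 ```{r}
 library(stringr)
 # Replace the first "a" only
 first_only <- str_replace("banana", "a", "o")
 # Replace every "a"
 every_match <- str_replace_all("banana", "a", "o")
 first_only   # "bonana"
 every_match  # "bonono"
 ```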
 ## Splitting
 `str_split()`, `str_split_fixed()`.
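 As a brief sketch with an invented input, `str_split()` returns a list (one element per input string), whereas `str_split_fixed()` returns a character matrix with a fixed number of columns:
 ```{r}
 library(stringr)
 # Split on every "-"; the result is a list, so take the first element
 pieces <- str_split("a-b-c", "-")[[1]]
 # Split into at most 2 pieces; the last piece keeps the remainder
 pieces_fixed <- str_split_fixed("a-b-c", "-", 2)
 pieces        # "a" "b" "c"
 pieces_fixed  # a 1 x 2 matrix: "a" "b-c"
 ```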
 ## Other types of pattern
-* `fixed()`
+When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the 
-* `coll()`
+
-* `boundary()`
+* `fixed()`: matches exactly that sequence of characters (i.e. ignores
  all special regular expression patterns).
 * `coll()`: compare strings using standard **coll**ation rules. This is 
  useful for doing case insensitive matching. Note that `coll()` takes a
  `locale` parameter that controls which rules are used for comparing
  characters. Unfortunately different parts of the world use different rules!
 ```{r}
 # Turkish has two i's: with and without a dot, and it
 # has a different rule for capitalising them:
 str_to_upper(c("i", "ı"))
 str_to_upper(c("i", "ı"), locale = "tr")
 # That means you also need to be aware of the difference
 # when doing case insensitive matches:
 i <- c("I", "İ", "i", "ı")
 i
 str_subset(i, fixed("i", TRUE))
 str_subset(i, coll("i", TRUE))
 str_subset(i, coll("i", TRUE, locale = "tr"))
 ```
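 To make the behaviour of `fixed()` concrete, here is a minimal sketch comparing it with the default regular expression interpretation. The two test strings are made up:
 ```{r}
 library(stringr)
 x <- c("a.c", "abc")
 # As a regular expression, . matches any character, so both strings match
 as_regex <- str_detect(x, "a.c")
 # With fixed(), . is just a literal dot, so only "a.c" matches
 as_fixed <- str_detect(x, fixed("a.c"))
 as_regex  # TRUE TRUE
 as_fixed  # TRUE FALSE
 ```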
 ## Other uses of regular expressions
 There are a few other functions in base R that accept regular expressions:
 *   `apropos()` searches all objects available from the global environment. This
    is useful if you can't quite remember the name of the function.
 *   `ls()` is similar to `apropos()` but only works in the current 
    environment. However, if you have so many objects in your environment
    that you have to use a regular expression to filter them all, you 
    need to think about what you're doing! (And probably use a list instead).
 *   `dir()` lists all the files in a directory. The `pattern` argument takes
    a regular expression and only returns file names that match the pattern.
    For example, you can find all csv files with `dir(pattern = "\\.csv$")`.
    (If you're more comfortable with "globs" like `*.csv`, you can convert
    them to regular expressions with `glob2rx()`.)
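 As a sketch of the glob conversion, the following uses a hypothetical vector of file names instead of a real directory, so that it runs anywhere:
 ```{r}
 # glob2rx() lives in the utils package, which is attached by default
 rx <- glob2rx("*.csv")
 rx  # "^.*\\.csv$"
 # Hypothetical file names, standing in for the output of dir()
 files <- c("data.csv", "notes.txt", "csv_backup.zip")
 matched <- grepl(rx, files)
 matched  # TRUE FALSE FALSE
 ```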