Some string tweaking
parent
ec529ef1fa
commit
979289c50b
88
strings.Rmd
|
@ -21,9 +21,17 @@ library(stringr)
|
||||||
|
|
||||||
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
|
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
|
||||||
|
|
||||||
|
For example, if you have
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
|
||||||
|
|
||||||
## String basics
|
## String basics
|
||||||
|
|
||||||
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no different in behaviour.
|
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.
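
As a small illustration of the quoting advice above (the variable names here are made up for the example), both quote styles create the same string, and single quotes on the outside save you from escaping embedded double quotes:

```r
# Both quote styles create an identical string.
double <- "a string"
single <- 'a string'
identical(double, single)   # TRUE

# Single quotes outside avoid escaping the double quotes inside.
quoted <- 'She said "hi" and then "bye"'
writeLines(quoted)          # She said "hi" and then "bye"
```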
|
||||||
|
|
||||||
To include a literal single or double quote in a string, you can use `\` to "escape" it. Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
|
To include a literal single or double quote in a string, you can use `\` to "escape" it. Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
|
||||||
|
|
||||||
|
@ -33,6 +41,17 @@ x
|
||||||
writeLines(x)
|
writeLines(x)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
There are a handful of other special characters. The most commonly used are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`.
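
A minimal sketch of how these escapes behave, using a made-up example string: each escape sequence is typed as two characters but stored as one.

```r
x <- "a\tb\nc"
x               # the printed form shows the escapes: "a\tb\nc"
writeLines(x)   # raw contents: a tab between a and b, then c on a new line
nchar(x)        # 5: each escape counts as a single character
```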
|
||||||
|
|
||||||
|
You'll also sometimes see strings like `"\u00b5"`; this is a way of writing special characters that works on all platforms:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
x <- "\u00b5"
|
||||||
|
x
|
||||||
|
```
|
||||||
|
|
||||||
|
Remember that the representation of a string is different from the string itself.
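
One way to see the difference between a string and its representation is to count characters. This sketch uses base R's `nchar()` on two example values: the escape `"\u00b5"` is a single character, while escaping the backslash produces six literal characters.

```r
x <- "\u00b5"    # a Unicode escape: one character (the micro sign)
nchar(x)         # 1

y <- "\\u00b5"   # an escaped backslash followed by the text u00b5
nchar(y)         # 6
writeLines(y)    # \u00b5
```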
|
||||||
|
|
||||||
### String length
|
### String length
|
||||||
|
|
||||||
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent, and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`)
|
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent, and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`)
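
A short comparison of the two behaviours, assuming stringr is installed; `str_length()` is stringr's counterpart to `nchar()`:

```r
library(stringr)

nchar(NA)        # 2, the inconsistency described above
str_length(NA)   # NA

# Missing values stay missing inside a longer vector too.
lens <- str_length(c("a", "R for data science", NA))
lens             # 1 18 NA
```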
|
||||||
|
@ -61,11 +80,12 @@ str_c("x", "y", sep = ", ")
|
||||||
As with most other functions in R, missing values are infectious. If you want them to print as `"NA"`, use `str_replace_na()`:
|
As with most other functions in R, missing values are infectious. If you want them to print as `"NA"`, use `str_replace_na()`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_c("x", NA, "y")
|
x <- c("abc", NA)
|
||||||
str_c("x", str_replace_na(NA), "y")
|
str_c("|-", x, "-|")
|
||||||
|
str_c("|-", str_replace_na(x), "-|")
|
||||||
```
|
```
|
||||||
|
|
||||||
`str_c()` is vectorised, and it automatically recycles the shortest vectors to the same length as the longest:
|
As shown above, `str_c()` is vectorised, automatically recycling the shortest vectors to the same length as the longest:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_c("prefix-", c("a", "b", "c"), "-suffix")
|
str_c("prefix-", c("a", "b", "c"), "-suffix")
|
||||||
|
@ -108,25 +128,10 @@ x
|
||||||
|
|
||||||
1. In your own words, describe the difference between `sep` and `collapse`.
|
1. In your own words, describe the difference between `sep` and `collapse`.
|
||||||
|
|
||||||
## Regular expressions
|
## Regular expressions basics
|
||||||
|
|
||||||
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories
|
|
||||||
|
|
||||||
* What matches the pattern?
|
|
||||||
* Does a string match a pattern?
|
|
||||||
* How can you replace a pattern with text?
|
|
||||||
* How can you split a string into pieces?
|
|
||||||
|
|
||||||
Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
|
Key to all of these functions are regular expressions. Regular expressions are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
|
||||||
|
|
||||||
```{r}
|
|
||||||
|
|
||||||
```
|
|
||||||
|
|
||||||
Goal is not to be exhaustive, but to give you a solid foundation that allows you to solve a wide variety of problems. We'll point you to more resources where you can learn more about regular expressions.
|
|
||||||
|
|
||||||
### Matching anything and escaping
|
|
||||||
|
|
||||||
Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
|
Regular expressions are not limited to matching fixed strings. You can also use special characters that match patterns. For example, `.` allows you to match any character:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -148,6 +153,14 @@ str_subset(c("abc", "a.c", "bef"), "a\\.c")
|
||||||
|
|
||||||
If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"`: that's right, you need four backslashes to match one!
|
If `\` is used as an escape character, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. And in R that needs to be in a string, so you need to write `"\\\\"`: that's right, you need four backslashes to match one!
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
x <- "a\\b"
|
||||||
|
cat(x, "\n")
|
||||||
|
|
||||||
|
y <- str_replace(x, "\\\\", "-slash-")
|
||||||
|
cat(y, "\n")
|
||||||
|
```
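
To check the two layers of escaping, it can help to look at the pattern string itself. In this sketch (the name `pattern` is chosen for the example), the four typed backslashes become a two-character string, which is the regular expression `\\`:

```r
library(stringr)

pattern <- "\\\\"
nchar(pattern)        # 2: the string holds two backslashes
writeLines(pattern)   # \\  (the regular expression that matches one backslash)

str_detect(c("a\\b", "ab"), pattern)   # TRUE FALSE
```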
|
||||||
|
|
||||||
### Character classes and alternatives
|
### Character classes and alternatives
|
||||||
|
|
||||||
As well as `.` there are a number of other special patterns that match more than one character:
|
As well as `.` there are a number of other special patterns that match more than one character:
|
||||||
|
@ -166,7 +179,7 @@ A similar idea is alternation: `x|y` matches either x or y. Note that the preced
|
||||||
str_detect(c("abc", "xyz"), "abc|xyz")
|
str_detect(c("abc", "xyz"), "abc|xyz")
|
||||||
```
|
```
|
||||||
|
|
||||||
Like with mathematics, if precedence ever gets confusing, use parentheses to make it clear what you want:
|
As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_detect(c("grey", "gray"), "gr(e|a)y")
|
str_detect(c("grey", "gray"), "gr(e|a)y")
|
||||||
|
@ -191,43 +204,58 @@ Note that the precedence of these operators are high, so you write: `colou?r`. T
|
||||||
|
|
||||||
### Anchors
|
### Anchors
|
||||||
|
|
||||||
* `^` match the start of the line
|
Regular expressions can also match things that are not characters. The most important non-character matches are:
|
||||||
* `$` match the end of the line
|
|
||||||
|
|
||||||
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
|
* `^`: the start of the line.
|
||||||
|
* `$`: the end of the line.
|
||||||
|
|
||||||
To force a regular expression to only match a complete string:
|
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
str_detect(c("abcdef", "bcd"), "^bcd$")
|
str_detect(c("abcdef", "bcd"), "^bcd$")
|
||||||
```
|
```
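
Each anchor can also be used on its own. A minimal sketch with a made-up vector of fruit names:

```r
library(stringr)

x <- c("apple", "banana", "pear")
str_subset(x, "^a")   # starts with "a": "apple"
str_subset(x, "a$")   # ends with "a":   "banana"
```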
|
||||||
|
|
||||||
|
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
|
||||||
|
|
||||||
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
|
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
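
The same search can be sketched in R. Note that inside an R string the backslashes must be doubled; the example calls below are invented for illustration:

```r
library(stringr)

x <- c("sum(x)", "summarise(df)", "summary(fit)", "rowsum(m, g)")

# Only the standalone word "sum" has a word boundary on both sides.
str_detect(x, "\\bsum\\b")   # TRUE FALSE FALSE FALSE
```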
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
1. Replace all `/` in a string with `\`.
|
1. Replace all `/` in a string with `\`.
|
||||||
|
|
||||||
## Detecting matches
|
## Regular expression operations
|
||||||
|
|
||||||
|
The stringr package contains functions for working with strings and patterns. We'll focus on four main categories:
|
||||||
|
|
||||||
|
* What matches the pattern?
|
||||||
|
* Does a string match a pattern?
|
||||||
|
* How can you replace a pattern with text?
|
||||||
|
* How can you split a string into pieces?
|
||||||
|
|
||||||
|
### Detecting matches
|
||||||
|
|
||||||
`str_detect()`, `str_subset()`, `str_count()`
|
`str_detect()`, `str_subset()`, `str_count()`
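
A minimal sketch of how the three functions differ, using a made-up vector: `str_detect()` returns a logical vector, `str_subset()` keeps the matching strings, and `str_count()` counts matches per string.

```r
library(stringr)

x <- c("apple", "banana", "pear")
str_detect(x, "an")   # FALSE TRUE FALSE
str_subset(x, "an")   # "banana"
str_count(x, "a")     # 1 3 1
```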
|
||||||
|
|
||||||
## Extracting matches
|
### Extracting matches
|
||||||
|
|
||||||
`str_extract()`, `str_extract_all()`
|
`str_extract()`, `str_extract_all()`
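
As a rough illustration with invented input, `str_extract()` returns the first match in each string (or `NA` if there is none), while `str_extract_all()` returns a list with every match:

```r
library(stringr)

x <- c("apple 12, pear 7", "no numbers")

first <- str_extract(x, "[0-9]+")
first    # "12" NA

every <- str_extract_all(x, "[0-9]+")
every    # a list: c("12", "7") and character(0)
```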
|
||||||
|
|
||||||
### Groups
|
### Extracting grouped matches
|
||||||
|
|
||||||
`str_match()`, `str_match_all()`
|
`str_match()`, `str_match_all()`
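
A small sketch of `str_match()` on invented `key=value` strings: it returns a character matrix with the complete match in the first column and one further column per parenthesised group.

```r
library(stringr)

x <- c("key=value", "size=10", "no pair")
m <- str_match(x, "(\\w+)=(\\w+)")
m         # 3 x 3 matrix; the last row is all NA because nothing matched

m[, 2]    # first group:  "key"   "size" NA
m[, 3]    # second group: "value" "10"   NA
```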
|
||||||
|
|
||||||
## Replacing patterns
|
### Replacing patterns
|
||||||
|
|
||||||
`str_replace()`, `str_replace_all()`
|
`str_replace()`, `str_replace_all()`
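
The difference between the two is easiest to see side by side. In this sketch, `str_replace()` changes only the first match in each string, and `str_replace_all()` changes every match:

```r
library(stringr)

x <- c("apple", "banana")
str_replace(x, "a", "-")       # "-pple" "b-nana"
str_replace_all(x, "a", "-")   # "-pple" "b-n-n-"
```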
|
||||||
|
|
||||||
## Splitting
|
### Splitting
|
||||||
|
|
||||||
`str_split()`, `str_split_fixed()`.
|
`str_split()`, `str_split_fixed()`.
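
A minimal sketch on a made-up comma-separated string: `str_split()` returns a list with one character vector per input string, while `str_split_fixed()` returns a matrix with exactly `n` columns.

```r
library(stringr)

x <- "a,b,c"

pieces <- str_split(x, ",")
pieces[[1]]   # "a" "b" "c"

fixed <- str_split_fixed(x, ",", n = 2)
fixed         # 1 x 2 matrix: "a" and "b,c"
```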
|
||||||
|
|
||||||
|
### Finding locations
|
||||||
|
|
||||||
|
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern, and `str_sub()` to extract and/or modify it.
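
The locate-then-modify workflow might look like the following sketch; the input string is invented for the example:

```r
library(stringr)

x <- "banana bread"

# str_locate() returns a matrix with "start" and "end" columns.
loc <- str_locate(x, "an")
loc                                        # start 2, end 3

str_sub(x, loc[, "start"], loc[, "end"])   # "an"

# The assignment form of str_sub() modifies the located piece in place.
str_sub(x, loc[, "start"], loc[, "end"]) <- "AN"
x                                          # "bANana bread"
```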
|
||||||
|
|
||||||
## Other types of pattern
|
## Other types of pattern
|
||||||
|
|
||||||
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the
|
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`. Sometimes it's useful to call it explicitly so you can control the
|
||||||
|
|