Start on strings
parent
da42f0d571
commit
88626be626
|
@ -22,7 +22,7 @@ install:
|
||||||
|
|
||||||
# Install R packages
|
# Install R packages
|
||||||
- ./travis-tool.sh r_binary_install knitr png
|
- ./travis-tool.sh r_binary_install knitr png
|
||||||
- ./travis-tool.sh r_install ggplot2 dplyr tidyr pryr
|
- ./travis-tool.sh r_install ggplot2 dplyr tidyr pryr stringr
|
||||||
- ./travis-tool.sh github_package hadley/bookdown garrettgman/DSR hadley/readr
|
- ./travis-tool.sh github_package hadley/bookdown garrettgman/DSR hadley/readr
|
||||||
|
|
||||||
script: jekyll build
|
script: jekyll build
|
||||||
|
|
|
@ -3,8 +3,8 @@
|
||||||
<li><a href="visualize.html">Visualize</a></li>
|
<li><a href="visualize.html">Visualize</a></li>
|
||||||
-->
|
-->
|
||||||
<li><a href="transform.html">Transform</a></li>
|
<li><a href="transform.html">Transform</a></li>
|
||||||
|
<li><a href="strings.html">String manipulation</a></li>
|
||||||
<!--
|
<!--
|
||||||
<li><a href="strings.html">Regular expresssions</a></li>
|
|
||||||
<li><a href="dates.html">Dates and times</a></li>
|
<li><a href="dates.html">Dates and times</a></li>
|
||||||
-->
|
-->
|
||||||
<li><a href="tidy.html">Tidy</a></li>
|
<li><a href="tidy.html">Tidy</a></li>
|
||||||
|
|
|
@ -0,0 +1,110 @@
|
||||||
|
---
|
||||||
|
layout: default
|
||||||
|
title: String manipulation
|
||||||
|
output: bookdown::html_chapter
|
||||||
|
---
|
||||||
|
|
||||||
|
```{r setup, include=FALSE}
|
||||||
|
knitr::opts_chunk$set(echo = TRUE)
|
||||||
|
```
|
||||||
|
|
||||||
|
# String manipulation
|
||||||
|
|
||||||
|
When working with text data, one of the most powerful tools at your disposal is regular expressions. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely.
|
||||||
|
|
||||||
|
In this chapter, you'll learn the basics of regular expressions using the stringr package.
|
||||||
|
|
||||||
|
The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
|
||||||
|
|
||||||
|
## String basics
|
||||||
|
|
||||||
|
In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour.
|
||||||
|
|
||||||
|
To include a literal single or double quote in a string, you can use `\` to "escape" it. Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
x <- c("\"", "\\")
|
||||||
|
x
|
||||||
|
writeLines(x)
|
||||||
|
```
|
||||||
|
|
||||||
|
Base R contains many functions to work with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`).
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
# (Will be fixed in R 3.3.0)
|
||||||
|
nchar(NA)
|
||||||
|
|
||||||
|
stringr::str_length(NA)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Introduction to stringr
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
library(stringr)
|
||||||
|
```
|
||||||
|
|
||||||
|
The stringr package contains functions for working with strings and patterns. We'll focus on four:
|
||||||
|
|
||||||
|
* `str_detect(string, pattern)`: does string match a pattern?
|
||||||
|
* `str_extract(string, pattern)`: extract matching pattern from string
|
||||||
|
* `str_replace(string, pattern, replacement)`: replace pattern with replacement
|
||||||
|
* `str_split(string, pattern)`: split string into pieces at each match of pattern
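
As a rough sketch of how these functions behave, here is each one applied to a small vector. The `fruit` vector and the patterns are made up for illustration; they are not part of the chapter.

```{r}
library(stringr)

# Hypothetical example data
fruit <- c("apple", "banana", "pear")

str_detect(fruit, "an")         # does each string contain "an"?
str_extract(fruit, "[aeiou]")   # first vowel in each string (NA if none)
str_replace(fruit, "a", "A")    # replaces only the first "a" in each string
str_split("a,b,c", ",")         # returns a list, one element per input string
```

Note that `str_split()` returns a list rather than a character vector, because each input string can be split into a different number of pieces.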
|
||||||
|
|
||||||
|
## Extracting patterns
|
||||||
|
|
||||||
|
## Introduction to regular expressions
|
||||||
|
|
||||||
|
The goal is not to be exhaustive.
|
||||||
|
|
||||||
|
### Character classes and alternatives
|
||||||
|
|
||||||
|
* `.`: any character
|
||||||
|
* `\d`: a digit
|
||||||
|
* `\s`: whitespace
|
||||||
|
|
||||||
|
* `x|y`: match x or y
|
||||||
|
|
||||||
|
* `[abc]`: match a, b, or c
|
||||||
|
* `[a-e]`: match any character between a and e
|
||||||
|
* `[^abc]`: match anything except a, b, or c
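
Here is a small sketch of these patterns in use. The input vector `x` is invented for illustration. Inside an R string the backslash itself must be escaped, so `\d` is written `"\\d"`, and negation inside square brackets is written with a leading `^`.

```{r}
library(stringr)

# Hypothetical inputs
x <- c("a1", "b 2", "cat", "dog")

str_detect(x, "\\d")       # contains a digit?
str_detect(x, "\\s")       # contains whitespace?
str_detect(x, "cat|dog")   # contains "cat" or "dog"?
str_extract(x, "[abc]")    # first a, b, or c (NA if there is none)
str_extract(x, "[^abc]")   # first character that is not a, b, or c
```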
|
||||||
|
|
||||||
|
### Escaping
|
||||||
|
|
||||||
|
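You may have noticed that `.` is a special regular expression character, so to match a literal `.` you need to escape it as `\.`. Because `\` is also the escape character in R strings, you write that regular expression as the string `"\\."`.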
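
A minimal sketch of the difference, using two made-up strings: the unescaped pattern matches both, while the escaped one matches only the string that really contains a dot.

```{r}
library(stringr)

# Hypothetical inputs
x <- c("a.c", "abc")

str_detect(x, "a.c")     # `.` matches any character, so both strings match
str_detect(x, "a\\.c")   # `\\.` matches only a literal dot

# writeLines() shows the regular expression that is actually passed on
writeLines("a\\.c")
```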
|
||||||
|
|
||||||
|
### Repetition
|
||||||
|
|
||||||
|
* `?`: 0 or 1
|
||||||
|
* `+`: 1 or more
|
||||||
|
* `*`: 0 or more
|
||||||
|
|
||||||
|
* `{n}`: exactly n
|
||||||
|
* `{n,}`: n or more
|
||||||
|
* `{0,m}`: at most m
|
||||||
|
* `{n,m}`: between n and m
|
||||||
|
|
||||||
|
(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible, by putting a `?` after them.)
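
To make the greedy versus lazy distinction concrete, here is a short sketch. The strings `x` and `html` are invented for illustration.

```{r}
library(stringr)

# Hypothetical inputs
x <- "aaaa"
html <- "<b>bold</b>"

str_extract(x, "a?")        # at most one "a"
str_extract(x, "a+")        # greedy: takes all four
str_extract(x, "a{2}")      # exactly two
str_extract(x, "a{2,3}")    # greedy: takes three
str_extract(x, "a{2,3}?")   # lazy: stops at two

str_extract(html, "<.+>")   # greedy: runs to the last ">"
str_extract(html, "<.+?>")  # lazy: stops at the first ">"
```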
|
||||||
|
|
||||||
|
### Anchors
|
||||||
|
|
||||||
|
* `^` match the start of the line
|
||||||
|
* `$` match the end of the line
|
||||||
|
* `\b` match boundary between words
|
||||||
|
|
||||||
|
My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
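
A quick sketch of the anchors at work, on a made-up vector. `^` pins the match to the start, `$` pins it to the end, and `\b` (written `"\\b"` in an R string) requires a word boundary.

```{r}
library(stringr)

# Hypothetical inputs
x <- c("apple pie", "pineapple", "apple")

str_detect(x, "^apple")        # starts with "apple"
str_detect(x, "apple$")        # ends with "apple"
str_detect(x, "^apple$")       # is exactly "apple"
str_detect(x, "\\bapple\\b")   # contains "apple" as a whole word
```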
|
||||||
|
|
||||||
|
|
||||||
|
## Detecting matches
|
||||||
|
|
||||||
|
|
||||||
|
### Groups
|
||||||
|
|
||||||
|
`str_match()`, `str_match_all()`
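
As a rough sketch of what `str_match()` gives you: it returns a character matrix with the complete match in the first column and one further column per parenthesised group. The `phones` vector and the pattern are invented for illustration.

```{r}
library(stringr)

# Hypothetical inputs
phones <- c("555-1234", "no number")

# Two groups: three digits, then four digits
m <- str_match(phones, "(\\d{3})-(\\d{4})")
m

# Strings with no match produce a row of NA
m[2, ]
```

`str_match_all()` works the same way, but returns a list of matrices so that each input string can contribute more than one match.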
|
||||||
|
|
||||||
|
## Replacing patterns
|
||||||
|
|
||||||
|
## Other types of pattern
|
||||||
|
|
||||||
|
* `fixed()`
|
||||||
|
* `coll()`
|
||||||
|
* `boundary()`
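
Here is a minimal sketch of how these three pattern modifiers might be used. The inputs are made up, and the comments describe the general idea rather than every option each function accepts.

```{r}
library(stringr)

# fixed(): treat the pattern as a literal string, not a regular expression
str_detect(c("a.c", "abc"), fixed("."))

# coll(): compare using collation rules, here ignoring case
str_detect("A", coll("a", ignore_case = TRUE))

# boundary(): match boundaries between words instead of a pattern
str_split("one two  three", boundary("word"))
```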
|