From 88626be626c7e77bfb4d2d47b561b600954758cd Mon Sep 17 00:00:00 2001
From: hadley
Date: Wed, 21 Oct 2015 09:31:15 -0500
Subject: [PATCH] Start on strings

---
 .travis.yml                |   2 +-
 _includes/package-nav.html |   2 +-
 strings.Rmd                | 110 +++++++++++++++++++++++++++++++++++++
 3 files changed, 112 insertions(+), 2 deletions(-)
 create mode 100644 strings.Rmd

diff --git a/.travis.yml b/.travis.yml
index c381aa4..eaedd29 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -22,7 +22,7 @@ install:
 
   # Install R packages
   - ./travis-tool.sh r_binary_install knitr png
-  - ./travis-tool.sh r_install ggplot2 dplyr tidyr pryr
+  - ./travis-tool.sh r_install ggplot2 dplyr tidyr pryr stringr
   - ./travis-tool.sh github_package hadley/bookdown garrettgman/DSR hadley/readr
 
 script: jekyll build
diff --git a/_includes/package-nav.html b/_includes/package-nav.html
index dd85dad..fdf6305 100644
--- a/_includes/package-nav.html
+++ b/_includes/package-nav.html
@@ -3,8 +3,8 @@
  • Visualize
  • -->
  • Transform
  • +
  • String manipulation
  • Tidy
diff --git a/strings.Rmd b/strings.Rmd
new file mode 100644
index 0000000..fd884ad
--- /dev/null
+++ b/strings.Rmd
@@ -0,0 +1,110 @@
+---
+layout: default
+title: String manipulation
+output: bookdown::html_chapter
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+# String manipulation
+
+When working with text data, one of the most powerful tools at your disposal is regular expressions. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely.
+
+In this chapter, you'll learn the basics of regular expressions using the stringr package.
+
+The chapter concludes with a brief look at the stringi package. This package is what stringr uses internally: it's more complex than stringr (and includes many, many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
+
+## String basics
+
+In R, strings are always stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour.
+
+To include a literal single or double quote in a string, you can use `\` to "escape" it. Note that when you print a string, you see the escapes. To see the raw contents of the string, use `writeLines()` (or for a length-1 character vector, `cat(x, "\n")`).
+
+```{r}
+x <- c("\"", "\\")
+x
+writeLines(x)
+```
+
+Base R contains many functions to work with strings, but we'll generally avoid them because they're inconsistent and hard to remember. A particularly annoying inconsistency is that the function that computes the number of characters in a string, `nchar()`, returns 2 for `NA` (instead of `NA`):
+
+```{r}
+# (Will be fixed in R 3.3.0)
+nchar(NA)
+
+stringr::str_length(NA)
+```
+
+## Introduction to stringr
+
+```{r}
+library(stringr)
+```
+
+The stringr package contains functions for working with strings and patterns. We'll focus on four:
+
+* `str_detect(string, pattern)`: does string match a pattern?
+* `str_extract(string, pattern)`: extract matching pattern from string
+* `str_replace(string, pattern, replacement)`: replace pattern with replacement
+* `str_split(string, pattern)`: split string into pieces at each match of pattern
+
+## Extracting patterns
+
+## Introduction to regular expressions
+
+The goal is not to be exhaustive.
+
+### Character classes and alternatives
+
+* `.`: any character
+* `\d`: a digit
+* `\s`: whitespace
+
+* `x|y`: match x or y
+
+* `[abc]`: match a, b, or c
+* `[a-e]`: match any character between a and e
+* `[^abc]`: match anything except a, b, or c
+
+### Escaping
+
+Because `.` is a special regular expression character, you need to escape it as `\.` to match a literal `.`; inside an R string the backslash must itself be escaped, so the pattern is written `"\\."`.
+
+### Repetition
+
+* `?`: 0 or 1
+* `+`: 1 or more
+* `*`: 0 or more
+
+* `{n}`: exactly n
+* `{n,}`: n or more
+* `{0,m}`: at most m
+* `{n,m}`: between n and m
+
+(By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible, by putting a `?` after them.)
+
+### Anchors
+
+* `^`: match the start of the line
+* `$`: match the end of the line
+* `\b`: match the boundary between words
+
+My favourite mnemonic for remembering which is which (from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953)): begin with power (`^`), end with money (`$`).
+
+
+## Detecting matches
+
+
+### Groups
+
+`str_match()`, `str_match_all()`
+
+## Replacing patterns
+
+## Other types of pattern
+
+* `fixed()`
+* `coll()`
+* `boundary()`
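
The chapter draft lists the stringr functions and regular expression pieces without showing them in use. The following is a minimal sketch of how they might fit together, not part of the patch itself: the sample vector `x` and all variable names are invented for illustration, and only the stringr functions named in the draft are called.

```r
library(stringr)

# Hypothetical sample data, invented for this illustration
x <- c("apple pie", "banana split", "cherry tart 42")

# Detect: does each string contain a digit? (regex \d is written "\\d" in R)
has_digit <- str_detect(x, "\\d")

# Extract: the first run of one or more digits (NA where there is no match)
digits <- str_extract(x, "\\d+")

# Replace: swap the first whitespace character for an underscore
underscored <- str_replace(x, "\\s", "_")

# Split: break each string on whitespace; the result is a list
pieces <- str_split(x, "\\s")

# Anchors: ^ matches at the start of the string, $ at the end
starts_with_a <- str_detect(x, "^a")
ends_with_t <- str_detect(x, "t$")

# Escaping: "\\." matches a literal dot, while "." would match any character
literal_dot <- str_detect(c("a.c", "abc"), "a\\.c")

# Negated character class: the first character that is not a, b, or c
not_abc <- str_extract(x, "[^abc]")

print(has_digit)
print(digits)
print(underscored)
```

Every backslash in a regular expression is doubled inside an R string, which is why the regex `\d` appears as `"\\d"` in the calls above.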