Break up strings chapter
parent 18253a1d52
commit 58f7f16db1
|
@ -46,6 +46,7 @@ rmd_files: [
|
|||
"functions.Rmd",
|
||||
"vectors.Rmd",
|
||||
"iteration.Rmd",
|
||||
"prog-strings.Rmd",
|
||||
|
||||
"communicate.Rmd",
|
||||
"rmarkdown.Rmd",
|
||||
|
|
|
@ -0,0 +1,190 @@
|
|||
## Programming with strings
|
||||
|
||||
```{r}
|
||||
library(stringr)
|
||||
library(tidyr)
|
||||
library(tibble)
|
||||
```
|
||||
|
||||
### Extract
|
||||
|
||||
```{r}
|
||||
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
|
||||
colour_match <- str_c(colours, collapse = "|")
|
||||
colour_match
|
||||
|
||||
more <- sentences[str_count(sentences, colour_match) > 1]
|
||||
str_extract_all(more, colour_match)
|
||||
```
|
||||
|
||||
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
|
||||
|
||||
```{r}
|
||||
|
||||
str_extract_all(more, colour_match, simplify = TRUE)
|
||||
|
||||
x <- c("a", "a b", "a b c")
|
||||
str_extract_all(x, "[a-z]", simplify = TRUE)
|
||||
```
|
||||
|
||||
We don't talk about matrices here, but they are useful elsewhere.
|
||||
|
||||
### Exercises
|
||||
|
||||
1. From the Harvard sentences data, extract:
|
||||
|
||||
1. The first word from each sentence.
|
||||
2. All words ending in `ing`.
|
||||
3. All plurals.
|
||||
|
||||
## Grouped matches
|
||||
|
||||
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching.
|
||||
You can also use parentheses to extract parts of a complex match.
|
||||
For example, imagine we want to extract nouns from the sentences.
|
||||
As a heuristic, we'll look for any word that comes after "a" or "the".
|
||||
Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.
|
||||
|
||||
```{r}
|
||||
noun <- "(a|the) ([^ ]+)"
|
||||
|
||||
has_noun <- sentences %>%
|
||||
str_subset(noun) %>%
|
||||
head(10)
|
||||
has_noun %>%
|
||||
str_extract(noun)
|
||||
```
|
||||
|
||||
`str_extract()` gives us the complete match; `str_match()` gives each individual component.
|
||||
Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
|
||||
|
||||
```{r}
|
||||
has_noun %>%
|
||||
str_match(noun)
|
||||
```
|
||||
|
||||
(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
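Because `str_match()` returns a matrix, you can pull out a single group by column index. The following is a minimal sketch on a small made-up vector (`phrases` is an illustrative name, not part of the chapter's data), so that the result is easy to check by eye:

```{r}
library(stringr)

# Hypothetical toy input; the third string has no "a" or "the" followed by a word
phrases <- c("the cat sat", "a dog barked", "no article here")
noun <- "(a|the) ([^ ]+)"

m <- str_match(phrases, noun)
m[, 1]  # complete match
m[, 2]  # first group: the article
m[, 3]  # second group: the word after the article
```

Strings without a match produce a row of `NA`s, so the result always has one row per input string.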
|
||||
|
||||
## Splitting
|
||||
|
||||
Use `str_split()` to split a string up into pieces.
|
||||
For example, we could split sentences into words:
|
||||
|
||||
```{r}
|
||||
sentences %>%
|
||||
head(5) %>%
|
||||
str_split(" ")
|
||||
```
|
||||
|
||||
Because each component might contain a different number of pieces, this returns a list.
|
||||
If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
|
||||
|
||||
```{r}
|
||||
"a|b|c|d" %>%
|
||||
str_split("\\|") %>%
|
||||
.[[1]]
|
||||
```
|
||||
|
||||
Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
|
||||
|
||||
```{r}
|
||||
sentences %>%
|
||||
head(5) %>%
|
||||
str_split(" ", simplify = TRUE)
|
||||
```
|
||||
|
||||
You can also request a maximum number of pieces:
|
||||
|
||||
```{r}
|
||||
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
|
||||
fields %>% str_split(": ", n = 2, simplify = TRUE)
|
||||
```
|
||||
|
||||
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
|
||||
|
||||
```{r}
|
||||
x <- "This is a sentence. This is another sentence."
|
||||
str_view_all(x, boundary("word"))
|
||||
|
||||
str_split(x, " ")[[1]]
|
||||
str_split(x, boundary("word"))[[1]]
|
||||
```
|
||||
|
||||
## Replace with function
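A minimal sketch of one thing this heading could cover, assuming your installed version of stringr accepts a function as the `replacement` argument of `str_replace_all()` (check `?str_replace_all` if the call errors). The function is called on the matched text and its return value is used as the replacement:

```{r}
library(stringr)

x <- c("apple", "pear", "banana")

# Assumption: `replacement` may be a function that is applied to the matches
str_replace_all(x, "[aeiou]", toupper)
```

Here `toupper()` from base R is used only for illustration; any function that takes a character vector and returns one of the same length should work in its place.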
|
||||
|
||||
## Locations
|
||||
|
||||
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match.
|
||||
These are particularly useful when none of the other functions does exactly what you want.
|
||||
You can use `str_locate()` to find the position of a match, then `str_sub()` to extract or modify the text at that position.
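As a small sketch of that combination (the input vector is made up for illustration), `str_locate()` returns a matrix with `start` and `end` columns that can be passed straight to `str_sub()`:

```{r}
library(stringr)

x <- c("apple pie", "banana split")

# Start and end position of the first run of vowels in each string
loc <- str_locate(x, "[aeiou]+")
loc

# Use the positions to extract the matched text...
str_sub(x, loc[, "start"], loc[, "end"])

# ...or to modify it in place
str_sub(x, loc[, "start"], loc[, "end"]) <- "_"
x
```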
|
||||
|
||||
## stringi
|
||||
|
||||
stringr is built on top of the **stringi** package.
|
||||
stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks.
|
||||
stringi, on the other hand, is designed to be comprehensive.
|
||||
It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
|
||||
|
||||
If you find yourself struggling to do something in stringr, it's worth taking a look at stringi.
|
||||
The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way.
|
||||
The main difference is the prefix: `str_` vs. `stri_`.
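A quick sketch of the correspondence, assuming both packages are installed. Note that stringi function names usually also spell out the matching engine (for example the `_regex` suffix), whereas stringr chooses the engine from the pattern:

```{r}
library(stringr)
library(stringi)

x <- c("apple", "banana", "pear")

# stringr: the pattern is treated as a regular expression by default
str_detect(x, "an")

# stringi: the engine is part of the function name
stri_detect_regex(x, "an")

# Simple helpers map across almost one-to-one
str_length(x)
stri_length(x)
```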
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Find the stringi functions that:
|
||||
|
||||
a. Count the number of words.
|
||||
b. Find duplicated strings.
|
||||
c. Generate random text.
|
||||
|
||||
2. How do you control the language that `stri_sort()` uses for sorting?
|
||||
|
||||
### Exercises
|
||||
|
||||
1. What do the `extra` and `fill` arguments do in `separate()`?
|
||||
Experiment with the various options for the following two toy datasets.
|
||||
|
||||
```{r, eval = FALSE}
|
||||
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
|
||||
separate(x, c("one", "two", "three"))
|
||||
|
||||
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
|
||||
separate(x, c("one", "two", "three"))
|
||||
```
|
||||
|
||||
2. Both `unite()` and `separate()` have a `remove` argument.
|
||||
What does it do?
|
||||
Why would you set it to `FALSE`?
|
||||
|
||||
3. Compare and contrast `separate()` and `extract()`.
|
||||
Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
|
||||
|
||||
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
|
||||
How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
|
||||
|
||||
```{r, eval = FALSE}
|
||||
events <- tribble(
|
||||
~month, ~day,
|
||||
1 , 20,
|
||||
1 , 21,
|
||||
1 , 22
|
||||
)
|
||||
|
||||
events %>%
|
||||
unite("date", month:day, sep = "-", remove = FALSE)
|
||||
```
|
||||
|
||||
5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
|
||||
Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings.
|
||||
Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
|
||||
Do this in two ways: using a positive and a negative value for `sep`.
|
||||
|
||||
```{r}
|
||||
baker <- tribble(
|
||||
~location,
|
||||
"FLBaker County",
|
||||
"GABaker County",
|
||||
"ORBaker County",
|
||||
)
|
||||
baker
|
||||
```
|
17
regexps.Rmd
|
@ -1,5 +1,11 @@
|
|||
# Regular expressions
|
||||
|
||||
## Introduction
|
||||
|
||||
The focus of this chapter will be on regular expressions, or regexps for short.
|
||||
Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings.
|
||||
When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.
|
||||
|
||||
## Matching patterns with regular expressions
|
||||
|
||||
Regexps are a very terse language that allows you to describe patterns in strings.
|
||||
|
@ -229,7 +235,7 @@ Collectively, these operators are called **quantifiers** because they quantify h
|
|||
b. Have three or more vowels in a row.
|
||||
c. Have two or more vowel-consonant pairs in a row.
|
||||
|
||||
4. Solve the beginner regexp crosswords at [<https://regexcrossword.com/challenges/beginner>](https://regexcrossword.com/challenges/beginner){.uri}.
|
||||
4. Solve the beginner regexp crosswords at [\<https://regexcrossword.com/challenges/beginner\>](https://regexcrossword.com/challenges/beginner){.uri}.
|
||||
|
||||
## Grouping and backreferences
|
||||
|
||||
|
@ -245,6 +251,14 @@ str_view(fruit, "(..)\\1", match = TRUE)
|
|||
|
||||
(Shortly, you'll also see how they're useful in conjunction with `str_match()`.)
|
||||
|
||||
You can also use backreferences in a replacement string:
|
||||
|
||||
```{r}
|
||||
sentences %>%
|
||||
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
|
||||
head(5)
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Describe, in words, what these expressions will match:
|
||||
|
@ -380,3 +394,4 @@ See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for mor
|
|||
Don't forget that you're in a programming language and you have other tools at your disposal.
|
||||
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
|
||||
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
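To illustrate the point with a small sketch (the input vector is made up), here are two ways to find strings that contain both an "a" and an "e", in either order. The single regexp has to enumerate both orderings, while the second version combines two trivial patterns with `&`:

```{r}
library(stringr)

x <- c("apple", "pear", "kiwi", "banana")

# One regexp: "a" before "e", or "e" before "a"
single <- str_detect(x, "a.*e|e.*a")

# Two simpler regexps combined with a logical operator
combined <- str_detect(x, "a") & str_detect(x, "e")

identical(single, combined)
```

The combined form also scales more gracefully: adding a third required letter means one more `str_detect()` call rather than six alternatives in a single pattern.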
|
||||
|
||||
|
|
287
strings.Rmd
|
@ -3,9 +3,8 @@
|
|||
## Introduction
|
||||
|
||||
This chapter introduces you to string manipulation in R.
|
||||
You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions, or regexps for short.
|
||||
Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regexps are a concise language for describing patterns in strings.
|
||||
When you first look at a regexp, you'll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.
|
||||
You'll learn the basics of how strings work and how to create them by hand.
|
||||
Strings are a big topic, so they are spread over three chapters.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
|
@ -15,7 +14,7 @@ This chapter will focus on the **stringr** package for string manipulation, whic
|
|||
library(tidyverse)
|
||||
```
|
||||
|
||||
## String basics
|
||||
## Creating a string
|
||||
|
||||
You can create strings with either single quotes or double quotes.
|
||||
Unlike other languages, there is no difference in behaviour.
|
||||
|
@ -44,6 +43,8 @@ single_quote <- '\'' # or "'"
|
|||
|
||||
That means if you want to include a literal backslash, you'll need to double it up: `"\\"`.
|
||||
|
||||
TODO: raw string.
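As a rough sketch of what a raw string looks like, assuming R 4.0.0 or later: everything between `r"(` and `)"` is taken literally, so backslashes do not need to be doubled. The path below is made up for illustration.

```{r}
# Assumes R >= 4.0.0, where raw string syntax is available
path_escaped <- "C:\\temp\\file.txt"   # ordinary string: each backslash is doubled
path_raw <- r"(C:\temp\file.txt)"      # raw string: backslashes are literal

identical(path_escaped, path_raw)
```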
|
||||
|
||||
Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes.
|
||||
To see the raw contents of the string, use `writeLines()`:
|
||||
|
||||
|
@ -68,7 +69,7 @@ Multiple strings are often stored in a character vector, which you can create wi
|
|||
c("one", "two", "three")
|
||||
```
|
||||
|
||||
### String length
|
||||
## String length
|
||||
|
||||
Base R contains many functions to work with strings but we'll avoid them because they can be inconsistent, which makes them hard to remember.
|
||||
Instead we'll use functions from stringr.
|
||||
|
@ -79,13 +80,15 @@ For example, `str_length()` tells you the number of characters in a string:
|
|||
str_length(c("a", "R for data science", NA))
|
||||
```
|
||||
|
||||
What is a letter?
|
||||
|
||||
The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to see all stringr functions:
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("screenshots/stringr-autocomplete.png")
|
||||
```
|
||||
|
||||
### Combining strings
|
||||
## Combining strings
|
||||
|
||||
To combine two or more strings, use `str_c()`:
|
||||
|
||||
|
@ -115,7 +118,7 @@ As shown above, `str_c()` is vectorised, and it automatically recycles shorter v
|
|||
str_c("prefix-", c("a", "b", "c"), "-suffix")
|
||||
```
|
||||
|
||||
Objects of length 0 are silently dropped.
|
||||
`NULL`s are silently dropped.
|
||||
This is particularly useful in conjunction with `if`:
|
||||
|
||||
```{r}
|
||||
|
@ -136,7 +139,7 @@ To collapse a vector of strings into a single string, use `collapse`:
|
|||
str_c(c("x", "y", "z"), collapse = ", ")
|
||||
```
|
||||
|
||||
### Subsetting strings
|
||||
## Subsetting strings
|
||||
|
||||
You can extract parts of a string using `str_sub()`.
|
||||
As well as the string, `str_sub()` takes `start` and `end` arguments which give the (inclusive) position of the substring:
|
||||
|
@ -161,7 +164,9 @@ str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
|
|||
x
|
||||
```
|
||||
|
||||
### Locales
|
||||
TODO: `separate()`
|
||||
|
||||
## Locales
|
||||
|
||||
Above I used `str_to_lower()` to change the text to lower case.
|
||||
You can also use `str_to_upper()` or `str_to_title()`.
|
||||
|
@ -214,18 +219,7 @@ TODO: add connection to `arrange()`
|
|||
6. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into the string `a, b, and c`.
|
||||
Think carefully about what it should do if given a vector of length 0, 1, or 2.
|
||||
|
||||
## Tools
|
||||
|
||||
Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems.
|
||||
In this section you'll learn a wide array of stringr functions that let you:
|
||||
|
||||
- Determine which strings match a pattern.
|
||||
- Find the positions of matches.
|
||||
- Extract the content of matches.
|
||||
- Replace matches with new values.
|
||||
- Split a string based on a match.
|
||||
|
||||
### Detect matches
|
||||
## Detect matches
|
||||
|
||||
To determine if a character vector matches a pattern, use `str_detect()`.
|
||||
It returns a logical vector the same length as the input:
|
||||
|
@ -235,6 +229,8 @@ x <- c("apple", "banana", "pear")
|
|||
str_detect(x, "e")
|
||||
```
|
||||
|
||||
TODO: add basic intro to regexps.
|
||||
|
||||
Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1.
|
||||
That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:
|
||||
|
||||
|
@ -307,11 +303,7 @@ str_count("abababa", "aba")
|
|||
str_view_all("abababa", "aba")
|
||||
```
|
||||
|
||||
Note the use of `str_view_all()`.
|
||||
As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches.
|
||||
The second function will have the suffix `_all`.
|
||||
|
||||
#### Exercises
|
||||
### Exercises
|
||||
|
||||
1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.
|
||||
|
||||
|
@ -323,7 +315,33 @@ The second function will have the suffix `_all`.
|
|||
What word has the highest proportion of vowels?
|
||||
(Hint: what is the denominator?)
|
||||
|
||||
### Extract matches
|
||||
## Replacing matches
|
||||
|
||||
`str_replace_all()` allows you to replace matches with new strings.
|
||||
The simplest use is to replace a pattern with a fixed string:
|
||||
|
||||
```{r}
|
||||
x <- c("apple", "pear", "banana")
|
||||
str_replace_all(x, "[aeiou]", "-")
|
||||
```
|
||||
|
||||
With `str_replace_all()` you can perform multiple replacements by supplying a named vector:
|
||||
|
||||
```{r}
|
||||
x <- c("1 house", "2 cars", "3 people")
|
||||
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
|
||||
```
|
||||
|
||||
#### Exercises
|
||||
|
||||
1. Replace all forward slashes in a string with backslashes.
|
||||
|
||||
2. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
|
||||
|
||||
3. Switch the first and last letters in `words`.
|
||||
Which of those strings are still words?
|
||||
|
||||
## Extract full matches
|
||||
|
||||
To extract the actual text of a match, use `str_extract()`.
|
||||
To show that off, we're going to need a more complicated example.
|
||||
|
@ -364,61 +382,14 @@ str_extract(more, colour_match)
|
|||
|
||||
This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures.
|
||||
To get all matches, use `str_extract_all()`.
|
||||
It returns a list:
|
||||
It returns a list, so we'll come back to this later on.
|
||||
|
||||
```{r}
|
||||
str_extract_all(more, colour_match)
|
||||
```
|
||||
|
||||
You'll learn more about lists in Section \@ref(lists) on lists and Chapter \@ref(iteration) on iteration.
|
||||
|
||||
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
|
||||
|
||||
```{r}
|
||||
str_extract_all(more, colour_match, simplify = TRUE)
|
||||
|
||||
x <- c("a", "a b", "a b c")
|
||||
str_extract_all(x, "[a-z]", simplify = TRUE)
|
||||
```
|
||||
|
||||
#### Exercises
|
||||
### Exercises
|
||||
|
||||
1. In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour.
|
||||
Modify the regex to fix the problem.
|
||||
|
||||
2. From the Harvard sentences data, extract:
|
||||
|
||||
1. The first word from each sentence.
|
||||
2. All words ending in `ing`.
|
||||
3. All plurals.
|
||||
|
||||
### Grouped matches
|
||||
|
||||
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching.
|
||||
You can also use parentheses to extract parts of a complex match.
|
||||
For example, imagine we want to extract nouns from the sentences.
|
||||
As a heuristic, we'll look for any word that comes after "a" or "the".
|
||||
Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.
|
||||
|
||||
```{r}
|
||||
noun <- "(a|the) ([^ ]+)"
|
||||
|
||||
has_noun <- sentences %>%
|
||||
str_subset(noun) %>%
|
||||
head(10)
|
||||
has_noun %>%
|
||||
str_extract(noun)
|
||||
```
|
||||
|
||||
`str_extract()` gives us the complete match; `str_match()` gives each individual component.
|
||||
Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
|
||||
|
||||
```{r}
|
||||
has_noun %>%
|
||||
str_match(noun)
|
||||
```
|
||||
|
||||
(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
|
||||
## Extract part of matches
|
||||
|
||||
If your data is in a tibble, it's often easier to use `tidyr::extract()`.
|
||||
It works like `str_match()` but requires you to name the matches, which are then placed in new columns:
|
||||
|
@ -441,88 +412,7 @@ Like `str_extract()`, if you want all matches for each string, you'll need `str_
|
|||
2. Find all contractions.
|
||||
Separate out the pieces before and after the apostrophe.
|
||||
|
||||
### Replacing matches
|
||||
|
||||
`str_replace()` and `str_replace_all()` allow you to replace matches with new strings.
|
||||
The simplest use is to replace a pattern with a fixed string:
|
||||
|
||||
```{r}
|
||||
x <- c("apple", "pear", "banana")
|
||||
str_replace(x, "[aeiou]", "-")
|
||||
str_replace_all(x, "[aeiou]", "-")
|
||||
```
|
||||
|
||||
With `str_replace_all()` you can perform multiple replacements by supplying a named vector:
|
||||
|
||||
```{r}
|
||||
x <- c("1 house", "2 cars", "3 people")
|
||||
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
|
||||
```
|
||||
|
||||
Instead of replacing with a fixed string you can use backreferences to insert components of the match.
|
||||
In the following code, I flip the order of the second and third words.
|
||||
|
||||
```{r}
|
||||
sentences %>%
|
||||
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
|
||||
head(5)
|
||||
```
|
||||
|
||||
#### Exercises
|
||||
|
||||
1. Replace all forward slashes in a string with backslashes.
|
||||
|
||||
2. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
|
||||
|
||||
3. Switch the first and last letters in `words`.
|
||||
Which of those strings are still words?
|
||||
|
||||
### Splitting
|
||||
|
||||
Use `str_split()` to split a string up into pieces.
|
||||
For example, we could split sentences into words:
|
||||
|
||||
```{r}
|
||||
sentences %>%
|
||||
head(5) %>%
|
||||
str_split(" ")
|
||||
```
|
||||
|
||||
Because each component might contain a different number of pieces, this returns a list.
|
||||
If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
|
||||
|
||||
```{r}
|
||||
"a|b|c|d" %>%
|
||||
str_split("\\|") %>%
|
||||
.[[1]]
|
||||
```
|
||||
|
||||
Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
|
||||
|
||||
```{r}
|
||||
sentences %>%
|
||||
head(5) %>%
|
||||
str_split(" ", simplify = TRUE)
|
||||
```
|
||||
|
||||
You can also request a maximum number of pieces:
|
||||
|
||||
```{r}
|
||||
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
|
||||
fields %>% str_split(": ", n = 2, simplify = TRUE)
|
||||
```
|
||||
|
||||
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
|
||||
|
||||
```{r}
|
||||
x <- "This is a sentence. This is another sentence."
|
||||
str_view_all(x, boundary("word"))
|
||||
|
||||
str_split(x, " ")[[1]]
|
||||
str_split(x, boundary("word"))[[1]]
|
||||
```
|
||||
|
||||
### Separate
|
||||
## Separate
|
||||
|
||||
`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.
|
||||
Take `table3`:
|
||||
|
@ -553,7 +443,7 @@ table3 %>%
|
|||
separate(rate, into = c("cases", "population"), sep = "/")
|
||||
```
|
||||
|
||||
#### Exercises
|
||||
### Exercises
|
||||
|
||||
1. Split up a string like `"apples, pears, and bananas"` into individual components.
|
||||
|
||||
|
@ -562,12 +452,6 @@ table3 %>%
|
|||
3. What does splitting with an empty string (`""`) do?
|
||||
Experiment, and then read the documentation.
|
||||
|
||||
### Find matches
|
||||
|
||||
`str_locate()` and `str_locate_all()` give you the starting and ending positions of each match.
|
||||
These are particularly useful when none of the other functions does exactly what you want.
|
||||
You can use `str_locate()` to find the position of a match, then `str_sub()` to extract or modify the text at that position.
|
||||
|
||||
## Other types of pattern
|
||||
|
||||
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`:
|
||||
|
@ -689,74 +573,3 @@ There are three other functions you can use instead of `regex()`:
|
|||
1. How would you find all strings containing `\` with `regex()` vs. with `fixed()`?
|
||||
|
||||
2. What are the five most common words in `sentences`?
|
||||
|
||||
## stringi
|
||||
|
||||
stringr is built on top of the **stringi** package.
|
||||
stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks.
|
||||
stringi, on the other hand, is designed to be comprehensive.
|
||||
It contains almost every function you might ever need: stringi has `r length(getNamespaceExports("stringi"))` functions to stringr's `r length(getNamespaceExports("stringr"))`.
|
||||
|
||||
If you find yourself struggling to do something in stringr, it's worth taking a look at stringi.
|
||||
The packages work very similarly, so you should be able to translate your stringr knowledge in a natural way.
|
||||
The main difference is the prefix: `str_` vs. `stri_`.
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Find the stringi functions that:
|
||||
|
||||
a. Count the number of words.
|
||||
b. Find duplicated strings.
|
||||
c. Generate random text.
|
||||
|
||||
2. How do you control the language that `stri_sort()` uses for sorting?
|
||||
|
||||
### Exercises
|
||||
|
||||
1. What do the `extra` and `fill` arguments do in `separate()`?
|
||||
Experiment with the various options for the following two toy datasets.
|
||||
|
||||
```{r, eval = FALSE}
|
||||
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
|
||||
separate(x, c("one", "two", "three"))
|
||||
|
||||
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
|
||||
separate(x, c("one", "two", "three"))
|
||||
```
|
||||
|
||||
2. Both `unite()` and `separate()` have a `remove` argument.
|
||||
What does it do?
|
||||
Why would you set it to `FALSE`?
|
||||
|
||||
3. Compare and contrast `separate()` and `extract()`.
|
||||
Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
|
||||
|
||||
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
|
||||
How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
|
||||
|
||||
```{r, eval = FALSE}
|
||||
events <- tribble(
|
||||
~month, ~day,
|
||||
1 , 20,
|
||||
1 , 21,
|
||||
1 , 22
|
||||
)
|
||||
|
||||
events %>%
|
||||
unite("date", month:day, sep = "-", remove = FALSE)
|
||||
```
|
||||
|
||||
5. You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at.
|
||||
Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings.
|
||||
Use `separate()` to represent location information in the following tibble in two columns: `state` (represented by the first two characters) and `county`.
|
||||
Do this in two ways: using a positive and a negative value for `sep`.
|
||||
|
||||
```{r}
|
||||
baker <- tribble(
|
||||
~location,
|
||||
"FLBaker County",
|
||||
"GABaker County",
|
||||
"ORBaker County",
|
||||
)
|
||||
baker
|
||||
```
|
||||
|
|