Merge branch 'master' of github.com:hadley/r4ds

hadley 2016-02-12 06:13:28 -06:00
commit f3877c66d4
2 changed files with 59 additions and 59 deletions


@@ -17,9 +17,9 @@ diamonds <- ggplot2::diamonds
Code is a tool of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you. You want to write clear code so that future-you doesn't curse present-you when you look at a project again after several months have passed.
-To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to becomes more and more clear, and easier to write. In this chapter, you'll learn three important skills that help you to move in this direction:
+To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. In this chapter, you'll learn three important skills that help you move in this direction:
-1. We'll dive deep in to the __pipe__, `%>%`, talking more about how it works
+1. We'll dive deep into the __pipe__, `%>%`, talking more about how it works
and how it gives you a new tool for rewriting your code. You'll also learn
about when not to use the pipe!
@@ -34,7 +34,7 @@ To me, improving your communication skills is a key part of mastering R as a pro
common patterns of for loops and put them in a function. We'll come back to
that idea in XYZ.
-Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better funtions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
+Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely your first attempt will be clear.)
@@ -51,7 +51,7 @@ To explore how you can write the same code in many different ways, let's use cod
> Scooping up the field mice
> And bopping them on the head
-We'll start by defining an object to represent litte bunny Foo Foo:
+We'll start by defining an object to represent little bunny Foo Foo:
```{r, eval = FALSE}
foo_foo <- little_bunny()
@@ -95,7 +95,7 @@ object_size(diamonds, diamonds2)
* `diamonds2` takes up 3.89 MB,
* `diamonds` and `diamonds2` together take up 3.89 MB!
-How can that work? Well, `diamonds2` has 10 columns in common with `diamonds`: there's no need to duplicate all that data so both data frames share the vectors. R will only create a copy of a vector if you modify it. Modifying a single value will mean that the data frames can no longer share as much memory. The individual sizes will be unchange, but the collective size will increase:
+How can that work? Well, `diamonds2` has 10 columns in common with `diamonds`: there's no need to duplicate all that data so both data frames share the vectors. R will only create a copy of a vector if you modify it. Modifying a single value will mean that the data frames can no longer share as much memory. The individual sizes will be unchanged, but the collective size will increase:
```{r}
diamonds$carat[1] <- NA
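# Presumably the chunk continues by re-measuring (a sketch, reusing the
# object_size() calls from above): the individual sizes stay the same,
# but the collective size grows because the modified column is no
# longer shared.
object_size(diamonds)
object_size(diamonds2)
object_size(diamonds, diamonds2)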
@@ -121,7 +121,7 @@ This is less typing (and less thinking), so you're less likely to make mistakes.
1. It will make debugging painful: if you make a mistake you'll need to start
again from scratch.
-1. The reptition of the object being transformed (we've written `foo_foo` six
+1. The repetition of the object being transformed (we've written `foo_foo` six
times!) obscures what's changing on each line.
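To make the criticism concrete, here's a hedged sketch of that style, assuming verbs named after the song (`hop()`, `scoop()`, `bop()`); note `foo_foo` appears six times:

```{r, eval = FALSE}
foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = the_head)
```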
#### Function composition
@@ -205,7 +205,7 @@ library(magrittr)
cor(disp, mpg)
```
-* For assignment. magrittr provides the `%<>%` operator which allows you to
+* For assignment, magrittr provides the `%<>%` operator which allows you to
replace code like:
```R
@@ -219,7 +219,7 @@ library(magrittr)
```
I'm not a fan of this operator because I think assignment is such a
-special operation that it should always be clear when it's occuring.
+special operation that it should always be clear when it's occurring.
In my opinion, a little bit of duplication (i.e. repeating the
name of the object twice), is fine in return for making assignment
more explicit.
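For reference, a minimal sketch of what `%<>%` does (the operator is real magrittr; the toy data is made up):

```R
library(magrittr)

x <- c(3, 1, 2)
x %<>% sort()   # equivalent to: x <- x %>% sort()
x
#> [1] 1 2 3
```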
@@ -237,19 +237,19 @@ The pipe is a powerful tool, but it's not the only tool at your disposal, and it
* You have multiple inputs or outputs. If there is not one primary object
being transformed, write code the regular way.
-* You are start to think about a directed graph with a complex
+* You are starting to think about a directed graph with a complex
dependency structure. Pipes are fundamentally linear and expressing
complex relationships with them typically does not yield clear code.
### Pipes in production
-When you run a pipe interactively, it's easy to see if something goes wrong. When you start writing pipes that are used in production, i.e. they're run automatically and a human doesn't immediately look at the output it's a really good idea to include some assertions that verify the data looks like expect. One great way to do this is the ensurer package, writen by Stefan Milton Bache (the author of magrittr).
+When you run a pipe interactively, it's easy to see if something goes wrong. When you start writing pipes that are used in production, i.e. they're run automatically and a human doesn't immediately look at the output, it's a really good idea to include some assertions that verify the data looks as expected. One great way to do this is the ensurer package, written by Stefan Milton Bache (the author of magrittr).
<http://www.r-statistics.com/2014/11/the-ensurer-package-validation-inside-pipes/>
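As an illustration, a hedged sketch of the kind of assertion ensurer makes possible (check the package documentation for the exact API; the derived column is made up):

```R
library(ensurer)
library(dplyr)

mtcars %>%
  mutate(kml = mpg * 0.425) %>%
  # Fail loudly, mid-pipe, if the data doesn't look as expected:
  ensure_that(is.data.frame(.), all(.$kml > 0)) %>%
  summarise(mean_kml = mean(kml))
```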
## Functions
-One of the best ways to grow in your capabilities as a user of R for data science is to write functions. Functions allow you to automate common tasks, instead of using copy-and-paste. Writing good functions is a lifetime journey: you won't learn everything but you'll hopefully get start walking in the right direction.
+One of the best ways to grow in your capabilities as a user of R for data science is to write functions. Functions allow you to automate common tasks, instead of using copy-and-paste. Writing good functions is a lifetime journey: you won't learn everything but you'll hopefully get to start walking in the right direction.
Whenever you've copied and pasted code more than twice, you need to take a look at it and see if you can extract out the common components and make a function. For example, take a look at this code. What does it do?
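The example itself is elided here; presumably it's along these lines (a sketch: each column of a data frame rescaled to 0-1 by the same copy-pasted expression, which extracts into the `rescale01()` function used later on):

```{r, eval = FALSE}
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
  (max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))

# Extracting the repeated pattern into a function:
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
```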
@@ -344,7 +344,7 @@ foo <- function(x = 1, y = TRUE, z = 10:1) {
}
```
-Default values can depend on other arguments but don't over use this technique as it's possible to create code that is very difficult to understand:
+Default values can depend on other arguments but don't overuse this technique as it's possible to create code that is very difficult to understand:
```{r}
bar <- function(x = y + 1, y = x + 1) {
@@ -352,7 +352,7 @@ bar <- function(x = y + 1, y = x + 1) {
}
```
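By contrast, a more readable use of a dependent default might look like this (a hypothetical helper, not from the book):

```{r}
first_n <- function(x, n = length(x)) {
  # By default, return the whole vector; `n` depends on `x` transparently.
  x[seq_len(min(n, length(x)))]
}
first_n(letters, 3)
```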
-On other aspect of arguments you'll commonly see is `...`. This captures any other arguments not otherwise matched. It's useful because you can then send those `...` on to another argument. This is a useful catch all if your function primarily wraps another function. For example, you might have written your own wrapper designed to add linear model lines to a ggplot:
+One other aspect of arguments you'll commonly see is `...`. This captures any other arguments not otherwise matched. It's useful because you can then send those `...` on to another function. This is a useful catch-all if your function primarily wraps another function. For example, you might have written your own wrapper designed to add linear model lines to a ggplot:
```{r}
geom_lm <- function(formula = y ~ x, colour = alpha("steelblue", 0.5),
@@ -362,7 +362,7 @@ geom_lm <- function(formula = y ~ x, colour = alpha("steelblue", 0.5),
}
```
-This allows you to use any other arguments of `geom_smooth()`, even thoses that aren't explicitly listed in your wrapper (and even arguments that don't exist yet in the version of ggplot2 that you're using).
+This allows you to use any other arguments of `geom_smooth()`, even those that aren't explicitly listed in your wrapper (and even arguments that don't exist yet in the version of ggplot2 that you're using).
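A tiny sketch of the same pass-through pattern in miniature (the helper name is hypothetical):

```{r}
# `...` captures any unmatched arguments and forwards them to str_c().
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:5])
#> [1] "a, b, c, d, e"
```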
Note that arguments in R are lazily evaluated: they're not computed until they're needed. That means if they're never used, they're never called:
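For example (this demonstrates the general language rule, not code from the book):

```{r}
f <- function(x) {
  10
}
# `x` is never used, so the error is never evaluated:
f(stop("this error is never triggered"))
#> [1] 10
```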
@@ -493,9 +493,9 @@ f(10)
You should avoid functions that work like this because it makes it harder to predict what your function will return.
-This behaviour seems like a recipe for bugs, but by and large it doesn't cause too many, especially as you become a more experienced R programmer. The advantage of this behaviour is from a language stand point it allows R to be very consistent. Every name is looked up using the same set of rules. For `f()` that includes the behaviour of two things that you might not expect: `{` and `+`.
+This behaviour seems like a recipe for bugs, but by and large it doesn't cause too many, especially as you become a more experienced R programmer. The advantage of this behaviour is that from a language standpoint it allows R to be very consistent. Every name is looked up using the same set of rules. For `f()` that includes the behaviour of two things that you might not expect: `{` and `+`.
-This consistent set of rules allows for a number of powerful tool that are unfortunately beyond the scope of this book, but you can read about in "Advanced R".
+This consistent set of rules allows for a number of powerful tools that are unfortunately beyond the scope of this book, but you can read about them in "Advanced R".
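A hedged reconstruction of the sort of function being discussed, where the result depends on a variable from the enclosing environment:

```{r}
y <- 100
f <- function(x) {
  # `y` is not an argument, so R looks it up outside the function.
  x + y
}
f(10)
#> [1] 110
```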
#### Exercises
@@ -577,9 +577,9 @@ mean_by <- function(data, group_var, mean_var, n = 10) {
}
```
-Because this tells dplyr to group by `group_var` and compute the mean of `mean_var` neither of which exist in the data frame. A similar problem exists in ggplot2.
+This fails because it tells dplyr to group by `group_var` and compute the mean of `mean_var`, neither of which exists in the data frame. A similar problem exists in ggplot2.
-I've only really recently understood this problem well, so the solutions are currently rather complicated and beyond the scope of this book. You can learn them online techniques with online resources:
+I've only really recently understood this problem well, so the solutions are currently rather complicated and beyond the scope of this book. You can learn about these techniques online:
* Programming with ggplot2 (an excerpt from the ggplot2 book):
http://rpubs.com/hadley/97970
@@ -649,7 +649,7 @@ df$d <- rescale01(df$d)
In this case the output is already present: we're modifying an existing object.
-Need to think about a data frame as a list of column (we'll make this definition precise later on). The length of a data frame is the number of columns. To extract a single column, you use `[[`.
+Think about a data frame as a list of columns (we'll make this definition precise later on). The length of a data frame is the number of columns. To extract a single column, you use `[[`.
That makes our for loop quite simple:
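A sketch of the loop this describes:

```{r, eval = FALSE}
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}
```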
@@ -678,7 +678,7 @@ There are three basic ways to loop over a vector:
but it's difficult to save the output efficiently.
1. Loop over the numeric indices: `for (i in seq_along(xs))`. Most common
-form if you want to know the element (`xs[[i]]`) and it's position.
+form if you want to know the element (`xs[[i]]`) and its position.
1. Loop over the names: `for (nm in names(xs))`. Gives you both the name
and the position. This is useful if you want to use the name in a


@@ -37,7 +37,7 @@ single_quote <- '\'' # or "'"
That means if you want to include a literal `\`, you'll need to double it up: `"\\"`.
-Beware that the printed representation of the string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines()`:
+Beware that the printed representation of the string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
```{r}
x <- c("\"", "\\")
@@ -45,7 +45,7 @@ x
writeLines(x)
```
-There are a handful of other special characters. The most common used are `"\n"`, new line, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. You'll also sometimes strings like `"\u00b5"`, this is a way of writing non-English characters that works on all platforms:
+There are a handful of other special characters. The most commonly used are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. You'll also sometimes see strings like `"\u00b5"`; this is a way of writing non-English characters that works on all platforms:
```{r}
x <- "\u00b5"
@@ -54,7 +54,7 @@ x
### String length
-Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. Their behaviour is particularly inconsistent when it comes to missing values. For examle, `nchar()`, which gives the length of a string, returns 2 for `NA` (instead of `NA`)
+Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. Their behaviour is particularly inconsistent when it comes to missing values. For example, `nchar()`, which gives the length of a string, returns 2 for `NA` (instead of `NA`):
```{r}
# Bug will be fixed in R 3.3.0
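# In affected versions, NA is treated like the string "NA":
nchar(NA)
#> [1] 2
# stringr's equivalent propagates missingness instead (a sketch;
# assumes stringr is loaded):
str_length(NA)
#> [1] NA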
@@ -147,7 +147,7 @@ x
### Locales
-Above I used`str_to_lower()` to change to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first seem because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:
+Above I used `str_to_lower()` to change to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first seem because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:
```{r}
# Turkish has two i's: with and without a dot, and it
@@ -158,7 +158,7 @@ str_to_upper(c("i", "ı"), locale = "tr")
The locale is specified as ISO 639 language codes, which are two or three letter abbreviations. If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list. If you leave the locale blank, it will use the current locale.
-Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the currect locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:
+Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:
```{r}
x <- c("apple", "eggplant", "banana")
@@ -191,9 +191,9 @@ str_sort(x, locale = "haw") # Hawaiian
Regular expressions, regexps for short, are a very terse language that allows you to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
-To learn regular expressions, we'll use `str_show()` and `str_show_all()`. These functions take a character vector and a regular expression, and shows you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
+To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
-### Basics matches
+### Basic matches
The simplest patterns match exact strings:
@@ -202,7 +202,7 @@ x <- c("apple", "banana", "pear")
str_view(x, "an")
```
-The next step up in complexity is `.`, which matches any character (except a new line):
+The next step up in complexity is `.`, which matches any character (except a newline):
```{r, cache = FALSE}
str_view(x, ".a.")
@@ -254,7 +254,7 @@ str_view(x, "^a")
str_view(x, "a$")
```
-To remember which is which, try this mneomic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
+To remember which is which, try this mnemonic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
To force a regular expression to only match a complete string, anchor it with both `^` and `$`:
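```{r, cache = FALSE}
# A sketch of the kind of example this sets up:
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")    # matches within all three strings
str_view(x, "^apple$")  # matches only the complete string "apple"
```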
@@ -289,7 +289,7 @@ You can also match the boundary between words with `\b`. I don't find I often us
There are a number of other special patterns that match more than one character:
-* `.`: any character apart from a new line.
+* `.`: any character apart from a newline.
* `\d`: any digit.
* `\s`: any whitespace (space, tab, newline).
* `[abc]`: match a, b, or c.
@@ -303,7 +303,7 @@ You can use _alternation_ to pick between one or more alternative patterns. For
str_view(c("abc", "xyz"), "abc|xyz")
```
-Like with mathematical expression, if precedence ever gets confusing, use parentheses to make it clear what you want:
+Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r, cache = FALSE}
str_view(c("grey", "gray"), "gr(e|a)y")
@@ -315,7 +315,7 @@ str_view(c("grey", "gray"), "gr(e|a)y")
1. Start with a vowel.
-1. That only contain constants. (Hint: thinking about matching
+1. That only contain consonants. (Hint: think about matching
"not"-vowels.)
1. End with `ed`, but not with `eed`.
@@ -348,12 +348,12 @@ By default these matches are "greedy": they will match the longest string possib
```{r}
# A sketch to illustrate greediness (assuming the Roman-numeral example
# used elsewhere in the book): C{2,3} matches the longest run it can,
# "CCC", rather than stopping at "CC".
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "C{2,3}")
```
-Note that the precedence of these operators are high, so you can write: `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+` or `ba(na){2,}`.
+Note that the precedence of these operators is high, so you can write: `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+` or `ba(na){2,}`.
#### Exercises
1. Describe in words what these regular expressions match:
-(read carefully to see I'm using a regular expression or a string
+(read carefully to see if I'm using a regular expression or a string
that defines a regular expression.)
1. `^.*$`
@@ -364,12 +364,12 @@ Note that the precedence of these operators are high, so you can write: `colou?r
1. Create regular expressions to find all words that:
1. Have three or more vowels in a row.
-1. Start with three consonants
-1. Have two or more vowel-consontant pairs in a row.
+1. Start with three consonants.
+1. Have two or more vowel-consonant pairs in a row.
### Grouping and backreferences
-You learned about parentheses earlier as a way to disambiguate complex expression. They do one other special thing: they also define numeric groups that you can refer to with _backreferences_, `\1`, `\2` etc.For example, the following regular expression finds all fruits that have a pair letters that's repeated.
+You learned about parentheses earlier as a way to disambiguate complex expressions. They do one other special thing: they also define numeric groups that you can refer to with _backreferences_, `\1`, `\2` etc. For example, the following regular expression finds all fruits that have a pair of letters that's repeated.
```{r, cache = FALSE}
str_view(fruit, "(..)\\1", match = TRUE)
@@ -400,15 +400,15 @@ str_detect(c("grey", "gray"), "gr(?:e|a)y")
## Tools
-Now that you've learned the basics of regular expression, it's time to learn how to apply to real problems. In this section you'll learn a wide array of stringr functions that let you:
+Now that you've learned the basics of regular expressions, it's time to learn how to apply them to real problems. In this section you'll learn a wide array of stringr functions that let you:
* Determine which elements match a pattern.
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
-* How can you split a string into based on a match.
+* Split a string based on a match.
-Because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. But since you're in a programming language, it's often easy to break the problem down into smaller pieces. If you find yourself getting stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down in to smaller pieces, solving each challenge before moving onto the next one.
+Because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. But since you're in a programming language, it's often easy to break the problem down into smaller pieces. If you find yourself getting stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
### Detect matches
@@ -419,7 +419,7 @@ x <- c("apple", "banana", "pear")
str_detect(x, "e")
```
-Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1. That makes `sum()` and `mean()` useful if you want answer questions about matches across a larger vector:
+Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1. That makes `sum()` and `mean()` useful if you want to answer questions about matches across a larger vector:
```{r}
# How many common words start with t?
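# (A sketch; assumes `common` is the vector of common words used below.)
sum(str_detect(common, "^t"))
# What proportion of common words end with a vowel?
mean(str_detect(common, "[aeiou]$"))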
@@ -438,7 +438,7 @@ no_vowels_2 <- str_detect(common, "^[^aeiou]+$")
all.equal(no_vowels_1, no_vowels_2)
```
-The results are identical, but I think the first approach is significantly easier to understand. So if you find your regular expression is getting overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining with logical operations.
+The results are identical, but I think the first approach is significantly easier to understand. So if you find your regular expression is getting overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining them with logical operations.
A common use of `str_detect()` is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient `str_subset()` wrapper:
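```{r}
# A sketch of both approaches:
x <- c("apple", "banana", "pear")
x[str_detect(x, "e")]   # logical subsetting
str_subset(x, "e")      # convenient wrapper for the same thing
```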
@@ -468,7 +468,7 @@ Note the use of `str_view_all()`. As you'll shortly learn, many stringr function
### Exercises
-1. For each of the following challenges, try solving it both a single
+1. For each of the following challenges, try solving it by using both a single
regular expression and a combination of multiple `str_detect()` calls.
1. Find all words that start or end with `x`.
@@ -483,7 +483,7 @@ Note the use of `str_view_all()`. As you'll shortly learn, many stringr function
### Extract matches
-To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to tested VOIP systems, but are also useful for practicing regexs.
+To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to test VOIP systems, but are also useful for practicing regexes.
```{r}
length(sentences)
@@ -543,7 +543,7 @@ str_extract_all(x, "[a-z]", simplify = TRUE)
### Grouped matches
-Earlier in this chapter we talked about the use of parentheses for clarifying precedence and to use with backreferences when matching. You can also parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky. Here I use a sequence of at least one character that isn't a space.
+Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky. Here I use a sequence of at least one character that isn't a space.
```{r}
noun <- "(a|the) ([^ ]+)"
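# Presumably the example continues along these lines: str_match()
# returns the complete match plus each captured group.
sentences %>%
  str_subset(noun) %>%
  head(10) %>%
  str_match(noun)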
@@ -607,7 +607,7 @@ sentences %>%
#### Exercises
-1. Replace all `/` in a string with `\`.
+1. Replace all `/`'s in a string with `\`'s.
### Splitting
@@ -619,7 +619,7 @@ sentences %>%
str_split(" ")
```
-Because each component might contain a different number of pieces, this returns a list. If you're working with a length-1 vector, the easiest thing is to just extra the first element of the list:
+Because each component might contain a different number of pieces, this returns a list. If you're working with a length-1 vector, the easiest thing is to just extract the first element of the list:
```{r}
"a|b|c|d" %>%
@@ -635,7 +635,7 @@ sentences %>%
str_split(" ", simplify = TRUE)
```
-You can also request a maximum number of pieces;
+You can also request a maximum number of pieces:
```{r}
fields <- c("Name: Hadley", "County: NZ", "Age: 35")
@@ -657,7 +657,7 @@ str_split(x, boundary("word"))[[1]]
1. Split up a string like `"apples, pears, and bananas"` into individual
components.
-1. Why is it's better to split up by `boundary("word")` than `" "`?
+1. Why is it better to split up by `boundary("word")` than `" "`?
1. What does splitting with an empty string (`""`) do?
@@ -697,7 +697,7 @@ You can use the other arguments of `regex()` to control details of the match:
```
* `comments = TRUE` allows you to use comments and white space to make
-complex regular expressions more understand. Space are ignored, as is
+complex regular expressions more understandable. Spaces are ignored, as is
everything after `#`. To match a literal space, you'll need to escape it:
`"\\ "`.
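A sketch of what such a commented pattern can look like (the phone-number example is an assumption, not necessarily the chapter's own):

```{r}
phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [)- ]?   # optional closing parens, dash, or space
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)

str_match("514-791-8141", phone)
```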
@@ -707,7 +707,7 @@ There are three other functions you can use instead of `regex()`:
* `fixed()`: matches exactly the specified sequence of bytes. It ignores
all special regular expressions and operates at a very low level.
-This allows you to avoid complex escaping can be much faster than
+This allows you to avoid complex escaping and can be much faster than
regular expressions:
```{r}
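# Presumably a timing comparison along these lines (a sketch using
# the microbenchmark package):
microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
)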
@@ -732,7 +732,7 @@ There are three other functions you can use instead of `regex()`:
```
They render identically, but because they're defined differently,
-`fixed()` does find a match. Instead, you can use `coll()`, defined
+`fixed()` doesn't find a match. Instead, you can use `coll()`, defined
next, to respect human character comparison rules:
```{r}
@@ -764,12 +764,12 @@ There are three other functions you can use instead of `regex()`:
stringi::stri_locale_info()
```
-The downside of `coll()` is because the rules for recognising which
+The downside of `coll()` is speed; because the rules for recognising which
characters are the same are complicated, `coll()` is relatively slow
compared to `regex()` and `fixed()`.
* As you saw with `str_split()` you can use `boundary()` to match boundaries.
-You can also use it with the other functions, all though
+You can also use it with the other functions:
```{r, cache = FALSE}
x <- "This is a sentence."
@@ -788,7 +788,7 @@ There are three other functions you can use instead of `regex()`:
There are a few other functions in base R that accept regular expressions:
-* `apropos()` searchs all objects avaiable from the global environment. This
+* `apropos()` searches all objects available from the global environment. This
is useful if you can't quite remember the name of the function.
```{r}
@@ -796,7 +796,7 @@ There are a few other functions in base R that accept regular expressions:
```
* `dir()` lists all the files in a directory. The `pattern` argument takes
-a regular expression and only return file names that match the pattern.
+a regular expression and only returns file names that match the pattern.
For example, you can find all the rmarkdown files in the current
directory with:
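```{r}
# For example (assuming the working directory contains .Rmd files):
head(dir(pattern = "\\.Rmd$"))
```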
@@ -818,9 +818,9 @@ There are a few other functions in base R that accept regular expressions:
### The stringi package
-stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, that have been carefully picked to handle the most common string manipulation functions. stringi on the other hand is designed to be comprehensive. It contains almost every function you might ever need. stringi has `length(ls("package:stringi"))` functions to stringr's `length(ls("package:stringr"))`.
+stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, that have been carefully picked to handle the most common string manipulation functions. stringi on the other hand is designed to be comprehensive. It contains almost every function you might ever need. stringi has `r length(ls("package:stringi"))` functions to stringr's `r length(ls("package:stringr"))`.
-So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The use of the two packages are very similar because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs `stri_`.
+So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The use of the two packages is very similar because stringr was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs `stri_`.
### Encoding
@@ -832,7 +832,7 @@ Complicated and fraught with difficulty. Best approach is to convert to UTF-8 as
Generally, you should fix encoding problems during the data import phase.
-Detect encoding operates statistically, by comparing frequency of byte fragments across languages and encodings. Fundamentally heuristic and works better with larger amounts of text (i.e. a whole file, not a single string from that file).
+Detect encoding operates statistically, by comparing frequency of byte fragments across languages and encodings. It's fundamentally heuristic and works better with larger amounts of text (i.e. a whole file, not a single string from that file).
```{r}
x <- "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu."
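# Presumably the chunk continues by guessing the encoding and then
# re-encoding to UTF-8 (a sketch; stri_enc_detect() is from stringi,
# which stringr is built on):
stringi::stri_enc_detect(x)
str_conv(x, "ISO-8859-1")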