TR review feedback for logicals-factors (#1310)

Hadley Wickham 2023-02-27 16:42:29 -06:00 committed by GitHub
parent b03248a66f
commit c0f0375d44
5 changed files with 104 additions and 154 deletions


@ -12,10 +12,11 @@ status("complete")
Factors are used for categorical variables, variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.
We'll start by motivating why factors are needed for data analysis[^factors-1] and how you can create them with `factor()`. We'll then introduce you to the `gss_cat` dataset which contains a bunch of categorical variables to experiment with.
You'll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.
[^factors-1]: They're also really important for modelling.
### Prerequisites
Base R provides some basic tools for creating and manipulating factors.
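For example, a minimal setup chunk (a sketch, assuming the tidyverse, and with it forcats, is installed):

```{r}
library(tidyverse)
```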
@ -77,7 +78,7 @@ y2 <- factor(x2, levels = month_levels)
y2
```
This seems risky, so you might want to use `forcats::fct()` instead:
```{r}
#| error: true
@ -90,21 +91,17 @@ If you omit the levels, they'll be taken from the data in alphabetical order:
factor(x1)
```
Sorting alphabetically is slightly risky because not every computer will sort strings in the same way.
So `forcats::fct()` orders by first appearance:
```{r}
fct(x1)
```
If you ever need to access the set of valid levels directly, you can do so with `levels()`:
```{r}
levels(y2)
```
You can also create a factor when reading your data with readr with `col_factor()`:
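For instance, a sketch along these lines, where the `csv` string is made up for illustration and `month_levels` comes from earlier in the chapter:

```{r}
csv <- "
month,value
Jan,12
Feb,56
Mar,12"

df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
df$month
```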
@ -169,7 +166,6 @@ For example, imagine you want to explore the average number of hours spent watch
relig_summary <- gss_cat |>
group_by(relig) |>
summarize(
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
@ -223,7 +219,6 @@ rincome_summary <- gss_cat |>
group_by(rincome) |>
summarize(
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
@ -274,7 +269,7 @@ This makes the plot easier to read because the colors of the line at the far rig
#| shape, and widowed starts off low but increases steeply after age
#| 60.
by_age <- gss_cat |>
filter(!is.na(age)) |>
count(age, marital) |>
group_by(age) |>
mutate(
@ -282,11 +277,13 @@ by_age <- gss_cat |>
)
ggplot(by_age, aes(x = age, y = prop, color = marital)) +
geom_line(linewidth = 1) +
scale_color_brewer(palette = "Set1")
ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
geom_line(linewidth = 1) +
scale_color_brewer(palette = "Set1") +
labs(color = "marital")
```
Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables.
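For example, something like the following sketch (the `fct_rev()` call is an assumption about the desired display, flipping the order so the biggest bars come first):

```{r}
gss_cat |>
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(x = marital)) +
  geom_bar()
```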


@ -137,14 +137,14 @@ NA == NA
It's easiest to understand why this is true if we artificially supply a little more context:
```{r}
# We don't know how old Mary is
age_mary <- NA
# We don't know how old John is
age_john <- NA
# Are John and Mary the same age?
age_mary == age_john
# We don't know!
```
@ -191,13 +191,14 @@ We'll come back to cover missing values in more depth in @sec-missing-values.
### Exercises
1. How does `dplyr::near()` work? Type `near` to see the source code. Is `sqrt(2)^2` near 2?
2. Use `mutate()`, `is.na()`, and `count()` together to describe how the missing values in `dep_time`, `sched_dep_time` and `dep_delay` are connected.
## Boolean algebra
Once you have multiple logical vectors, you can combine them together using Boolean algebra.
In R, `&` is "and", `|` is "or", `!` is "not", and `xor()` is exclusive or[^logicals-2].
For example, `df |> filter(!is.na(x))` finds all rows where `x` is not missing and `df |> filter(x < -10 | x > 0)` finds all rows where `x` is smaller than -10 or bigger than 0.
@fig-bool-ops shows the complete set of Boolean operations and how they work.
[^logicals-2]: That is, `xor(x, y)` is true if x is true, or y is true, but not both.
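A quick sketch with a made-up data frame shows these operators in action:

```{r}
df <- tibble(x = c(-15, -5, 3, NA))
df |> filter(!is.na(x))
df |> filter(x < -10 | x > 0)
```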
@ -331,14 +332,15 @@ There are two main logical summaries: `any()` and `all()`.
`all(x)` is the equivalent of `&`; it'll return `TRUE` only if all values of `x` are `TRUE`'s.
Like all summary functions, they'll return `NA` if there are any missing values present, and as usual you can make the missing values go away with `na.rm = TRUE`.
For example, we could use `all()` and `any()` to find out if every flight was delayed by less than an hour or if any flight was delayed by over 5 hours.
And using `group_by()` allows us to do that by day:
```{r}
flights |>
group_by(year, month, day) |>
summarize(
all_delayed = all(dep_delay <= 60, na.rm = TRUE),
any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
.groups = "drop"
)
```
@ -349,36 +351,18 @@ That leads us to the numeric summaries.
### Numeric summaries of logical vectors {#sec-numeric-summaries-of-logicals}
When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` gives the number of `TRUE`s and `mean(x)` gives the proportion of `TRUE`s (because `mean()` is just `sum()` divided by `length()`).
That, for example, allows us to see the proportion of flights that were delayed by less than 60 minutes and the number of flights that were delayed by over 5 hours:
```{r}
flights |>
group_by(year, month, day) |>
summarize(
all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
any_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
.groups = "drop"
)
```
### Logical subsetting
@ -574,6 +558,18 @@ Here are the most important cases that are compatible:
We don't expect you to memorize these rules, but they should become second nature over time because they are applied consistently throughout the tidyverse.
### Exercises
1. A number is even if it's divisible by two, which in R you can find out with `x %% 2 == 0`.
Use this fact and `if_else()` to determine whether each number between 0 and 20 is even or odd.
2. Given a vector of days like `x <- c("Monday", "Saturday", "Wednesday")`, use an `ifelse()` statement to label them as weekends or weekdays.
3. Use `ifelse()` to compute the absolute value of a numeric vector called `x`.
4. Write a `case_when()` statement that uses the `month` and `day` columns from `flights` to label a selection of important US holidays (e.g. New Year's Day, 4th of July, Thanksgiving, and Christmas).
First create a logical column that is either `TRUE` or `FALSE`, and then create a character column that either gives the name of the holiday or is `NA`.
## Summary
The definition of a logical vector is simple because each value must be either `TRUE`, `FALSE`, or `NA`.


@ -91,7 +91,7 @@ This means that it only works inside dplyr verbs:
n()
```
There are a couple of variants of `n()` and `count()` that you might find useful:
- `n_distinct(x)` counts the number of distinct (unique) values of one or more variables.
For example, we could figure out which destinations are served by the most carriers:
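A sketch of that computation (the `carriers` name is illustrative):

```{r}
flights |>
  group_by(dest) |>
  summarize(carriers = n_distinct(carrier)) |>
  arrange(desc(carriers))
```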
@ -216,7 +216,7 @@ df |>
### Modular arithmetic
Modular arithmetic is the technical name for the type of math you did before you learned about decimal places, i.e. division that yields a whole number and a remainder.
In R, `%/%` does integer division and `%%` computes the remainder:
```{r}
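# A minimal illustration; the original chunk body is elided by the diff,
# so these input values are assumed.
1:10 %/% 3
1:10 %% 3
```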
@ -326,7 +326,7 @@ round(x / 0.25) * 0.25
### Cutting numbers into ranges
Use `cut()`[^numbers-1] to break up (aka bin) a numeric vector into discrete buckets:
[^numbers-1]: ggplot2 provides some helpers for common cases in `cut_interval()`, `cut_number()`, and `cut_width()`.
ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.
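For instance, a sketch with made-up break points:

```{r}
x <- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 15, 20))
```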
@ -395,6 +395,8 @@ If you need more complex rolling or sliding aggregates, try the [slider](https:/
Convert them to a more truthful representation of time (either fractional hours or minutes since midnight).
4. Round `dep_time` and `arr_time` to the nearest five minutes.
## General transformations
The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.
@ -436,13 +438,13 @@ In this case, it'll give the number of the "current" row.
When combined with `%%` or `%/%` this can be a useful tool for dividing data into similarly sized groups:
```{r}
df <- tibble(id = 1:10)
df |>
mutate(
row0 = row_number() - 1,
three_groups = row0 %% 3,
three_in_each_group = row0 %/% 3
)
```
@ -474,8 +476,7 @@ You can lead or lag by more than one position by using the second argument, `n`.
### Consecutive identifiers
Sometimes you want to start a new group every time some event occurs.
For example, when you're looking at website data, it's common to want to break up events into sessions, where you begin a new session after a gap of more than `x` minutes since the last activity.
For example, imagine you have the times when someone visited a website:
```{r}
@ -485,23 +486,23 @@ events <- tibble(
```
And you've computed the time between each event, and figured out if there's a gap that's big enough to qualify:
```{r}
events <- events |>
mutate(
diff = time - lag(time, default = first(time)),
has_gap = diff >= 5
)
events
```
But how do we go from that logical vector to something that we can `group_by()`?
`cumsum()`, from @sec-cumulative-and-rolling-aggregates, comes to the rescue: each gap, i.e. each time `has_gap` is `TRUE`, increments `group` by one (see @sec-numeric-summaries-of-logicals for the numerical interpretation of logicals):
```{r}
events |> mutate(
group = cumsum(has_gap)
)
```
@ -513,11 +514,9 @@ df <- tibble(
x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)
```
If you want to keep the first row from each repeated `x`, you could use `group_by()`, `consecutive_id()`, and `slice_head()`:
```{r}
df |>
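  # assumed continuation of the pipeline elided by the diff:
  group_by(id = consecutive_id(x)) |>
  slice_head(n = 1)
```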
@ -720,9 +719,7 @@ Finally, don't forget what you learned in @sec-sample-size: whenever creating nu
### Positions
There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position: `first(x)`, `last(x)`, and `nth(x, n)`.
For example, we can find the first and last departure for each day:
@ -730,18 +727,16 @@ For example, we can find the first and last departure for each day:
flights |>
group_by(year, month, day) |>
summarize(
first_dep = first(dep_time, na_rm = TRUE),
fifth_dep = nth(dep_time, 5, na_rm = TRUE),
last_dep = last(dep_time, na_rm = TRUE)
)
```
(NB: Because dplyr functions use `_` to separate components of function and argument names, these functions use `na_rm` instead of `na.rm`.)
If you're familiar with `[`, which we'll come back to in @sec-subset-many, you might wonder if you ever need these functions.
There are three reasons: the `default` argument allows you to provide a default if the specified position doesn't exist, the `order_by` argument allows you to locally override the order of the rows, and the `na_rm` argument allows you to drop missing values.
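A small sketch of the first two, with a made-up vector:

```{r}
x <- c(3, 1, 2)
nth(x, 5, default = 0)  # position 5 doesn't exist, so the default is returned
first(x, order_by = x)  # value at the first position after ordering by x
```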
Extracting values at positions is complementary to filtering on ranks.
Filtering gives you all variables, with each observation in a separate row:
@ -761,19 +756,17 @@ For example:
- `x / sum(x)` calculates the proportion of a total.
- `(x - mean(x)) / sd(x)` computes a Z-score (standardized to mean 0 and sd 1).
- `(x - min(x)) / (max(x) - min(x))` standardizes to range \[0, 1\].
- `x / first(x)` computes an index based on the first observation.
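To make these concrete, here's a sketch with made-up data:

```{r}
df <- tibble(x = c(10, 20, 30, 40))
df |>
  mutate(
    prop = x / sum(x),
    z = (x - mean(x)) / sd(x),
    range01 = (x - min(x)) / (max(x) - min(x)),
    index = x / first(x)
  )
```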
### Exercises
1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it's 2 hours late.
Should you use arrival delay or departure delay?
Why might you want to use data from `planes`?
2. Which destinations show the greatest variation in air speed?


@ -48,12 +48,10 @@ The simplest patterns consist of letters and numbers which match those character
```{r}
str_view(fruit, "berry")
str_view(fruit, "BERRY")
```
Letters and numbers match exactly and are called **literal characters**.
Most punctuation characters, like `.`, `+`, `*`, `[`, `]`, and `?`, have special meanings[^regexps-2] and are called **meta-characters**. For example, `.` will match any character[^regexps-3], so `"a."` will match any string that contains an "a" followed by another character:
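For example (the test strings are made up for illustration):

```{r}
str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
```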
@ -90,27 +88,18 @@ str_view(c("a", "ab", "abb"), "ab*")
**Character classes** are defined by `[]` and let you match a set of characters, e.g. `[abcd]` matches "a", "b", "c", or "d".
You can also invert the match by starting with `^`: `[^abcd]` matches anything **except** "a", "b", "c", or "d".
We can use this idea to find the words containing an "x" surrounded by vowels, or a "y" surrounded by consonants:
```{r}
str_view(words, "[aeiou][aeiou][aeiou]")
str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
str_view(words, "[aeiou]x[aeiou]")
str_view(words, "[^aeiou]y[^aeiou]")
```
You can use **alternation**, `|`, to pick between one or more alternative patterns.
For example, the following patterns look for fruits containing "apple", "melon", or "nut", or a repeated vowel.
```{r}
str_view(fruit, "apple|pear|banana")
str_view(fruit, "apple|melon|nut")
str_view(fruit, "aa|ee|ii|oo|uu")
```
@ -274,7 +263,9 @@ df |>
separate_wider_regex(
str,
patterns = c(
"<", name = "[A-Za-z]+", ">-",
"<",
name = "[A-Za-z]+",
">-",
gender = ".", "_",
age = "[0-9]+"
)
@ -289,7 +280,9 @@ If the match fails, you can use `too_short = "debug"` to figure out what went wr
What name has the highest proportion of vowels?
(Hint: what is the denominator?)
2. Replace all forward slashes in `"a/b/c/d/e"` with backslashes.
What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes?
(We'll discuss the problem very soon.)
3. Implement a simple version of `str_to_lower()` using `str_replace_all()`.
@ -404,11 +397,10 @@ str_replace_all("abc", c("$", "^", "\\b"), "--")
### Character classes
A **character class**, or character **set**, allows you to match any character in a set.
As we discussed above, you can construct your own sets with `[]`, where `[abc]` matches "a", "b", or "c" and `[^abc]` matches any character except "a", "b", or "c".
Apart from `^`, there are two other characters that have special meaning inside of `[]`:
- `-` defines a range, e.g. `[a-z]` matches any lower case letter and `[0-9]` matches any number.
- `\` escapes special characters, so `[\^\-\]]` matches `^`, `-`, or `]`.
Here are a few examples:
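For instance (the test string is made up for illustration):

```{r}
x <- "abcd ABCD 12345 -!@#%."
str_view(x, "[abc]+")
str_view(x, "[a-z]+")
str_view(x, "[^a-z0-9]+")
```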
@ -444,10 +436,10 @@ The following code demonstrates the six shortcuts with a selection of letters, n
x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
str_view(x, "\\D+")
str_view(x, "\\w+")
str_view(x, "\\W+")
str_view(x, "\\s+")
str_view(x, "\\S+")
str_view(x, "\\w+")
str_view(x, "\\W+")
```
### Quantifiers {#sec-quantifiers}
@ -562,7 +554,7 @@ str_match(x, "gr(?:e|a)y")
g. Contain at least two vowel-consonant pairs in a row.
h. Only consist of repeated vowel-consonant pairs.
4. Create 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise.
Try and make the shortest possible regex!
5. Switch the first and last letters in `words`.
@ -627,23 +619,19 @@ phone <- regex(
r"(
\(? # optional opening parens
(\d{3}) # area code
[)\-]? # optional closing parens or dash
\ ? # optional space
(\d{3}) # another three numbers
[\ -]? # optional space or dash
(\d{4}) # four more numbers
)",
comments = TRUE
)
str_match("514-791-8141", phone)
str_extract(c("514-791-8141", "(123) 456 7890", "123456"), phone)
```
If you're using comments and want to match a space, newline, or `#`, you'll need to escape it with `\`.
### Fixed matches
@ -894,7 +882,7 @@ A good place to start is `vignette("regular-expressions", package = "stringr")`:
Another useful reference is [https://www.regular-expressions.info/](https://www.regular-expressions.info/tutorial.html).
It's not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.
It's also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski.
If you're struggling to find a function that does what you need in stringr, don't be afraid to look in stringi.
You'll find stringi very easy to pick up because it follows many of the same conventions as stringr.


@ -53,7 +53,7 @@ string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
```
If you forget to close a quote, you'll see `+`, the continuation prompt:
> "This is a string without a closing quote
+
@ -116,7 +116,7 @@ But if your string contains `)"` you can instead use `r"[]"` or `r"{}"`, and if
### Other special characters
As well as `\"`, `\'`, and `\\`, there are a handful of other special characters that may come in handy. The most common are `\n`, newline, and `\t`, tab. You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`. This is a way of writing non-English characters that work on all systems. You can see the complete list of other special characters in `?'"'`.
As well as `\"`, `\'`, and `\\`, there are a handful of other special characters that may come in handy. The most common are `\n`, a new line, and `\t`, tab. You'll also sometimes see strings containing Unicode escapes that start with `\u` or `\U`. This is a way of writing non-English characters that work on all systems. You can see the complete list of other special characters in `?Quotes`.
```{r}
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
@ -226,7 +226,7 @@ df <- tribble(
"Marvin", "nectarine",
"Terence", "cantaloupe",
"Terence", "papaya",
"Terence", "madarine"
"Terence", "madarin"
)
df |>
group_by(name) |>
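  # assumed continuation: collapse each person's fruits into one string
  summarize(fruits = str_flatten(fruit, ", "))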
@ -244,7 +244,10 @@ df |>
str_c(letters[1:2], letters[1:3])
```
2. What's the difference between `paste()` and `paste0()`?
How can you recreate the equivalent of `paste()` with `str_c()`?
3. Convert the following expressions from `str_c()` to `str_glue()` or vice versa:
a. `str_c("The price of ", food, " is ", price)`
@ -265,7 +268,7 @@ In this section, you'll learn how to use four tidyr functions to extract them:
If you look closely, you can see there's a common pattern here: `separate_`, then `longer` or `wider`, then `_`, then `delim` or `position`.
That's because these four functions are composed of two simpler primitives:
- Just like with `pivot_longer()` and `pivot_wider()`, `_longer` functions make the input data frame longer by creating new rows and `_wider` functions make the input data frame wider by generating new columns.
- `delim` splits up a string with a delimiter like `", "` or `" "`; `position` splits at specified widths, like `c(3, 5, 2)`.
We'll return to the last member of this family, `separate_wider_regex()`, in @sec-regular-expressions.
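For example, a minimal sketch of `separate_wider_delim()` with made-up data:

```{r}
df <- tibble(x = c("a,1", "b,2", "c,3"))
df |>
  separate_wider_delim(x, delim = ",", names = c("code", "n"))
```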
@ -455,7 +458,7 @@ You'll learn how to find the length of a string, extract substrings, and handle
str_length(c("a", "R for data science", NA))
```
You could use this with `count()` to find the distribution of lengths of US babynames and then with `filter()` to look at the longest names, which happen to have 15 letters[^strings-6]:
[^strings-6]: Looking at these entries, we'd guess that the babynames data drops spaces or hyphens and truncates after 15 letters.
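The counting step might look something like this sketch (assuming the babynames data is loaded):

```{r}
babynames |>
  count(length = str_length(name), wt = n)
```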
@ -500,33 +503,11 @@ babynames |>
)
```
### Exercises
1. When computing the distribution of the length of babynames, why did we use `wt = n`?
2. Use `str_length()` and `str_sub()` to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
3. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
## Non-English text {#sec-other-languages}
@ -589,11 +570,6 @@ Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to h
It's not foolproof and works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
Encodings are a rich and complex topic; we've only scratched the surface here.
If you'd like to learn more, we recommend reading the detailed explanation at <http://kunststube.net/encoding/>.