Little more non-English brainstorming
parent
a813ee1d84
commit
3a45ea5fc5
|
@ -471,4 +471,3 @@ See the Stack Overflow discussion at <http://stackoverflow.com/a/201378> for mor
|
||||||
Don't forget that you're in a programming language and you have other tools at your disposal.
|
Don't forget that you're in a programming language and you have other tools at your disposal.
|
||||||
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
|
Instead of creating one complex regular expression, it's often easier to write a series of simpler regexps.
|
||||||
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
|
If you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.
|
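As a minimal sketch of the "series of simpler regexps" idea, the chunk below checks three easy conditions separately and combines them with logical operators, rather than packing them into one pattern. The task and the example strings are hypothetical, chosen only for illustration.

```{r}
library(stringr)

# Hypothetical task: keep strings that contain a digit AND an uppercase
# letter, but do NOT contain any whitespace.
x <- c("Passw0rd", "password1", "Pass word1", "SECRET9")

has_digit <- str_detect(x, "[0-9]")  # at least one digit
has_upper <- str_detect(x, "[A-Z]")  # at least one uppercase ASCII letter
has_space <- str_detect(x, "\\s")    # any whitespace character

# Combine the simple checks with ordinary logical operators
ok <- has_digit & has_upper & !has_space
x[ok]
```

Each intermediate vector can be inspected on its own, which tends to make debugging easier than untangling a single pattern with lookarounds.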
||||||
|
|
||||||
|
|
55
strings.Rmd
|
@ -528,14 +528,25 @@ starwars %>%
|
||||||
Unicode is a system for representing the many writing systems used around the world.
|
Unicode is a system for representing the many writing systems used around the world.
|
||||||
The fundamental unit is a **code point**.
|
The fundamental unit is a **code point**.
|
||||||
This usually represents something like a letter or symbol, but might also be a modifier such as a diacritic mark or (e.g.) the skin tone of an emoji.
|
This usually represents something like a letter or symbol, but might also be a modifier such as a diacritic mark or (e.g.) the skin tone of an emoji.
|
||||||
|
Character vs grapheme cluster.
|
||||||
|
|
||||||
Encoding, and why not to trust `Encoding`.
|
Include some examples from <https://gankra.github.io/blah/text-hates-you/>.
|
||||||
As a general rule, we recommend using UTF-8 everywhere, converting as early as possible (i.e. by using the `encoding` argument to `readr::locale()`).
|
|
||||||
|
|
||||||
All stringr functions default to the English locale.
|
All stringr functions default to the English locale.
|
||||||
This ensures that your code works the same way on every system, avoiding subtle bugs.
|
This ensures that your code works the same way on every system, avoiding subtle bugs.
|
||||||
|
|
||||||
Include some examples from [<https://gankra.github.io/blah/text-hates-you/>](https://gankra.github.io/blah/text-hates-you/){.uri}.
|
Maybe a "things you think are true, but aren't" list?
|
||||||
|
|
||||||
|
### Encoding
|
||||||
|
|
||||||
|
You will not generally find the base R `Encoding()` useful: it only supports three different encodings (and interpreting what they mean is non-trivial), and it only tells you the encoding that R thinks a string has, not what it really is.
|
||||||
|
And typically the problem is that the declared encoding is wrong.
|
||||||
|
|
||||||
|
The tidyverse follows best practices[^strings-3] of using UTF-8 everywhere, so any string you create with the tidyverse will use UTF-8.
|
||||||
|
It's still possible to have problems, but they'll typically arise during data import.
|
||||||
|
Once you've diagnosed an encoding problem, you should fix it at data import (i.e. by using the `encoding` argument to `readr::locale()`).
|
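A small sketch of fixing an encoding problem at import time. The temporary file and its contents are made up for illustration: the bytes are written by hand so that the file is known to be Latin-1, and `"Latin1"` is the encoding name assumed to match it.

```{r}
library(readr)

# Write a tiny CSV (header "x", one value "café") encoded as Latin-1.
# 0xe9 is "é" in Latin-1; on its own it is not valid UTF-8.
path <- tempfile(fileext = ".csv")
writeBin(as.raw(c(0x78, 0x0a, 0x63, 0x61, 0x66, 0xe9, 0x0a)), path)

# Declare the file's real encoding at import; readr converts to UTF-8.
df <- read_csv(path, locale = locale(encoding = "Latin1"), show_col_types = FALSE)
df$x
```

After import the column is ordinary UTF-8, so downstream string functions need no further encoding handling.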
||||||
|
|
||||||
|
[^strings-3]: <http://utf8everywhere.org>
|
||||||
|
|
||||||
### Length and subsetting
|
### Length and subsetting
|
||||||
|
|
||||||
|
@ -547,20 +558,43 @@ Four most common are Latin, Chinese, Arabic, and Devangari, which represent thre
|
||||||
|
|
||||||
- Chinese.
|
- Chinese.
|
||||||
Logograms.
|
Logograms.
|
||||||
|
Half width vs full width.
|
||||||
|
English letters are roughly twice as high as they are wide.
|
||||||
|
Chinese characters are roughly square.
|
||||||
|
|
||||||
- Arabic is an abjad: only consonants are written, and vowels are optionally marked with diacritics.
|
- Arabic is an abjad: only consonants are written, and vowels are optionally marked with diacritics.
|
||||||
Additionally, it's written from right-to-left, so the first letter is the letter on the far right.
|
Additionally, it's written from right-to-left, so the first letter is the letter on the far right.
|
||||||
|
|
||||||
- Devanagari is an abugida where each symbol represents a consonant-vowel pair, with vowel notation secondary.
|
- Devanagari is an abugida where each symbol represents a consonant-vowel pair, with vowel notation secondary.
|
||||||
|
|
||||||
|
> For instance, 'ch' is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
|
||||||
|
> --- <http://utf8everywhere.org>
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
# But
|
||||||
|
str_split("check", boundary("character", locale = "cs_CZ"))
|
||||||
|
```
|
||||||
|
|
||||||
This is a problem even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
|
This is a problem even with Latin alphabets because many languages use **diacritics**, glyphs added to the basic alphabet.
|
||||||
The difficulty is that Unicode provides two ways of representing characters with accents: many common characters have a special code point, but others can be built up from individual components.
|
The difficulty is that Unicode provides two ways of representing characters with accents: many common characters have a special code point, but others can be built up from individual components.
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x <- c("\u00e1", "a\u0301")
|
x <- c("á", "x́")
|
||||||
x
|
|
||||||
str_length(x)
|
str_length(x)
|
||||||
|
# str_width(x)
|
||||||
str_sub(x, 1, 1)
|
str_sub(x, 1, 1)
|
||||||
|
|
||||||
|
# stri_width(c("全形", "ab"))
|
||||||
|
# 0, 1, or 2
|
||||||
|
# but this assumes no font substitution
|
||||||
|
```
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
cyrillic_a <- "А"
|
||||||
|
latin_a <- "A"
|
||||||
|
cyrillic_a == latin_a
|
||||||
|
stringi::stri_escape_unicode(cyrillic_a)
|
||||||
|
stringi::stri_escape_unicode(latin_a)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Collation rules
|
### Collation rules
|
||||||
|
@ -621,14 +655,13 @@ Unicode collation algorithm: <https://unicode.org/reports/tr10/>
|
||||||
|
|
||||||
Another important operation that's affected by the locale is sorting.
|
Another important operation that's affected by the locale is sorting.
|
||||||
The base R `order()` and `sort()` functions sort strings using the current locale.
|
The base R `order()` and `sort()` functions sort strings using the current locale.
|
||||||
If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()`, which take an additional `locale` argument:
|
If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()`, which take an additional `locale` argument.
|
||||||
|
|
||||||
|
You can also control the collation "strength", which determines how accents are sorted.
|
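A rough sketch of what collation strength does, using stringi directly (the package stringr is built on). It assumes the "strength" mentioned above corresponds to the `strength` option of `stringi::stri_opts_collator()`, which the functions below accept through `...`. The example words are arbitrary.

```{r}
library(stringi)

# Default (tertiary) strength: accents and case are significant,
# so "a" and "á" are not considered equivalent.
default_equiv <- stri_cmp_equiv("a", "\u00e1")

# Primary strength: only base letters matter, so "a" and "á" are equivalent.
primary_equiv <- stri_cmp_equiv("a", "\u00e1", strength = 1)

c(default = default_equiv, primary = primary_equiv)

# The same option can be supplied when sorting.
stri_sort(c("cote", "c\u00f4te", "cot\u00e9"), locale = "fr", strength = 1)
```

At primary strength the three French words compare as equal, so their relative order in the sorted result carries no information about accents.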
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
x <- c("apple", "eggplant", "banana")
|
str_sort(c("a", "ch", "c", "h"))
|
||||||
|
str_sort(c("a", "ch", "c", "h"), locale = "cs_CZ")
|
||||||
str_sort(x, locale = "en") # English
|
|
||||||
|
|
||||||
str_sort(x, locale = "haw") # Hawaiian
|
|
||||||
```
|
```
|
||||||
|
|
||||||
TODO: add connection to `arrange()`
|
TODO: add connection to `arrange()`
|
||||||
|
|