Merge branch 'master' of github.com:hadley/r4ds

This commit is contained in:
Garrett 2015-11-06 17:11:27 -05:00
commit a622b71b2f
12 changed files with 2734 additions and 32 deletions

View File

@ -18,12 +18,13 @@ install:
- rm pandoc-1.12.3.zip
# Install jekyll
- travis_retry gem install jekyll mime-types
- travis_retry gem install mime-types
- travis_retry gem install jekyll -v 2.5.3
# Install R packages
- ./travis-tool.sh r_binary_install knitr png
- ./travis-tool.sh r_install ggplot2 dplyr tidyr
- ./travis-tool.sh github_package hadley/bookdown garrettgman/DSR hadley/readr
- ./travis-tool.sh r_install ggplot2 dplyr tidyr pryr stringr htmlwidgets htmltools microbenchmark
- ./travis-tool.sh github_package hadley/bookdown garrettgman/DSR hadley/readr gaborcsardi/rcorpora hadley/stringr
script: jekyll build

View File

@ -3,11 +3,12 @@
<li><a href="visualize.html">Visualize</a></li>
-->
<li><a href="transform.html">Transform</a></li>
<li><a href="strings.html">String manipulation</a></li>
<!--
<li><a href="strings.html">Regular expresssions</a></li>
<li><a href="dates.html">Dates and times</a></li>
-->
<li><a href="tidy.html">Tidy</a></li>
<li><a href="expressing-yourself.html">Expressing yourself</a></li>
<li><a href="import.html">Import</a></li>
<!--
<li><a href="models.html">Model</a></li>

View File

@ -8,6 +8,11 @@
<link href="www/bootstrap.min.css" rel="stylesheet">
<link href="www/highlight.css" rel="stylesheet">
<script src="www/jquery-1.11.0/jquery.min.js"></script>
<script src="www/htmlwidgets-0.5/htmlwidgets.js"></script>
<link href="www/str_view-0.1.0/str_view.css" rel="stylesheet" />
<script src="www/str_view-binding-1.0.0.9000/str_view.js"></script>
<link href='http://fonts.googleapis.com/css?family=Inconsolata:400,700'
rel='stylesheet' type='text/css'>
</head>

403
expressing-yourself.Rmd Normal file
View File

@ -0,0 +1,403 @@
---
layout: default
title: Expressing yourself
output: bookdown::html_chapter
---
# Expressing yourself in code
Code is a means of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative, and even if you're not working with other people you'll definitely be working with future-you.
After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did.
To me, this is what mastering R as a programming language is all about: making it easier to express yourself, so that over time your becomes more and more clear, and easier to write. In this chapter, you'll learn some of the most important skills, but to learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:
* [Hands on programming with R](http://shop.oreilly.com/product/0636920028574.do),
by Garrett Grolemund. This is an introduction to R as a programming language
and is a great place to start if R is your first programming language.
* [Advanced R](http://adv-r.had.co.nz) by Hadley Wickham. This dives into the
details of R the programming language. This is a great place to start if
you've programmed in other languages and you want to learn what makes R
special, different, and particularly well suited to data analysis.
You get better very slowly if you don't consciously practice, so this chapter brings together a number of ideas that we mention elsewhere into one focussed chapter on code as communication.
```{r}
library(magrittr)
```
This chapter is not comprehensive, but it will illustrate some patterns that in the long-term that will help you write clear and comprehensive code.
The goal is not just to write better funtions or to do things that you couldn't do before, but to code with more "ease".
## Piping
```R
foo_foo <- little_bunny()
```
There are a number of ways that you could write this:
1. Function composition:
```R
bop_on(
scoop_up(
hop_through(foo_foo, forest),
field_mouse
),
head
)
```
The disadvantage is that you have to read from inside-out, from
right-to-left, and that the arguments end up spread far apart
(sometimes called the
[dagwood sandwhich](https://en.wikipedia.org/wiki/Dagwood_sandwich)
problem).
1. Intermediate state:
```R
foo_foo_1 <- hop_through(foo_foo, forest)
foo_foo_2 <- scoop_up(foo_foo_1, field_mouse)
foo_foo_3 <- bop_on(foo_foo_2, head)
```
This avoids the nesting, but you now have to name each intermediate element.
If there are natural names, use this form. But if you're just numbering
them, I don't think it's that useful. Whenever I write code like this,
I invariably write the wrong number somewhere and then spend 10 minutes
scratching my head and trying to figure out what went wrong with my code.
You may also worry that this form creates many intermediate copies of your
data and takes up a lot of memory. First, in R, I don't think worrying about
memory is a useful way to spend your time: worry about it when it becomes
a problem (i.e. you run out of memory), not before. Second, R isn't stupid:
it will reuse the shared columns in a pipeline of data frame transformations.
You can see that using `pryr::object_size()` (unfortunately the built-in
`object.size()` doesn't have quite enough smarts to show you this super
important feature of R):
```{R}
diamonds <- ggplot2::diamonds
pryr::object_size(diamonds)
diamonds2 <- dplyr::mutate(diamonds, price_per_carat = price / carat)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)
```
`diamonds` is 3.46 MB, and `diamonds2` is 3.89 MB, but the total size of
`diamonds` and `diamonds2` is only 3.89 MB. How does that work?
only 3.89 MB
1. Overwrite the original:
```R
foo_foo <- hop_through(foo_foo, forest)
foo_foo <- scoop_up(foo_foo, field_mouse)
foo_foo <- bop_on(foo_foo, head)
```
This is a minor variation of the previous form, where instead of giving
each intermediate element its own name, you use the same name, replacing
the previous value at each step. This is less typing (and less thinking),
so you're less likely to make mistakes. However, it can make debugging
painful, because if you make a mistake you'll need to start from
scratch again. Also, I think the reptition of the object being transformed
(here we've repeated `foo_foo` six times!) obscures the intent of the code.
1. Use the pipe
```R
foo_foo %>%
hop_through(forest) %>%
scoop_up(field_mouse) %>%
bop_on(head)
```
This is my favourite form. The downside is that you need to understand
what the pipe does, but once you've mastered that simple task, you can
read this series of function compositions like it's a set of imperative
actions.
(Behind the scenes magrittr converts this call to the previous form,
using `.` as the name of the object. This makes it easier to debug than
the first form because it avoids deeply nested fuction calls.)
## Useful intermediates
* Whenever you write your own function that is used primarily for its
side-effects, you should always return the first argument invisibly, e.g.
`invisible(x)`: that way it can easily be used in a pipe.
If a function doesn't follow this contract (e.g. `plot()` which returns
`NULL`), you can still use it with magrittr by using the "tee" operator.
`%T>%` works like `%>%` except instead it returns the LHS instead of the
RHS:
```{r}
library(magrittr)
rnorm(100) %>%
matrix(ncol = 2) %>%
plot() %>%
str()
rnorm(100) %>%
matrix(ncol = 2) %T>%
plot() %>%
str()
```
* When you run a pipe interactively, it's easy to see if something
goes wrong. When you start writing pipes that are used in production, i.e.
they're run automatically and a human doesn't immediately look at the output
it's a really good idea to include some assertions that verify the data
looks like expect. One great way to do this is the ensurer package,
writen by Stefan Milton Bache (the author of magrittr).
<http://www.r-statistics.com/2014/11/the-ensurer-package-validation-inside-pipes/>
* If you're working with functions that don't have a dataframe based API
(i.e. you pass them individual vectors, not a data frame and expressions
to be evaluated in the context of that data frame), you might find `%$%`
useful. It "explodes" out the variables in a data frame so that you can
refer to them explicitly. This is useful when working with many functions
in base R:
```{r}
mtcars %$%
cor(disp, mpg)
```
## When not to use the pipe
The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem! Generally, you should reach for another tool when:
* Your pipes get longer than five or six lines. It's a good idea to create
intermediate objects with meaningful names. That helps with debugging,
because it's easier to figure out when things went wrong. It also helps
understand the problem, because a good name can be very evocative of the
purpose.
* You have multiple inputs or outputs.
* Instead of creating a linear pipeline where you're primarily transforming
one object, you're starting to create a directed graphs with a complex
dependency structure. Pipes are fundamentally linear and expressing
complex relationships with them does not often yield clear code.
* For assignment. magrittr provides the `%<>%` operator which allows you to
replace code like:
```R
mtcars <- mtcars %>% transform(cyl = cyl * 2)
```
with
```R
mtcars %<>% transform(cyl = cyl * 2)
```
I'm not a fan of this operator because I think assignment is such a
special operation that it should always be clear when it's occuring.
In my opinion, a little bit of duplication (i.e. repeating the
name of the object twice), is fine in return for making assignment
more explicit.
I think it also gives you a better mental model of how assignment works
in R. The above code does not modify `mtcars`: it instead creates a
modified copy and then replaces the old version (this may seem like a
subtle point but I think it's quite important).
## Duplication
As you become a better R programming, you'll learn more techniques for reducing various types of duplication. This allows you to do more with less, and allows you to express yourself more clearly by taking advantage of powerful programming constructs.
Two main tools for reducing duplication are functions and for-loops. You tend to use for-loops less often in R than in other programming languages because R is a functional programming language. That means that you can extract out common patterns of for loops and put them in a function.
### Extracting out a function
Whenever you've copied and pasted code more than twice, you need to take a look at it and see if you can extract out the common components and make a function. For example, take a look at this code. What does it do?
```{r}
df <- data.frame(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```
You might be able to puzzle out that this rescales each column to 0--1. Did you spot the mistake? I made an error when updating the code for `df$y`, and I forgot to change an `x` to a `y`. Extracting repeated code out into a function is a good idea because it helps make your code more understandable (because you can name the operation), and it prevents you from making this sort of update error.
To write a function you need to first analyse the operation. How many inputs does it have?
```{r, eval = FALSE}
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```
It's often a good idea to rewrite the code using some temporary values. Here this function only takes one input, so I'll call it `x`:
```{r}
x <- 1:10
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```
We can also see some duplication in this code: I'm computing the `min()` and `max()` multiple times, and I could instead do that in one step:
```{r}
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
```
Now that I've simplified the code, and made sure it works, I can turn it into a function:
```{r}
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
```
Always make sure you code works on a simple test case before creating the function!
Now we can use that to simplify our original example:
```{r}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
This makes it more clear what we're doing, and avoids one class of copy-and-paste errors. However, we still have quite a bit of duplication: we're doing the same thing to each column.
### Common looping patterns
Before we tackle the problem of rescaling each column, lets start with a simpler case. Imagine we want to summarise each column with its median. One way to do that is to use a for loop. Every for loop has three main components:
1. Creating the space for the output.
2. The sequence to loop over.
3. The body of the loop.
```{r}
medians <- vector("numeric", ncol(df))
for (i in 1:ncol(df)) {
medians[i] <- median(df[[i]])
}
medians
```
If you do this a lot, you'd probably pull make a function for it:
```{r}
col_medians <- function(df) {
out <- vector("numeric", ncol(df))
for (i in 1:ncol(df)) {
out[i] <- median(df[[i]])
}
out
}
col_medians(df)
```
Now imagine that you also want to compute the interquartile range of each column? How would you change the function? What if you also wanted to calculate the min and max?
```{r}
col_min <- function(df) {
out <- vector("numeric", ncol(df))
for (i in 1:ncol(df)) {
out[i] <- min(df[[i]])
}
out
}
col_max <- function(df) {
out <- vector("numeric", ncol(df))
for (i in 1:ncol(df)) {
out[i] <- max(df[[i]])
}
out
}
```
I've now copied-and-pasted this function three times, so it's time to think about how to generalise it. If you look at these functions, you'll notice that they are very similar: the only difference is the function that gets called.
I mentioned earlier that R is a functional programming language. Practically, what this means is that you can not only pass vectors and data frames to functions, but you can also pass other functions. So you can generalise these `col_*` functions by adding an additional argument:
```{r}
col_summary <- function(df, fun) {
out <- vector("numeric", ncol(df))
for (i in 1:ncol(df)) {
out[i] <- fun(df[[i]])
}
out
}
col_summary(df, median)
col_summary(df, min)
```
We can take this one step further and use another cool feature of R functions: "`...`". "`...`" just takes any additional arguments and allows you to pass them on to another function:
```{r}
col_summary <- function(df, fun, ...) {
out <- vector("numeric", ncol(df))
for (i in 1:ncol(df)) {
out[i] <- fun(df[[i]], ...)
}
out
}
col_summary(df, median, na.rm = TRUE)
```
If you've used R for a bit, the behaviour of function might seem familiar: it looks like the `lapply()` or `sapply()` functions. Indeed, all of the apply function in R abstract over common looping patterns.
There are two main differences with `lapply()` and `col_summary()`:
* `lapply()` returns a list. This allows it to work with any R function, not
just those that return numeric output.
* `lapply()` is written in C, not R. This gives some very minor performance
improvements.
As you learn more about R, you'll learn more functions that allow you to abstract over common patterns of for loops.
### Modifying columns
Going back to our original motivation we want to reduce the duplication in
```{r, eval = FALSE}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
One way to do that is to combine `lapply()` with data frame subsetting:
```{r}
df[] <- lapply(df, rescale01)
```
### Exercises
1. Adapt `col_summary()` so that it only applies to numeric inputs.
You might want to start with an `is_numeric()` function that returns
a logical vector that has a TRUE corresponding to each numeric column.
1. How do `sapply()` and `vapply()` differ from `col_summary()`?

720
harvard-sentences.txt Normal file
View File

@ -0,0 +1,720 @@
The birch canoe slid on the smooth planks.
Glue the sheet to the dark blue background.
It's easy to tell the depth of a well.
These days a chicken leg is a rare dish.
Rice is often served in round bowls.
The juice of lemons makes fine punch.
The box was thrown beside the parked truck.
The hogs were fed chopped corn and garbage.
Four hours of steady work faced us.
Large size in stockings is hard to sell.
The boy was there when the sun rose.
A rod is used to catch pink salmon.
The source of the huge river is the clear spring.
Kick the ball straight and follow through.
Help the woman get back to her feet.
A pot of tea helps to pass the evening.
Smoky fires lack flame and heat.
The soft cushion broke the man's fall.
The salt breeze came across from the sea.
The girl at the booth sold fifty bonds.
The small pup gnawed a hole in the sock.
The fish twisted and turned on the bent hook.
Press the pants and sew a button on the vest.
The swan dive was far short of perfect.
The beauty of the view stunned the young boy.
Two blue fish swam in the tank.
Her purse was full of useless trash.
The colt reared and threw the tall rider.
It snowed, rained, and hailed the same morning.
Read verse out loud for pleasure.
Hoist the load to your left shoulder.
Take the winding path to reach the lake.
Note closely the size of the gas tank.
Wipe the grease off his dirty face.
Mend the coat before you go out.
The wrist was badly strained and hung limp.
The stray cat gave birth to kittens.
The young girl gave no clear response.
The meal was cooked before the bell rang.
What joy there is in living.
A king ruled the state in the early days.
The ship was torn apart on the sharp reef.
Sickness kept him home the third week.
The wide road shimmered in the hot sun.
The lazy cow lay in the cool grass.
Lift the square stone over the fence.
The rope will bind the seven books at once.
Hop over the fence and plunge in.
The friendly gang left the drug store.
Mesh mire keeps chicks inside.
The frosty air passed through the coat.
The crooked maze failed to fool the mouse.
Adding fast leads to wrong sums.
The show was a flop from the very start.
A saw is a tool used for making boards.
The wagon moved on well oiled wheels.
March the soldiers past the next hill.
A cup of sugar makes sweet fudge.
Place a rosebush near the porch steps.
Both lost their lives in the raging storm.
We talked of the slide show in the circus.
Use a pencil to write the first draft.
He ran half way to the hardware store.
The clock struck to mark the third period.
A small creek cut across the field.
Cars and busses stalled in snow drifts.
The set of china hit, the floor with a crash.
This is a grand season for hikes on the road.
The dune rose from the edge of the water.
Those words were the cue for the actor to leave.
A yacht slid around the point into the bay.
The two met while playing on the sand.
The ink stain dried on the finished page.
The walled town was seized without a fight.
The lease ran out in sixteen weeks.
A tame squirrel makes a nice pet.
The horn of the car woke the sleeping cop.
The heart beat strongly and with firm strokes.
The pearl was worn in a thin silver ring.
The fruit peel was cut in thick slices.
The Navy attacked the big task force.
See the cat glaring at the scared mouse.
There are more than two factors here.
The hat brim was wide and too droopy.
The lawyer tried to lose his case.
The grass curled around the fence post.
Cut the pie into large parts.
Men strive but seldom get rich.
Always close the barn door tight.
He lay prone and hardly moved a limb.
The slush lay deep along the street.
A wisp of cloud hung in the blue air.
A pound of sugar costs more than eggs.
The fin was sharp and cut the clear water.
The play seems dull and quite stupid.
Bail the boat, to stop it from sinking.
The term ended in late June that year.
A tusk is used to make costly gifts.
Ten pins were set in order.
The bill as paid every third week.
Oak is strong and also gives shade.
Cats and dogs each hate the other.
The pipe began to rust while new.
Open the crate but don't break the glass.
Add the sum to the product of these three.
Thieves who rob friends deserve jail.
The ripe taste of cheese improves with age.
Act on these orders with great speed.
The hog crawled under the high fence.
Move the vat over the hot fire.
The bark of the pine tree was shiny and dark.
Leaves turn brown and yellow in the fall.
The pennant waved when the wind blew.
Split the log with a quick, sharp blow.
Burn peat after the logs give out.
He ordered peach pie with ice cream.
Weave the carpet on the right hand side.
Hemp is a weed found in parts of the tropics.
A lame back kept his score low.
We find joy in the simplest things.
Type out three lists of orders.
The harder he tried the less he got done.
The boss ran the show with a watchful eye.
The cup cracked and spilled its contents.
Paste can cleanse the most dirty brass.
The slang word for raw whiskey is booze.
It caught its hind paw in a rusty trap.
The wharf could be seen at the farther shore.
Feel the heat of the weak dying flame.
The tiny girl took off her hat.
A cramp is no small danger on a swim.
He said the same phrase thirty times.
Pluck the bright rose without leaves.
Two plus seven is less than ten.
The glow deepened in the eyes of the sweet girl.
Bring your problems to the wise chief.
Write a fond note to the friend you cherish.
Clothes and lodging are free to new men.
We frown when events take a bad turn.
Port is a strong wine with s smoky taste.
The young kid jumped the rusty gate.
Guess the results from the first scores.
A salt pickle tastes fine with ham.
The just claim got the right verdict.
These thistles bend in a high wind.
Pure bred poodles have curls.
The tree top waved in a graceful way.
The spot on the blotter was made by green ink.
Mud was spattered on the front of his white shirt.
The cigar burned a hole in the desk top.
The empty flask stood on the tin tray.
A speedy man can beat this track mark.
He broke a new shoelace that day.
The coffee stand is too high for the couch.
The urge to write short stories is rare.
The pencils have all been used.
The pirates seized the crew of the lost ship.
We tried to replace the coin but failed.
She sewed the torn coat quite neatly.
The sofa cushion is red and of light weight.
The jacket hung on the back of the wide chair.
At that high level the air is pure.
Drop the two when you add the figures.
A filing case is now hard to buy.
An abrupt start does not win the prize.
Wood is best for making toys and blocks.
The office paint was a dull sad tan.
He knew the skill of the great young actress.
A rag will soak up spilled water.
A shower of dirt fell from the hot pipes.
Steam hissed from the broken valve.
The child almost hurt the small dog.
There was a sound of dry leaves outside.
The sky that morning was clear and bright blue.
Torn scraps littered the stone floor.
Sunday is the best part of the week.
The doctor cured him with these pills.
The new girl was fired today at noon.
They felt gay when the ship arrived in port.
Add the store's account to the last cent.
Acid burns holes in wool cloth.
Fairy tales should be fun to write.
Eight miles of woodland burned to waste.
The third act was dull and tired the players.
A young child should not suffer fright.
Add the column and put the sum here.
We admire and love a good cook.
There the flood mark is ten inches.
He carved a head from the round block of marble.
She has st smart way of wearing clothes.
The fruit of a fig tree is apple-shaped.
Corn cobs can be used to kindle a fire.
Where were they when the noise started.
The paper box is full of thumb tacks.
Sell your gift to a buyer at a good gain.
The tongs lay beside the ice pail.
The petals fall with the next puff of wind.
Bring your best compass to the third class.
They could laugh although they were sad.
Farmers came in to thresh the oat crop.
The brown house was on fire to the attic.
The lure is used to catch trout and flounder.
Float the soap on top of the bath water.
A blue crane is a tall wading bird.
A fresh start will work such wonders.
The club rented the rink for the fifth night.
After the dance they went straight home.
The hostess taught the new maid to serve.
He wrote his last novel there at the inn.
Even the worst will beat his low score.
The cement had dried when he moved it.
The loss of the second ship was hard to take.
The fly made its way along the wall.
Do that with a wooden stick.
Lire wires should be kept covered.
The large house had hot water taps.
It is hard to erase blue or red ink.
Write at once or you may forget it.
The doorknob was made of bright clean brass.
The wreck occurred by the bank on Main Street.
A pencil with black lead writes best.
Coax a young calf to drink from a bucket.
Schools for ladies teach charm and grace.
The lamp shone with a steady green flame.
They took the axe and the saw to the forest.
The ancient coin was quite dull and worn.
The shaky barn fell with a loud crash.
Jazz and swing fans like fast music.
Rake the rubbish up and then burn it.
Slash the gold cloth into fine ribbons.
Try to have the court decide the case.
They are pushed back each time they attack.
He broke his ties with groups of former friends.
They floated on the raft to sun their white backs.
The map had an X that meant nothing.
Whitings are small fish caught in nets.
Some ads serve to cheat buyers.
Jerk the rope and the bell rings weakly.
A waxed floor makes us lose balance.
Madam, this is the best brand of corn.
On the islands the sea breeze is soft and mild.
The play began as soon as we sat down.
This will lead the world to more sound and fury
Add salt before you fry the egg.
The rush for funds reached its peak Tuesday.
The birch looked stark white and lonesome.
The box is held by a bright red snapper.
To make pure ice, you freeze water.
The first worm gets snapped early.
Jump the fence and hurry up the bank.
Yell and clap as the curtain slides back.
They are men nho walk the middle of the road.
Both brothers wear the same size.
In some forin or other we need fun.
The prince ordered his head chopped off.
The houses are built of red clay bricks.
Ducks fly north but lack a compass.
Fruit flavors are used in fizz drinks.
These pills do less good than others.
Canned pears lack full flavor.
The dark pot hung in the front closet.
Carry the pail to the wall and spill it there.
The train brought our hero to the big town.
We are sure that one war is enough.
Gray paint stretched for miles around.
The rude laugh filled the empty room.
High seats are best for football fans.
Tea served from the brown jug is tasty.
A dash of pepper spoils beef stew.
A zestful food is the hot-cross bun.
The horse trotted around the field at a brisk pace.
Find the twin who stole the pearl necklace.
Cut the cord that binds the box tightly.
The red tape bound the smuggled food.
Look in the corner to find the tan shirt.
The cold drizzle will halt the bond drive.
Nine men were hired to dig the ruins.
The junk yard had a mouldy smell.
The flint sputtered and lit a pine torch.
Soak the cloth and drown the sharp odor.
The shelves were bare of both jam or crackers.
A joy to every child is the swan boat.
All sat frozen and watched the screen.
ii cloud of dust stung his tender eyes.
To reach the end he needs much courage.
Shape the clay gently into block form.
The ridge on a smooth surface is a bump or flaw.
Hedge apples may stain your hands green.
Quench your thirst, then eat the crackers.
Tight curls get limp on rainy days.
The mute muffled the high tones of the horn.
The gold ring fits only a pierced ear.
The old pan was covered with hard fudge.
Watch the log float in the wide river.
The node on the stalk of wheat grew daily.
The heap of fallen leaves was set on fire.
Write fast, if you want to finish early.
His shirt was clean but one button was gone.
The barrel of beer was a brew of malt and hops.
Tin cans are absent from store shelves.
Slide the box into that empty space.
The plant grew large and green in the window.
The beam dropped down on the workmen's head.
Pink clouds floated JTith the breeze.
She danced like a swan, tall and graceful.
The tube was blown and the tire flat and useless.
It is late morning on the old wall clock.
Let's all join as we sing the last chorus.
The last switch cannot be turned off.
The fight will end in just six minutes.
The store walls were lined with colored frocks.
The peace league met to discuss their plans.
The rise to fame of a person takes luck.
Paper is scarce, so write with much care.
The quick fox jumped on the sleeping cat.
The nozzle of the fire hose was bright brass.
Screw the round cap on as tight as needed.
Time brings us many changes.
The purple tie was ten years old.
Men think and plan and sometimes act.
Fill the ink jar with sticky glue.
He smoke a big pipe with strong contents.
We need grain to keep our mules healthy.
Pack the records in a neat thin case.
The crunch of feet in the snow was the only sound.
The copper bowl shone in the sun's rays.
Boards will warp unless kept dry.
The plush chair leaned against the wall.
Glass will clink when struck by metal.
Bathe and relax in the cool green grass.
Nine rows of soldiers stood in line.
The beach is dry and shallow at low tide.
The idea is to sew both edges straight.
The kitten chased the dog down the street.
Pages bound in cloth make a book.
Try to trace the fine lines of the painting.
Women form less than half of the group.
The zones merge in the central part of town.
A gem in the rough needs work to polish.
Code is used when secrets are sent.
Most of the new is easy for us to hear.
He used the lathe to make brass objects.
The vane on top of the pole revolved in the wind.
Mince pie is a dish served to children.
The clan gathered on each dull night.
Let it burn, it gives us warmth and comfort.
A castle built from sand fails to endure.
A child's wit saved the day for us.
Tack the strip of carpet to the worn floor.
Next Tuesday we must vote.
Pour the stew from the pot into the plate.
Each penny shone like new.
The man went to the woods to gather sticks.
The dirt piles were lines along the road.
The logs fell and tumbled into the clear stream.
Just hoist it up and take it away,
A ripe plum is fit for a king's palate.
Our plans right now are hazy.
Brass rings are sold by these natives.
It takes a good trap to capture a bear.
Feed the white mouse some flower seeds.
The thaw came early and freed the stream.
He took the lead and kept it the whole distance.
The key you designed will fit the lock.
Plead to the council to free the poor thief.
Better hash is made of rare beef.
This plank was made for walking on.
The lake sparkled in the red hot sun.
He crawled with care along the ledge.
Tend the sheep while the dog wanders.
It takes a lot of help to finish these.
Mark the spot with a sign painted red.
Take two shares as a fair profit.
The fur of cats goes by many names.
North winds bring colds and fevers.
He asks no person to vouch for him.
Go now and come here later.
A sash of gold silk will trim her dress.
Soap can wash most dirt away.
That move means the game is over.
He wrote down a long list of items.
A siege will crack the strong defense.
Grape juice and water mix well.
Roads are paved with sticky tar.
Fake &ones shine but cost little.
The drip of the rain made a pleasant sound.
Smoke poured out of every crack.
Serve the hot rum to the tired heroes.
Much of the story makes good sense.
The sun came up to light the eastern sky.
Heave the line over the port side.
A lathe cuts and trims any wood.
It's a dense crowd in two distinct ways.
His hip struck the knee of the next player.
The stale smell of old beer lingers.
The desk was firm on the shaky floor.
It takes heat to bring out the odor.
Beef is scarcer than some lamb.
Raise the sail and steer the ship northward.
The cone costs five cents on Mondays.
A pod is what peas always grow in.
Jerk the dart from the cork target.
No cement will hold hard wood.
We now have a new base for shipping.
The list of names is carved around the base.
The sheep were led home by a dog.
Three for a dime, the young peddler cried.
The sense of smell is better than that of touch.
No hardship seemed to keep him sad.
Grace makes up for lack of beauty.
Nudge gently but wake her now.
The news struck doubt into restless minds.
Once we stood beside the shore.
A chink in the wall allowed a draft to blow.
Fasten two pins on each side.
A cold dip restores health and zest.
He takes the oath of office each March.
The sand drifts over the sill of the old house.
The point of the steel pen was bent and twisted.
There is a lag between thought and act.
Seed is needed to plant the spring corn.
Draw the chart with heavy black lines.
The boy owed his pal thirty cents.
The chap slipped into the crowd and was lost.
Hats are worn to tea and not to dinner.
The ramp led up to the wide highway.
Beat the dust from the rug onto the lawn.
Say it slow!y but make it ring clear.
The straw nest housed five robins.
Screen the porch with woven straw mats.
This horse will nose his way to the finish.
The dry wax protects the deep scratch.
He picked up the dice for a second roll.
These coins will be needed to pay his debt.
The nag pulled the frail cart along.
Twist the valve and release hot steam.
The vamp of the shoe had a gold buckle.
The smell of burned rags itches my nose.
Xew pants lack cuffs and pockets.
The marsh will freeze when cold enough.
They slice the sausage thin with a knife.
The bloom of the rose lasts a few days.
A gray mare walked before the colt.
Breakfast buns are fine with a hot drink.
Bottles hold four kinds of rum.
The man wore a feather in his felt hat.
He wheeled the bike past. the winding road.
Drop the ashes on the worn old rug.
The desk and both chairs were painted tan.
Throw out the used paper cup and plate.
A clean neck means a neat collar.
The couch cover and hall drapes were blue.
The stems of the tall glasses cracked and broke.
The wall phone rang loud and often.
The clothes dried on a thin wooden rack.
Turn on the lantern which gives us light.
The cleat sank deeply into the soft turf.
The bills were mailed promptly on the tenth of the month.
To have is better than to wait and hope.
The price is fair for a good antique clock.
The music played on while they talked.
Dispense with a vest on a day like this.
The bunch of grapes was pressed into wine.
He sent the figs, but kept the ripe cherries.
The hinge on the door creaked with old age.
The screen before the fire kept in the sparks.
Fly by night, and you waste little time.
Thick glasses helped him read the print.
Birth and death mark the limits of life.
The chair looked strong but had no bottom.
The kite flew wildly in the high wind.
A fur muff is stylish once more.
The tin box held priceless stones.
We need an end of all such matter.
The case was puzzling to the old and wise.
The bright lanterns were gay on the dark lawn.
We don't get much money but we have fun.
The youth drove with zest, but little skill.
Five years he lived with a shaggy dog.
A fence cuts through the corner lot.
The way to save money is not to spend much.
Shut the hatch before the waves push it in.
The odor of spring makes young hearts jump.
Crack the walnut with your sharp side teeth.
He offered proof in the form of a lsrge chart.
Send the stuff in a thick paper bag.
A quart of milk is water for the most part.
They told wild tales to frighten him.
The three story house was built of stone.
In the rear of the ground floor was a large passage.
A man in a blue sweater sat at the desk.
Oats are a food eaten by horse and man.
Their eyelids droop for want. of sleep.
The sip of tea revives his tired friend.
There are many ways to do these things.
Tuck the sheet under the edge of the mat.
A force equal to that would move the earth.
We like to see clear weather.
The work of the tailor is seen on each side.
Take a chance and win a china doll.
Shake the dust from your shoes, stranger.
She was kind to sick old people.
The dusty bench stood by the stone wall.
The square wooden crate was packed to be shipped.
We dress to suit the weather of most days.
Smile when you say nasty words.
A bowl of rice is free with chicken stew.
The water in this well is a source of good health.
Take shelter in this tent, but keep still.
That guy is the writer of a few banned books.
The little tales they tell are false.
The door was barred, locked, and bolted as well.
Ripe pears are fit for a queen's table.
A big wet stain was on the round carpet.
The kite dipped and swayed, but stayed aloft.
The pleasant hours fly by much too soon.
The room was crowded with a wild mob.
This strong arm shall shield your honor.
She blushed when he gave her a white orchid.
The beetle droned in the hot June sun.
Press the pedal with your left foot.
Neat plans fail without luck.
The black trunk fell from the landing.
The bank pressed for payment of the debt.
The theft of the pearl pin was kept secret.
Shake hands with this friendly child.
The vast space stretched into the far distance.
A rich farm is rare in this sandy waste.
His wide grin earned many friends.
Flax makes a fine brand of paper.
Hurdle the pit with the aid of a long pole.
A strong bid may scare your partner stiff.
Even a just cause needs power to win.
Peep under the tent and see the clowns.
The leaf drifts along with a slow spin.
Cheap clothes are flashy but don??????t last.
A thing of small note can cause despair.
Flood the mails with requests for this book.
A thick coat of black paint covered all.
The pencil was cut to be sharp at both ends.
Those last words were a strong statement.
He wrote his name boldly at the top of tile sheet.
Dill pickles are sour but taste fine.
Down that road is the way to the grain farmer.
Either mud or dust are found at all times.
The best method is to fix it in place with clips.
If you mumble your speech will be lost.
At night the alarm roused him from a deep sleep.
Read just what the meter says.
Fill your pack with bright trinkets for the poor.
The small red neon lamp went out.
Clams are small, round, soft, and tasty.
The fan whirled its round blades softly.
The line where the edges join was clean.
Breathe deep and smell the piny air.
It matters not if he reads these words or those.
A brown leather bag hung from its strap.
A toad and a frog are hard to tell apart.
A white silk jacket goes with any shoes.
A break in the dam almost caused a flood.
Paint the sockets in the wall dull green.
The child crawled into the dense grass.
Bribes fail where honest men work.
Trample the spark, else the flames will spread.
The hilt. of the sword was carved with fine designs.
A round hole was drilled through the thin board.
Footprints showed the path he took up the beach.
She was waiting at my front lawn.
A vent near the edge brought in fresh air.
Prod the old mule with a crooked stick.
It is a band of steel three inches wide.
The pipe ran almost the length of the ditch.
It was hidden from sight by a mass of leaves and shrubs.
The weight. of the package was seen on the high scale.
Wake and rise, and step into the green outdoors.
The green light in the brown box flickered.
The brass tube circled the high wall.
The lobes of her ears were pierced to hold rings.
Hold the hammer near the end to drive the nail.
Next Sunday is the twelfth of the month.
Every word and phrase he speaks is true.
He put his last cartridge into the gun and fired.
They took their kids from the public school.
Drive the screw straight into the wood.
Keep the hatch tight and the watch constant.
Sever the twine with a quick snip of the knife.
Paper will dry out when wet.
Slide the catch back and open the desk.
Help the weak to preserve their strength.
A sullen smile gets few friends.
Stop whistling and watch the boys march.
Jerk the cord, and out tumbles the gold.
Slidc the tray across the glass top.
The cloud moved in a stately way and was gone.
Light maple makes for a swell room.
Set the piece here and say nothing.
Dull stories make her laugh.
A stiff cord will do to fasten your shoe.
Get the trust fund to the bank early.
Choose between the high road and the low.
A plea for funds seems to come again.
He lent his coat to the tall gaunt stranger.
There is a strong chance it will happen once more.
The duke left the park in a silver coach.
Greet the new guests and leave quickly.
When the frost has come it is time for turkey.
Sweet words work better than fierce.
A thin stripe runs down the middle.
A six comes up more often than a ten.
Lush fern grow on the lofty rocks.
The ram scared the school children off.
The team with the best timing looks good.
The farmer swapped his horse for a brown ox.
Sit on the perch and tell the others what to do.
A steep trail is painful for our feet.
The early phase of life moves fast.
Green moss grows on the northern side.
Tea in thin china has a sweet taste.
Pitch the straw through the door of the stable.
The latch on the beck gate needed a nail.
The goose was brought straight from the old market.
The sink is the thing in which we pile dishes.
A whiff of it will cure the most stubborn cold.
The facts don’t always show who is right.
She flaps her cape as she parades the street.
The loss of the cruiser was a blow to the fleet.
Loop the braid to the left and then over.
Plead with the lawyer to drop the lost cause.
Calves thrive on tender spring grass.
Post no bills on this office wall.
Tear a thin sheet from the yellow pad.
A cruise in warm waters in a sleek yacht is fun.
A streak of color ran down the left edge.
It was done before the boy could see it.
Crouch before you jump or miss the mark.
Pack the kits and don’t forget the salt.
The square peg will settle in the round hole.
Fine soap saves tender skin.
Poached eggs and tea must suffice.
Bad nerves are jangled by a door slam.
Ship maps are different from those for planes.
Dimes showered down from all sides.
They sang the same tunes at each party.
The sky in the west is tinged with orange red.
The pods of peas ferment in bare fields.
The horse balked and threw the tall rider.
The hitch between the horse and cart broke.
Pile the coal high in the shed corner.
The gold vase is both rare and costly.
The knife was hung inside its bright sheath.
The rarest spice comes from the far East.
The roof should be tilted at a sharp slant.
A smatter of French is worse than none.
The mule trod the treadmill day and night.
The aim of the contest is to raise a great fund.
To send it. now in large amounts is bad.
There is a fine hard tang in salty air.
Cod is the main business of the north shore.
The slab was hewn from heavy blocks of slat’e.
Dunk the stale biscuits into strong drink.
Hang tinsel from both branches.
Cap the jar with a tight brass cover.
The poor boy missed the boat again.
Be sure to set the lamp firmly in the hole.
Pick a card and slip it. under the pack.
A round mat will cover the dull spot.
The first part of the plan needs changing.
The good book informs of what we ought to know.
The mail comes in three batches per day.
You cannot brew tea in a cold pot.
Dots of light betrayed the black cat.
Put the chart on the mantel and tack it down.
The night shift men rate extra pay.
The red paper brightened the dim stage.
See the player scoot to third base.
Slide the bill between the two leaves.
Many hands help get the job done.
We don't like to admit our small faults.
No doubt about the way the wind blows.
Dig deep in the earth for pirate's gold.
The steady drip is worse than a drenching rain.
A flat pack takes less luggage space.
Green ice frosted the punch bowl.
A stuffed chair slipped from the moving van.
The stitch will serve but needs to be shortened.
A thin book fits in the side pocket.
The gloss on top made it unfit to read.
The hail pattered on the burnt brown grass.
Seven seals were stamped on great sheets.
Our troops are set to strike heavy blows.
The store was jammed before the sale could start.
It was a bad error on the part of the new judge.
One step more and the board will collapse.
Take the match and strike it against your shoe.
The pot boiled, but the contents failed to jell.
The baby puts his right foot in his mouth.
The bombs left most of the town in ruins.
Stop and stare at the hard working man.
The streets are narrow and full of sharp turns.
The pup jerked the leash as he saw a feline shape.
Open your book to the first page.
Fish evade the net, and swim off.
Dip the pail once and let it settle.
Will you please answer that phone.
The big red apple fell to the ground.
The curtain rose and the show was on.
The young prince became heir to the throne.
He sent the boy on a short errand.
Leave now and you will arrive on time.
The corner store was robbed last night.
A gold ring will please most any girl.
The long journey home took a year.
She saw a cat in the neighbor's house.
A pink shell was found on the sandy beach.
Small children came to see him.
The grass and bushes were wet with dew.
The blind man counted his old coins.
A severe storm tore down the barn.
She called his name many times.
When you hear the bell, come quickly.

View File

@ -5,6 +5,7 @@ output: bookdown::html_chapter
---
```{r, include = FALSE}
library(dplyr)
library(readr)
```
@ -34,8 +35,8 @@ There are many ways to read flat files into R. If you've be using R for a while,
percentages, and more.
* They fail to do some annoying things like converting character vectors to
factors, and munging the column headers to make sure they're valid R
variable names.
factors, munging the column headers to make sure they're valid R
variable names, and using row names.
* They return objects with class `tbl_df`. As you saw in the dplyr chapter,
this provides a nicer printing method, so it's easier to work with large
@ -78,11 +79,6 @@ As well as reading data frame disk, readr also provides tools for working with d
* `type_convert()` applies the same parsing heuristics to the character columns
in a data frame. You can override its choices using `col_types`.
* `parse_datetime()`, `parse_factor()`, `parse_integer()`, etc. Corresponding
to each `col_XYZ()` function is a `parse_XYZ()` function that takes a
character vector and returns a parsed vector. We'll use these in examples
so you can see how a single piece works at a time.
For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to your knowledge to all the other functions in readr.
### Basics
@ -108,31 +104,69 @@ EXAMPLE
### Column types
Readr uses a heuristic to figure out the types of your columns: it reads the first 1000 rows. This is fast, and fairly robust. If readr detects the wrong type of data, you'll get warning messages:
Readr uses a heuristic to figure out the types of your columns: it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. This is fast, and fairly robust. If readr detects the wrong type of data, you'll get warning messages. Readr prints out the first five, and you can access them all with `problems()`:
EXAMPLE
You can fix these by overriding readr's guesses with the `col_type` argument.
Typically, you'll see a lot of warnings if readr has guessed the column type incorrectly. This most often occurs when the first 1000 rows are different to the rest of the data. Perhaps there are a lot of missing data there, or maybe your data is mostly numeric but a few rows have characters. Fortunately, it's easy to fix these problems using the `col_type` argument.
(Note that if you have a very large file, you might want to set `n_max` to 10,000 or 100,000. That will speed up iteration while you're finding common problems)
* `col_integer()` and `col_double()` specify integer and doubles. `col_number()`
is a more flexible parsed for numbers embedded in other strings. It will
look for the first number in a string, ignoring non-numeric prefixes and
suffixes. It will also ignoring the grouping mark specified by the locale
(see below for more details).
* `col_logical()` parses TRUE, T, FALSE and F into a logical vector.
* `col_character()` leaves strings as is. `col_factor()` allows you to load
data directly into a factor if you know what the levels are.
* `col_skip()` completely ignores a column.
Specifying the `col_type` looks like this:
* `col_date()`, `col_datetime()` and `col_time()` parse into dates, date times,
and times as described below.
```{r, eval = FALSE}
read_csv("mypath.csv", col_types = col(
x = col_integer(),
treatment = col_character()
))
```
Parsing occurs after leading and trailing whitespace has been removed (if not overridden with `trim_ws = FALSE`) and missing values listed in `na` have been removed.
You can use the following types of columns
* `col_integer()` (i) and `col_double()` (d) specify integer and doubles.
`col_logical()` (l) parses TRUE, T, FALSE and F into a logical vector.
`col_character()` (c) leaves strings as is.
* `col_number()` (n) is a more flexible parsed for numbers embedded in other
strings. It will look for the first number in a string, ignoring non-numeric
prefixes and suffixes. It will also ignoring the grouping mark specified by
the locale (see below for more details).
* `col_factor()` (f) allows you to load data directly into a factor if you know
what the levels are.
* `col_skip()` (_, -) completely ignores a column.
* `col_date()` (D), `col_datetime()` (T) and `col_time()` (t) parse into dates,
date times, and times as described below.
You might have noticed that each column parser has a one letter abbreviation, which you can instead of the full function call (assuming you're happy with the default arguments):
```{r, eval = FALSE}
read_csv("mypath.csv", col_types = cols(
x = "i",
treatment = "c"
))
```
(If you just have a few columns you supply a single string giving the type for each column: `i__dc`. See the documentation for more details. It's not as easy to understand as the `cols()` specification, so I'm not going to describe it further here.)
By default, any column not mentioned in `cols` will be guessed. If you'd rather those columns are simply not read in, use `cols_only()`. In that case, you can use `col_guess()` (?) if you want to guess the type of a column.
Each `col_XYZ()` function also has a corresponding `parse_XYZ()` that you can use on a character vector. This makes it easier to explore what each of the parsers does interactively.
```{r}
parse_integer(c("1", "2", "3"))
parse_logical(c("TRUE", "FALSE", "NA"))
parse_number(c("$1000", "20%", "3,000"))
parse_number(c("$1000", "20%", "3,000"))
```
Parsing occurs after leading and trailing whitespace has been removed (if not overridden with `trim_ws = FALSE`) and missing values listed in `na` have been removed:
```{r}
parse_logical(c("TRUE ", " ."), na = ".")
```
#### Datetimes
@ -149,7 +183,7 @@ parse_date("2010-10-01")
parse_time("20:10:01")
```
If these don't work for your data (common!) you can supply your own date time formats, built up of the following pieces:
If these defaults don't work for your data you can supply your own date time formats, built up of the following pieces:
* Year: `%Y` (4 digits). `%y` (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
@ -181,8 +215,42 @@ parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```
Then when you read in the data with `read_csv()` you can easily translate to the `col_date()` format.
### International data
The goal of readr's locales is to encapsulate the common options that vary between languages and different regions of the world. This includes:
* Names of months and days, used when parsing dates.
* The default time zones, used when parsing date times.
* The character encoding, used when reading non-ASCII strings.
* Default date and time formats, used when guessing column types.
* The decimal and grouping marks, used when reading numbers.
Readr is designed to be independent of your current locale settings. This makes a bit more hassle in the short term, but makes it much much easier to share your code with others: if your readr code works locally, it will also work for everyone else in the world. The same is not true for base R code, since it often inherits defaults from your system settings. Just because data ingest code works for you doesn't mean that it will work for someone else in another country.
The settings you are most like to need to change are:
* The names of days and months:
```{r}
locale("fr")
locale("fr", asciify = TRUE)
```
* The character encoding used in the file. If you don't know the encoding
you can use `guess_encoding()`. It's not perfect, but if you have a decent
sample of text, it's likely to be able to figure it out.
Readr converts all strings into UTF-8 as this is safest to work with across
platforms. (It's also what every stringr operation does.)
### Exercises
* Parse these dates (incl. non-English examples).
* Parse these example files.
* Parse this fixed width file.
## Databases
## Web APIs

849
strings.Rmd Normal file
View File

@ -0,0 +1,849 @@
---
layout: default
title: String manipulation
output: bookdown::html_chapter
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(stringr)
common <- rcorpora::corpora("words/common")$commonWords
fruit <- rcorpora::corpora("foods/fruits")$fruits
sentences <- readr::read_lines("harvard-sentences.txt")
```
# String manipulation
This chapter introduces you to string manipulation in R. You'll learn the basics of how strings work and how to create them by hand, but the focus of this chapter will be on regular expressions. Character variables typically unstructured or semi-structured data so you need some tools to make order from madness. Regular expressions are a very concise language for describing patterns in strings. When you first look at them, you'll think a cat walked across your keyboard, but as you learn more, you'll see how they allow you to express complex patterns very concisely. The goal of this chapter is not to teach you every detail of regular expressions. Instead we'll give you a solid foundation that allows you to solve a wide variety of problems and point you to resources where you can learn more.
This chapter will focus on the __stringr__ package. This package provides a consistent set of functions that all work the same way and are easier to learn than the base R equivalents. We'll also take a brief look at the __stringi__ package. This package is what stringr uses internally: it's more complex than stringr (and includes many many more functions). stringr includes tools to let you tackle the most common 90% of string manipulation challenges; stringi contains functions to let you tackle the last 10%.
## String basics
In R, strings are stored in a character vector. You can create strings with either single quotes or double quotes: there is no difference in behaviour. I recommend always using `"`, unless you want to create a string that contains multiple `"`, in which case use `'`.
```{r}
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
```
To include a literal single or double quote in a string you can use `\` to "escape" it:
```{r}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```
That means if you want to include a literal `\`, you'll need to double it up: `"\\"`.
Beware that the printed representation of the string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines()`:
```{r}
x <- c("\"", "\\")
x
writeLines(x)
```
There are a handful of other special characters. The most common used are `"\n"`, new line, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`, or `?"'"`. You'll also sometimes strings like `"\u00b5"`, this is a way of writing non-English characters that works on all platforms:
```{r}
x <- "\u00b5"
x
```
### String length
Base R contains many functions to work with strings but we'll generally avoid them because they're inconsistent and hard to remember. Their behaviour is particularly inconsistent when it comes to missing values. For examle, `nchar()`, which gives the length of a string, returns 2 for `NA` (instead of `NA`)
```{r}
# (Will be fixed in R 3.3.0)
nchar(NA)
```
Instead we'll use functions from stringr. These have more evocative names, and all start with `str_`:
```{r}
str_length(NA)
```
The common `str_` prefix is particularly useful if you use RStudio, because typing `str_` trigger autocomplete, so you can easily see all of the stringr functions.
### Combining strings
To combine two or more strings, use `str_c()`:
```{r}
str_c("x", "y")
str_c("x", "y", "z")
```
Use the `sep` argument to control how they're separated:
```{r}
str_c("x", "y", sep = ", ")
```
Like most other functions in R, missing values are infectious. If you want them to print as `NA`, use `str_replace_na()`:
```{r}
x <- c("abc", NA)
str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```
As shown above, `str_c()` is vectorised, automatically recycling shorter vectors to the same length as the longest:
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
Objects of length 0 are silently dropped. This is particularly useful in conjunction with `if`:
```{r}
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE
str_c("Good ", time_of_day, " ", name,
if (birthday) " and HAPPY BIRTHDAY",
"."
)
```
To collapse vectors into a single string, use `collapse`:
```{r}
str_c(c("x", "y", "z"), collapse = ", ")
```
### Subsetting strings
You can extract parts of a string using `str_sub()`. As well as the string, `str_sub()` takes `start` and `end` argument which give the (inclusive) position of the substring:
```{r}
x <- c("apple", "banana", "pear")
str_sub(x, 1, 3)
# negative numbers count backwards from end
str_sub(x, -3, -1)
```
Note that `str_sub()` won't fail if the string is too short: it will just return as much as possible:
```{r}
str_sub("a", 1, 5)
```
You can also use the assignment form of `str_sub()`, `` `str_sub<-()` ``, to modify strings:
```{r}
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
```
### Locales
Above I used`str_to_lower()` to change to lower case. You can also use `str_to_upper()` or `str_to_title()`. However, changing case is more complicated than it might at first seem because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale:
```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
The locale is specified as ISO 639 language codes, which are two or three letter abbreviations. If you don't already know the code for your language, [Wikipedia](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list. If you leave the locale blank, it will use the current locale.
Another important operation that's affected by the locale is sorting. The base R `order()` and `sort()` functions sort strings using the currect locale. If you want robust behaviour across different computers, you may want to use `str_sort()` and `str_order()` which take an additional `locale` argument:
```{r}
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en") # English
str_sort(x, locale = "haw") # Hawaiian
```
### Exercises
1. In your own words, describe the difference between the `sep` and `collapse`
arguments to `str_c()`.
1. In code that doesn't use stringr, you'll often see `paste()` and `paste0()`.
What's the difference between the two functions? What stringr function are
they equivalent to? How do the functions differ in their handling of
`NA`?
1. Use `str_length()` and `str_sub()` to extract the middle character from
a character vector.
1. What does `str_wrap()` do? When might you want to use it?
1. What does `str_trim()` do? What's the opposite of `str_trim()`?
1. Write a function that turns (e.g.) a vector `c("a", "b", "c")` into
the string `a, b, and c`. Think carefully about what it should do if
given a vector of length 0, 1, or 2.
## Matching patterns with regular expressions
Regular expressions, regexps for short, are a very terse language that allow to describe patterns in strings. They take a little while to get your head around, but once you've got it you'll find them extremely useful.
To learn regular expressions, we'll use `str_show()` and `str_show_all()`. These functions take a character vector and a regular expression, and shows you how they match. We'll start with very simple regular expressions and then gradually get more and more complicated. Once you've mastered pattern matching, you'll learn how to apply those ideas with various stringr functions.
### Basics matches
The simplest patterns match exact strings:
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "an")
```
The next step up in complexity is `.`, which matches any character (except a new line):
```{r}
str_view(x, ".a.")
```
But if "`.`" matches any character, how do you match an actual "`.`"? You need to use an "escape" to tell the regular expression you want to match it exactly, not use the special behaviour. The escape character used by regular expressions is `\`. Unfortunately, that's also the escape character used by strings, so to match a literal "`.`" you need to use `\\.`.
```{r}
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
writeLines(dot)
# And this tells R to look for explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
```
If `\` is used an escape character, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"` - you need four backslashes to match one!
```{r}
x <- "a\\b"
writeLines(x)
str_view(x, "\\\\")
```
In this book, I'll write a regular expression like `\.` and the string that represents the regular expression as `"\\."`.
#### Exercises
1. Explain why each of these strings don't match a `\`: `"\"`, `"\\"`, `"\\\"`.
1. How would you match the sequence `"'\`?
1. What patterns does will this regular expression match `"\..\..\..`?
How would you represent it as a string?
### Anchors
By default, regular expressions will match any part of a string. It's often useful to _anchor_ the regular expression so that it matches from the start or end of the string. You can use:
* `^` to match the start of the string.
* `*` to match the end of the string.
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
```
To remember which is which, try this mneomic which I learned from [Evan Misshula](https://twitter.com/emisshula/status/323863393167613953): if you begin with power (`^`), you end up with money (`$`).
To force a regular expression to only match a complete string, anchor it with both `^` and `$`.:
```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
```
You can also match the boundary between words with `\b`. I don't find I often use this in R, but I will sometimes use it when I'm doing a find all in RStudio when I want to find the name of a function that's a component of other functions. For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on.
#### Exercises
1. How would you match the literal string `"$^$"`?
1. Given this corpus of common words:
```{r}
```
Create regular expressions that find all words that:
1. Start with "y".
1. End with "x"
1. Are exactly three letters long. (Don't cheat by using `str_length()`!)
1. Have seven letters or more.
Since this list is long, you might want to use the `match` argument to
`str_view()` to show only the matching or non-matching words.
### Character classes and alternatives
There are number of other special patterns that match more than one character:
* `.`: any character apart from a new line.
* `\d`: any digit.
* `\s`: any whitespace (space, tab, newline).
* `[abc]`: match a, b, or c.
* `[!abc]`: match anything except a, b, or c.
Remember, to create a regular expression containing `\d` or `\s`, you'll need to escape the `\` for the string, so you'll type `"\\d"` or `"\\s"`.
You can use _alternation_ to pick between one or more alternative patterns. For example, `abc|d..f` will match either '"abc"', or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches either `abc` or `xyz` not `abcyz` or `abxyz`:
```{r}
str_view(c("abc", "xyz"), "abc|xyz")
```
Like with mathematical expression, if precedence ever gets confusing, use parentheses to make it clear what you want:
```{r}
str_view(c("grey", "gray"), "gr(e|a)y")
```
#### Exercises
1. Create regular expressions that find all words that:
1. Start with a vowel.
1. That only contain constants. (Hint: thinking about matching
"not"-vowels.)
1. End with `ed`, but not with `eed`.
1. End with `ing` or `ise`.
1. Write a regular expression that matches a word if it's probably written
in British English, not American English.
1. Create a regular expression that will match telephone numbers as commonly
written in your country.
### Repetition
The next step up in power involves control how many times a pattern matches:
* `?`: 0 or 1
* `+`: 1 or more
* `*`: 0 or more
* `{n}`: exactly n
* `{n,}`: n or more
* `{,m}`: at most m
* `{n,m}`: between n and m
```{r}
```
By default these matches are "greedy": they will match the longest string possible. You can make them "lazy", matching the shortest string possible by putting a `?` after them. This is an advanced feature of regular expressions, but it's useful to know that it exists:
```{r}
```
Note that the precedence of these operators are high, so you can write: `colou?r` to match either American or British spellings. That means most uses will need parentheses, like `bana(na)+` or `ba(na){2,}`.
#### Exercises
1. Describe in words what these regular expressions match:
(read carefully to see I'm using a regular expression or a string
that defines a regular expression.)
1. `^.*$`
1. `"\\{.+\\}"`
1. `\d{4}-\d{2}-\d{2}`
1. `"\\\\{4}"`
1. Create regular expressions to find all words that:
1. Have three or more vowels in a row.
1. Start with three consonants
1. Have two or more vowel-consontant pairs in a row.
### Grouping and backreferences
You learned about parentheses earlier as a way to disambiguate complex expression. They do one other special thing: they also define numeric groups that you can refer to with _backreferences_, `\1`, `\2` etc.For example, the following regular expression finds all fruits that have a pair letters that's repeated.
```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```
(You'll also see how they're useful in conjunction with `str_match()` in a few pages.)
Unfortunately `()` in regexps serve two purposes: you usually use them to disambiguate precedence, but you can also use for grouping. If you're using one set for grouping and one set for disambiguation, things can get confusing. You might want to use `(?:)` instead: it only disambiguates, and doesn't modify the grouping. They are called non-capturing parentheses.
For example:
```{r}
str_detect(c("grey", "gray"), "gr(e|a)y")
str_detect(c("grey", "gray"), "gr(?:e|a)y")
```
### Exercises
1. Describe, in words, what these expressions will match:
1. `"(.)(.)\\2\\1"`
1. `(..)\1`
1. `"(.)(.)(.).*\\3\\2\\1"`
1. Construct regular expressions to match words that:
1. Start and end with the same character.
## Tools
Now that you've learned the basics of regular expression, it's time to learn how to apply to real problems. In this section you'll learn a wide array of stringr functions that let you:
* Determine which elements match a pattern.
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
* How can you split a string into based on a match.
Because regular expressions are so powerful, it's easy to try and solve every problem with a single regular expression. But since you're in a programming language, it's often easy to break the problem down into smaller pieces. If you find yourself getting stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down in to smaller pieces, solving each challenge before moving onto the next one.
### Detect matches
To determine if a character vector matches a pattern, use `str_detect()`. It returns a logical vector the same length as the input:
```{r}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```
Remember that when you use a logical vector in a numeric context, `FALSE` becomes 0 and `TRUE` becomes 1. That makes `sum()` and `mean()` useful if you want answer questions about matches across a larger vector:
```{r}
# How many common words start with t?
sum(str_detect(common, "^t"))
# What proportion of common words end with a vowel?
mean(str_detect(common, "[aeiou]$"))
```
When you have complex logical conditions (e.g. match a or b but not c unless d) it's often easier to combine multiple `str_detect()` calls with logical operators, rather than trying to create a single regular expression. For example, here are two ways to find all words that don't contain any vowels:
```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(common, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(common, "^[^aeiou]+$")
all.equal(no_vowels_1, no_vowels_2)
```
The results are identical, but I think the first approach is significantly easier to understand. So if you find your regular expression is getting overly complicated, try breaking it up into smaller pieces, giving each piece a name, and then combining with logical operations.
A common use of `str_detect()` is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient `str_subset()` wrapper:
```{r}
common[str_detect(common, "x$")]
str_subset(common, "x$")
```
A variation on `str_detect()` is `str_count()`: rather than a simple yes or no, it tells you how many matches there are in a string:
```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")
# On average, how many vowels per word?
mean(str_count(common, "[aeiou]"))
```
Note that matches never overlap. For example, in `"abababa"`, how many times will the pattern `"aba"` match? Regular expressions say two, not three:
```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
Note the use of `str_view_all()`. As you'll shortly learn, many stringr functions come in pairs: one function works with a single match, and the other works with all matches.
### Exercises
1. For each of the following challenges, try solving it both a single
regular expression, and a combination of multiple `str_detect()` calls.
1. Find all words that start or end with `x`.
1. Find all words that start with a vowel and end with a consonant.
1. Are there any words that contain at least one of each different
vowel?
1. What word has the highest number of vowels? What word has the highest
proportion of vowels? (Hint: what is the denominator?)
### Extract matches
To extract the actual text of a match, use `str_extract()`. To show that off, we're going to need a more complicated example. I'm going to use the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences), which were designed to tested VOIP systems, but are also useful for practicing regexs.
```{r}
length(sentences)
head(sentences)
```
Imagine we want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:
```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
```
Now we can select the sentences that contain a colour, and then extract the colour to figure out which one it is:
```{r}
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
```
Note that `str_extract()` only extracts the first match. We can see that most easily by first selecting all the sentences that have more than 1 match:
```{r}
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
str_extract(more, colour_match)
```
This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use `str_extract_all()`. It returns either a list or a matrix, based on the value of the `simplify` argument:
```{r}
str_extract_all(more, colour_match)
str_extract_all(more, colour_match, simplify = TRUE)
```
You'll learn more about working with lists in Chapter XYZ. If you use `simplify = TRUE`, note that short matches are expanded to the same length as the longest:
```{r}
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```
#### Exercises
1. In the previous example, you might have noticed that the regular
expression matched "fickered", which is not a colour. Modify the
regex to fix the problem.
1. From the Harvard sentences data, extract:
1. The first word from each sentence.
1. All words ending in `ing`.
1. All plurals.
### Grouped matches
Earlier in this chapter we talked about the use of parentheses for clarifying precedence and to use with backreferences when matching. You can also parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky. Here I use a sequence of at least one character that isn't a space.
```{r}
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
str_extract(has_noun, noun)
```
`str_extract()` gives us the complete match; `str_match()` gives each individual component. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:
```{r}
str_match(has_noun, noun)
```
(Unsurprisingly, our heuristic for detecting nouns is poor, and also picks up adjectives like smooth and parked.)
```{r}
num <- str_c("one", "two", "three", "four", "five", "six",
"seven", "eight", "nine", "ten", sep = "|")
match <- str_interp("(${num}) ([^ ]+s)\\b")
sentences %>%
str_subset(match) %>%
head(10) %>%
str_match(match)
```
Like `str_extract()`, if you want all matches for each string, you'll need `str_match_all()`.
#### Exercises
### Replacing matches
`str_replace()` and `str_replace_all()` allow you to replace matches with new strings:
```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```
With `str_replace_all()` you can also perform multiple replacements by supplying a named vector:
```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
You can refer to groups with backreferences:
```{r}
sentences %>%
head(5) %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
```
<!-- Replacing with a function call (hopefully) -->
#### Exercises
1. Replace all `/` in a string with `\`.
### Splitting
Use `str_split()` to split a string up into pieces. For example, we could split sentences into words:
```{r}
sentences %>%
head(5) %>%
str_split(" ")
```
Because each component might contain a different number of pieces, this returns a list. If you're working with a length-1 vector, the easiest thing is to just extra the first element of the list:
```{r}
"a|b|c|d" %>%
str_split("\\|") %>%
.[[1]]
```
Otherwise, like the other stringr functions that return a list, you can use `simplify = TRUE` to return a matrix:
```{r}
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
```
You can also request a maximum number of pieces;
```{r}
fields <- c("Name: Hadley", "County: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```
Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word `boundary()`s:
```{r}
x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```
#### Exercises
1. Split up a string like `"apples, pears, and bananas"` into individual
components.
1. Why is it's better to split up by `boundary("word")` than `" "`?
1. What does splitting with an empty string (`""`) do?
### Find matches
`str_locate()`, `str_locate_all()` gives you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use `str_locate()` to find the matching pattern, `str_sub()` to extract and/or modify them.
## Other types of pattern
When you use a pattern that's a string, it's automatically wrapped into a call to `regex()`:
```{r, eval = FALSE}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```
You can use the other arguments of `regex()` to control details of the match:
* `ignore_case = TRUE` allows characters to match either their uppercase or
lowercase forms. This always uses the current locale.
```{r}
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
```
* `multiline = TRUE` allows `^` and `$` to match the start and end of each
line rather than the start and end of the complete string.
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_view_all(x, "^Line")
str_view_all(x, regex("^Line", multiline = TRUE))
```
* `comments = TRUE` allows you to use comments and white space to make
complex regular expressions more understand. Space are ignored, as is
everything after `#`. To match a literal space, you'll need to escape it:
`"\\ "`.
* `dotall = TRUE` allows `.` to match everything, including `\n`.
There are three other functions you can use instead of `regex()`:
* `fixed()`: matches exactly the specified sequence of bytes. It ignores
all special regular expressions and operates at a very low level.
This allows you to avoid complex escaping can be much faster than
regular expressions:
```{r}
microbenchmark::microbenchmark(
fixed = str_detect(sentences, fixed("the")),
regex = str_detect(sentences, "the")
)
```
Here the fixed match is almost 3x times faster than the regular
expression match. However, if you're working with non-English data
`fixed()` can lead to unreliable matches because there are often
multiple ways of representing the same character. For example, there
are two ways to define "á": either as a single character or as an "a"
plus an accent:
```{r}
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2
```
They render identically, but because they're defined differently,
`fixed()` does find a match. Instead, you can use `coll()`, defined
next to respect human character comparison rules:
```{r}
str_detect(a1, fixed(a2))
str_detect(a1, coll(a2))
```
* `coll()`: compare strings using standard **coll**ation rules. This is
useful for doing case insensitive matching. Note that `coll()` takes a
`locale` parameter that controls which rules are used for comparing
characters. Unfortunately different parts of the world use different rules!
```{r}
# That means you also need to be aware of the difference
# when doing case insensitive matches:
i <- c("I", "İ", "i", "ı")
i
str_subset(i, coll("i", TRUE))
str_subset(i, coll("i", TRUE, locale = "tr"))
```
Both `fixed()` and `regex()` have `ignore_case` arguments, but they
do not allow you to pick the locale: they always use the default locale.
You can see what that is with the following code; more on stringi
later.
```{r}
stringi::stri_locale_info()
```
The downside of `coll()` is because the rules for recognising which
characters are the same are complicated, `coll()` is relatively slow
compared to `regex()` and `fixed()`.
* As you saw with `str_split()` you can use `boundary()` to match boundaries.
You can also use it with the other functions, all though
```{r}
x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```
### Exercises
1. How would you find all strings containing `\` with `regex()` vs.
with `fixed()`?
1. What are the five most common words in `sentences`?
## Other uses of regular expressions
There are a few other functions in base R that accept regular expressions:
* `apropos()` searchs all objects avaiable from the global environment. This
is useful if you can't quite remember the name of the function.
```{r}
apropos("replace")
```
* `dir()` lists all the files in a directory. The `pattern` argument takes
a regular expression and only return file names that match the pattern.
For example, you can find all the rmarkdown files in the current
directory with:
```{r}
head(dir(pattern = "\\.Rmd$"))
```
(If you're more comfortable with "globs" like `*.Rmd`, you can convert
them to regular expressions with `glob2rx()`):
* `ls()` is similar to `apropos()` but only works in the current
environment. However, if you have so many objects in your environment
that you have to use a regular expression to filter them all, you
need to think about what you're doing! (And probably use a list instead).
## Advanced topics
### The stringi package
stringr is built on top of the __stringi__ package. stringr is useful when you're learning because it exposes a minimal set of functions, that have been carefully picked to handle the most common string manipulation functions. stringi on the other hand is designed to be comprehensive. It contains almost every function you might ever need. stringi has `length(ls("package:stringi"))` functions to stringr's `length(ls("package:stringr"))`.
So if you find yourself struggling to do something that doesn't seem natural in stringr, it's worth taking a look at stringi. The use of the two packages are very similar because stringi was designed to mimic stringi's interface. The main difference is the prefix: `str_` vs `stri_`.
### Encoding
Complicated and fraught with difficulty. Best approach is to convert to UTF-8 as soon as possible. All stringr and stringi functions do this. Readr always reads as UTF-8.
* UTF-8
* Latin1
* bytes: everything else
Generally, you should fix encoding problems during the data import phase.
Detect encoding operates statistically, by comparing frequency of byte fragments across languages and encodings. Fundamentally heuristic and works better with larger amounts of text (i.e. a whole file, not a single string from that file).
```{r}
x <- "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu."
x
str_conv(x, "ISO-8859-1")
as.data.frame(stringi::stri_enc_detect(x))
str_conv(x, "ISO-8859-2")
```
### UTF-8
<http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes>
<http://www.joelonsoftware.com/articles/Unicode.html>
Homoglyph attack, https://github.com/reinderien/mimic.

4
www/.gitignore vendored
View File

@ -1,3 +1 @@
bootstrap-2.3.2/
highlight/
jquery-1.11.0/
bootstrap-2.3.2

View File

@ -0,0 +1,625 @@
(function() {
// If window.HTMLWidgets is already defined, then use it; otherwise create a
// new object. This allows preceding code to set options that affect the
// initialization process (though none currently exist).
window.HTMLWidgets = window.HTMLWidgets || {};
// See if we're running in a viewer pane. If not, we're in a web browser.
var viewerMode = window.HTMLWidgets.viewerMode =
/\bviewer_pane=1\b/.test(window.location);
// See if we're running in Shiny mode. If not, it's a static document.
// Note that static widgets can appear in both Shiny and static modes, but
// obviously, Shiny widgets can only appear in Shiny apps/documents.
var shinyMode = window.HTMLWidgets.shinyMode =
typeof(window.Shiny) !== "undefined" && !!window.Shiny.outputBindings;
// We can't count on jQuery being available, so we implement our own
// version if necessary.
function querySelectorAll(scope, selector) {
if (typeof(jQuery) !== "undefined" && scope instanceof jQuery) {
return scope.find(selector);
}
if (scope.querySelectorAll) {
return scope.querySelectorAll(selector);
}
}
function asArray(value) {
if (value === null)
return [];
if ($.isArray(value))
return value;
return [value];
}
// Implement jQuery's extend
function extend(target /*, ... */) {
if (arguments.length == 1) {
return target;
}
for (var i = 1; i < arguments.length; i++) {
var source = arguments[i];
for (var prop in source) {
if (source.hasOwnProperty(prop)) {
target[prop] = source[prop];
}
}
}
return target;
}
// IE8 doesn't support Array.forEach.
function forEach(values, callback, thisArg) {
if (values.forEach) {
values.forEach(callback, thisArg);
} else {
for (var i = 0; i < values.length; i++) {
callback.call(thisArg, values[i], i, values);
}
}
}
// Replaces the specified method with the return value of funcSource.
//
// Note that funcSource should not BE the new method, it should be a function
// that RETURNS the new method. funcSource receives a single argument that is
// the overridden method, it can be called from the new method. The overridden
// method can be called like a regular function, it has the target permanently
// bound to it so "this" will work correctly.
function overrideMethod(target, methodName, funcSource) {
var superFunc = target[methodName] || function() {};
var superFuncBound = function() {
return superFunc.apply(target, arguments);
};
target[methodName] = funcSource(superFuncBound);
}
// Implement a vague facsimilie of jQuery's data method
function elementData(el, name, value) {
if (arguments.length == 2) {
return el["htmlwidget_data_" + name];
} else if (arguments.length == 3) {
el["htmlwidget_data_" + name] = value;
return el;
} else {
throw new Error("Wrong number of arguments for elementData: " +
arguments.length);
}
}
// http://stackoverflow.com/questions/3446170/escape-string-for-use-in-javascript-regex
function escapeRegExp(str) {
return str.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&");
}
function hasClass(el, className) {
var re = new RegExp("\\b" + escapeRegExp(className) + "\\b");
return re.test(el.className);
}
// elements - array (or array-like object) of HTML elements
// className - class name to test for
// include - if true, only return elements with given className;
// if false, only return elements *without* given className
function filterByClass(elements, className, include) {
var results = [];
for (var i = 0; i < elements.length; i++) {
if (hasClass(elements[i], className) == include)
results.push(elements[i]);
}
return results;
}
function on(obj, eventName, func) {
if (obj.addEventListener) {
obj.addEventListener(eventName, func, false);
} else if (obj.attachEvent) {
obj.attachEvent(eventName, func);
}
}
function off(obj, eventName, func) {
if (obj.removeEventListener)
obj.removeEventListener(eventName, func, false);
else if (obj.detachEvent) {
obj.detachEvent(eventName, func);
}
}
// Translate array of values to top/right/bottom/left, as usual with
// the "padding" CSS property
// https://developer.mozilla.org/en-US/docs/Web/CSS/padding
function unpackPadding(value) {
if (typeof(value) === "number")
value = [value];
if (value.length === 1) {
return {top: value[0], right: value[0], bottom: value[0], left: value[0]};
}
if (value.length === 2) {
return {top: value[0], right: value[1], bottom: value[0], left: value[1]};
}
if (value.length === 3) {
return {top: value[0], right: value[1], bottom: value[2], left: value[1]};
}
if (value.length === 4) {
return {top: value[0], right: value[1], bottom: value[2], left: value[3]};
}
}
// Convert an unpacked padding object to a CSS value
function paddingToCss(paddingObj) {
return paddingObj.top + "px " + paddingObj.right + "px " + paddingObj.bottom + "px " + paddingObj.left + "px";
}
// Makes a number suitable for CSS
function px(x) {
if (typeof(x) === "number")
return x + "px";
else
return x;
}
// Retrieves runtime widget sizing information for an element.
// The return value is either null, or an object with fill, padding,
// defaultWidth, defaultHeight fields.
function sizingPolicy(el) {
var sizingEl = document.querySelector("script[data-for='" + el.id + "'][type='application/htmlwidget-sizing']");
if (!sizingEl)
return null;
var sp = JSON.parse(sizingEl.textContent || sizingEl.text || "{}");
if (viewerMode) {
return sp.viewer;
} else {
return sp.browser;
}
}
function initSizing(el) {
var sizing = sizingPolicy(el);
if (!sizing)
return;
var cel = document.getElementById("htmlwidget_container");
if (!cel)
return;
if (typeof(sizing.padding) !== "undefined") {
document.body.style.margin = "0";
document.body.style.padding = paddingToCss(unpackPadding(sizing.padding));
}
if (sizing.fill) {
document.body.style.overflow = "hidden";
document.body.style.width = "100%";
document.body.style.height = "100%";
document.documentElement.style.width = "100%";
document.documentElement.style.height = "100%";
if (cel) {
cel.style.position = "absolute";
var pad = unpackPadding(sizing.padding);
cel.style.top = pad.top + "px";
cel.style.right = pad.right + "px";
cel.style.bottom = pad.bottom + "px";
cel.style.left = pad.left + "px";
el.style.width = "100%";
el.style.height = "100%";
}
return {
getWidth: function() { return cel.offsetWidth; },
getHeight: function() { return cel.offsetHeight; }
};
} else {
el.style.width = px(sizing.width);
el.style.height = px(sizing.height);
return {
getWidth: function() { return el.offsetWidth; },
getHeight: function() { return el.offsetHeight; }
};
}
}
// Default implementations for methods
var defaults = {
find: function(scope) {
return querySelectorAll(scope, "." + this.name);
},
renderError: function(el, err) {
var $el = $(el);
this.clearError(el);
// Add all these error classes, as Shiny does
var errClass = "shiny-output-error";
if (err.type !== null) {
// use the classes of the error condition as CSS class names
errClass = errClass + " " + $.map(asArray(err.type), function(type) {
return errClass + "-" + type;
}).join(" ");
}
errClass = errClass + " htmlwidgets-error";
// Is el inline or block? If inline or inline-block, just display:none it
// and add an inline error.
var display = $el.css("display");
$el.data("restore-display-mode", display);
if (display === "inline" || display === "inline-block") {
$el.hide();
if (err.message !== "") {
var errorSpan = $("<span>").addClass(errClass);
errorSpan.text(err.message);
$el.after(errorSpan);
}
} else if (display === "block") {
// If block, add an error just after the el, set visibility:none on the
// el, and position the error to be on top of the el.
// Mark it with a unique ID and CSS class so we can remove it later.
$el.css("visibility", "hidden");
if (err.message !== "") {
var errorDiv = $("<div>").addClass(errClass).css("position", "absolute")
.css("top", el.offsetTop)
.css("left", el.offsetLeft)
// setting width can push out the page size, forcing otherwise
// unnecessary scrollbars to appear and making it impossible for
// the element to shrink; so use max-width instead
.css("maxWidth", el.offsetWidth)
.css("height", el.offsetHeight);
errorDiv.text(err.message);
$el.after(errorDiv);
// Really dumb way to keep the size/position of the error in sync with
// the parent element as the window is resized or whatever.
var intId = setInterval(function() {
if (!errorDiv[0].parentElement) {
clearInterval(intId);
return;
}
errorDiv
.css("top", el.offsetTop)
.css("left", el.offsetLeft)
.css("maxWidth", el.offsetWidth)
.css("height", el.offsetHeight);
}, 500);
}
}
},
clearError: function(el) {
var $el = $(el);
var display = $el.data("restore-display-mode");
$el.data("restore-display-mode", null);
if (display === "inline" || display === "inline-block") {
if (display)
$el.css("display", display);
$(el.nextSibling).filter(".htmlwidgets-error").remove();
} else if (display === "block"){
$el.css("visibility", "inherit");
$(el.nextSibling).filter(".htmlwidgets-error").remove();
}
},
sizing: {}
};
// Called by widget bindings to register a new type of widget. The definition
// object can contain the following properties:
// - name (required) - A string indicating the binding name, which will be
// used by default as the CSS classname to look for.
// - initialize (optional) - A function(el) that will be called once per
// widget element; if a value is returned, it will be passed as the third
// value to renderValue.
// - renderValue (required) - A function(el, data, initValue) that will be
// called with data. Static contexts will cause this to be called once per
// element; Shiny apps will cause this to be called multiple times per
// element, as the data changes.
window.HTMLWidgets.widget = function(definition) {
if (!definition.name) {
throw new Error("Widget must have a name");
}
if (!definition.type) {
throw new Error("Widget must have a type");
}
// Currently we only support output widgets
if (definition.type !== "output") {
throw new Error("Unrecognized widget type '" + definition.type + "'");
}
// TODO: Verify that .name is a valid CSS classname
if (!definition.renderValue) {
throw new Error("Widget must have a renderValue function");
}
// For static rendering (non-Shiny), use a simple widget registration
// scheme. We also use this scheme for Shiny apps/documents that also
// contain static widgets.
window.HTMLWidgets.widgets = window.HTMLWidgets.widgets || [];
// Merge defaults into the definition; don't mutate the original definition.
var staticBinding = extend({}, defaults, definition);
overrideMethod(staticBinding, "find", function(superfunc) {
return function(scope) {
var results = superfunc(scope);
// Filter out Shiny outputs, we only want the static kind
return filterByClass(results, "html-widget-output", false);
};
});
window.HTMLWidgets.widgets.push(staticBinding);
if (shinyMode) {
// Shiny is running. Register the definition as an output binding.
// Merge defaults into the definition; don't mutate the original definition.
// The base object is a Shiny output binding if we're running in Shiny mode,
// or an empty object if we're not.
var shinyBinding = extend(new Shiny.OutputBinding(), defaults, definition);
// Wrap renderValue to handle initialization, which unfortunately isn't
// supported natively by Shiny at the time of this writing.
// NB: shinyBinding.initialize may be undefined, as it's optional.
// Rename initialize to make sure it isn't called by a future version
// of Shiny that does support initialize directly.
shinyBinding._htmlwidgets_initialize = shinyBinding.initialize;
delete shinyBinding.initialize;
overrideMethod(shinyBinding, "find", function(superfunc) {
return function(scope) {
var results = superfunc(scope);
// Only return elements that are Shiny outputs, not static ones
var dynamicResults = results.filter(".html-widget-output");
// It's possible that whatever caused Shiny to think there might be
// new dynamic outputs, also caused there to be new static outputs.
// Since there might be lots of different htmlwidgets bindings, we
// schedule execution for later--no need to staticRender multiple
// times.
if (results.length !== dynamicResults.length)
scheduleStaticRender();
return dynamicResults;
};
});
overrideMethod(shinyBinding, "renderValue", function(superfunc) {
return function(el, data) {
// Resolve strings marked as javascript literals to objects
if (!(data.evals instanceof Array)) data.evals = [data.evals];
for (var i = 0; data.evals && i < data.evals.length; i++) {
window.HTMLWidgets.evaluateStringMember(data.x, data.evals[i]);
}
if (!this.renderOnNullValue) {
if (data.x === null) {
el.style.visibility = "hidden";
return;
} else {
el.style.visibility = "inherit";
}
}
if (!elementData(el, "initialized")) {
initSizing(el);
elementData(el, "initialized", true);
if (this._htmlwidgets_initialize) {
var result = this._htmlwidgets_initialize(el, el.offsetWidth,
el.offsetHeight);
elementData(el, "init_result", result);
}
}
Shiny.renderDependencies(data.deps);
superfunc(el, data.x, elementData(el, "init_result"));
};
});
overrideMethod(shinyBinding, "resize", function(superfunc) {
return function(el, width, height) {
// Shiny can call resize before initialize/renderValue have been
// called, which doesn't make sense for widgets.
if (elementData(el, "initialized")) {
superfunc(el, width, height, elementData(el, "init_result"));
}
};
});
Shiny.outputBindings.register(shinyBinding, shinyBinding.name);
}
};
var scheduleStaticRenderTimerId = null;
function scheduleStaticRender() {
if (!scheduleStaticRenderTimerId) {
scheduleStaticRenderTimerId = setTimeout(function() {
scheduleStaticRenderTimerId = null;
window.HTMLWidgets.staticRender();
}, 1);
}
}
// Render static widgets after the document finishes loading
// Statically render all elements that are of this widget's class
window.HTMLWidgets.staticRender = function() {
var bindings = window.HTMLWidgets.widgets || [];
forEach(bindings, function(binding) {
var matches = binding.find(document.documentElement);
forEach(matches, function(el) {
var sizeObj = initSizing(el, binding);
if (hasClass(el, "html-widget-static-bound"))
return;
el.className = el.className + " html-widget-static-bound";
var initResult;
if (binding.initialize) {
initResult = binding.initialize(el,
sizeObj ? sizeObj.getWidth() : el.offsetWidth,
sizeObj ? sizeObj.getHeight() : el.offsetHeight
);
}
if (binding.resize) {
var lastSize = {};
var resizeHandler = function(e) {
var size = {
w: sizeObj ? sizeObj.getWidth() : el.offsetWidth,
h: sizeObj ? sizeObj.getHeight() : el.offsetHeight
};
if (size.w === 0 && size.h === 0)
return;
if (size.w === lastSize.w && size.h === lastSize.h)
return;
lastSize = size;
binding.resize(el, size.w, size.h, initResult);
};
on(window, "resize", resizeHandler);
// This is needed for cases where we're running in a Shiny
// app, but the widget itself is not a Shiny output, but
// rather a simple static widget. One example of this is
// an rmarkdown document that has runtime:shiny and widget
// that isn't in a render function. Shiny only knows to
// call resize handlers for Shiny outputs, not for static
// widgets, so we do it ourselves.
if (window.jQuery) {
window.jQuery(document).on("shown", resizeHandler);
window.jQuery(document).on("hidden", resizeHandler);
}
// This is needed for the specific case of ioslides, which
// flips slides between display:none and display:block.
// Ideally we would not have to have ioslide-specific code
// here, but rather have ioslides raise a generic event,
// but the rmarkdown package just went to CRAN so the
// window to getting that fixed may be long.
if (window.addEventListener) {
// It's OK to limit this to window.addEventListener
// browsers because ioslides itself only supports
// such browsers.
on(document, "slideenter", resizeHandler);
on(document, "slideleave", resizeHandler);
}
}
var scriptData = document.querySelector("script[data-for='" + el.id + "'][type='application/json']");
if (scriptData) {
var data = JSON.parse(scriptData.textContent || scriptData.text);
// Resolve strings marked as javascript literals to objects
if (!(data.evals instanceof Array)) data.evals = [data.evals];
for (var k = 0; data.evals && k < data.evals.length; k++) {
window.HTMLWidgets.evaluateStringMember(data.x, data.evals[k]);
}
binding.renderValue(el, data.x, initResult);
}
});
});
}
// Wait until after the document has loaded to render the widgets.
if (document.addEventListener) {
document.addEventListener("DOMContentLoaded", function() {
document.removeEventListener("DOMContentLoaded", arguments.callee, false);
window.HTMLWidgets.staticRender();
}, false);
} else if (document.attachEvent) {
document.attachEvent("onreadystatechange", function() {
if (document.readyState === "complete") {
document.detachEvent("onreadystatechange", arguments.callee);
window.HTMLWidgets.staticRender();
}
});
}
window.HTMLWidgets.getAttachmentUrl = function(depname, key) {
// If no key, default to the first item
if (typeof(key) === "undefined")
key = 1;
var link = document.getElementById(depname + "-" + key + "-attachment");
if (!link) {
throw new Error("Attachment " + depname + "/" + key + " not found in document");
}
return link.getAttribute("href");
};
window.HTMLWidgets.dataframeToD3 = function(df) {
var names = [];
var length;
for (var name in df) {
if (df.hasOwnProperty(name))
names.push(name);
if (typeof(df[name]) !== "object" || typeof(df[name].length) === "undefined") {
throw new Error("All fields must be arrays");
} else if (typeof(length) !== "undefined" && length !== df[name].length) {
throw new Error("All fields must be arrays of the same length");
}
length = df[name].length;
}
var results = [];
var item;
for (var row = 0; row < length; row++) {
item = {};
for (var col = 0; col < names.length; col++) {
item[names[col]] = df[names[col]][row];
}
results.push(item);
}
return results;
};
window.HTMLWidgets.transposeArray2D = function(array) {
var newArray = array[0].map(function(col, i) {
return array.map(function(row) {
return row[i]
})
});
return newArray;
};
// Split value at splitChar, but allow splitChar to be escaped
// using escapeChar. Any other characters escaped by escapeChar
// will be included as usual (including escapeChar itself).
function splitWithEscape(value, splitChar, escapeChar) {
var results = [];
var escapeMode = false;
var currentResult = "";
for (var pos = 0; pos < value.length; pos++) {
if (!escapeMode) {
if (value[pos] === splitChar) {
results.push(currentResult);
currentResult = "";
} else if (value[pos] === escapeChar) {
escapeMode = true;
} else {
currentResult += value[pos];
}
} else {
currentResult += value[pos];
escapeMode = false;
}
}
if (currentResult !== "") {
results.push(currentResult);
}
return results;
}
// Function authored by Yihui/JJ Allaire
window.HTMLWidgets.evaluateStringMember = function(o, member) {
var parts = splitWithEscape(member, '.', '\\');
for (var i = 0, l = parts.length; i < l; i++) {
var part = parts[i];
// part may be a character or 'numeric' member name
if (o !== null && typeof o === "object" && part in o) {
if (i == (l - 1)) { // if we are at the end of the line then evalulate
if (typeof o[part] === "string")
o[part] = eval("(" + o[part] + ")");
} else { // otherwise continue to next embedded object
o = o[part];
}
}
}
};
})();

4
www/jquery-1.11.0/jquery.min.js vendored Normal file

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,11 @@
.str_view ul, .str_view li {
list-style: none;
padding: 0;
margin: 0.5em 0;
font-family: monospace;
}
.str_view .match {
border: 1px solid #ccc;
background-color: #eee;
}

View File

@ -0,0 +1,17 @@
HTMLWidgets.widget({
name: 'str_view',
type: 'output',
initialize: function(el, width, height) {
},
renderValue: function(el, x, instance) {
el.innerHTML = x.html;
},
resize: function(el, width, height, instance) {
}
});