diff --git a/EDA.qmd b/EDA.qmd
index b0fb536..83380a2 100644
--- a/EDA.qmd
+++ b/EDA.qmd
@@ -91,7 +91,7 @@ This is true even if you measure quantities that are constant, like the speed of
 Each of your measurements will include a small amount of error that varies from measurement to measurement.
 Variables can also vary if you measure across different subjects (e.g. the eye colors of different people) or different times (e.g. the energy levels of an electron at different moments).
 Every variable has its own pattern of variation, which can reveal interesting information about how that variable varies between measurements on the same observation as well as across observations.
-The best way to understand that pattern is to visualize the distribution of the variable's values, which you've learned about in @sec-data-visualisation.
+The best way to understand that pattern is to visualize the distribution of the variable's values, which you've learned about in @sec-data-visualization.

 We'll start our exploration by visualizing the distribution of weights (`carat`) of \~54,000 diamonds from the `diamonds` dataset.
 Since `carat` is a numerical variable, we can use a histogram:
@@ -492,7 +492,7 @@ ggplot(mpg,
    What do you learn?
    How do you interpret the plots?

-5. Compare and contrast `geom_violin()` with a faceted `geom_histogram()`, or a coloured `geom_freqpoly()`.
+5. Compare and contrast `geom_violin()` with a faceted `geom_histogram()`, or a colored `geom_freqpoly()`.
    What are the pros and cons of each method?

 6. If you have a small dataset, it's sometimes useful to use `geom_jitter()` to see the relationship between a continuous and categorical variable.
diff --git a/data-visualize.qmd b/data-visualize.qmd
index 932a1f6..cfc0a78 100644
--- a/data-visualize.qmd
+++ b/data-visualize.qmd
@@ -1,4 +1,4 @@
-# Data visualization {#sec-data-visualisation}
+# Data visualization {#sec-data-visualization}

 ```{r}
 #| results: "asis"
@@ -844,7 +844,7 @@ Another great tool is Google: try googling the error message, as it's likely som

 ## Summary

 In this chapter, you've learned the basics of data visualization with ggplot2.
-We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, colour, size and shape.
+We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, color, size and shape.
 You then learned about increasing the complexity and improving the presentation of your plots layer-by-layer.
 You also learned about commonly used plots for visualizing the distribution of a single variable as well as for visualizing relationships between two or more variables, by leveraging additional aesthetic mappings and/or splitting your plot into small multiples using faceting.
diff --git a/factors.qmd b/factors.qmd
index 3980521..67b42a7 100644
--- a/factors.qmd
+++ b/factors.qmd
@@ -278,7 +278,7 @@ This makes the plot easier to read because the colors of the line at the far rig
 #| unrelated to the lines on the plot.
 #|
 #| Rearranging the legend makes the plot easier to read because the
-#| legend colours now match the order of the lines on the far right
+#| legend colors now match the order of the lines on the far right
 #| of the plot. You can see some unsurprising patterns: the proportion
 #| never married decreases with age, married forms an upside down U
 #| shape, and widowed starts off low but increases steeply after age
@@ -291,12 +291,12 @@ by_age <- gss_cat |>
     prop = n / sum(n)
   )

-ggplot(by_age, aes(age, prop, colour = marital)) +
+ggplot(by_age, aes(age, prop, color = marital)) +
   geom_line(na.rm = TRUE)

-ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
+ggplot(by_age, aes(age, prop, color = fct_reorder2(marital, age, prop))) +
   geom_line() +
-  labs(colour = "marital")
+  labs(color = "marital")
 ```

 Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables.
diff --git a/functions.qmd b/functions.qmd
index da64ebc..aeeed35 100644
--- a/functions.qmd
+++ b/functions.qmd
@@ -421,7 +421,7 @@ This is a problem of indirection, and it arises because dplyr uses **tidy evalua
 Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.

 The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
-Here we need some way to tell `group_mean()` and `summarise()` not to treat `group_var` and `mean_var` as the name of the variables, but instead look inside them for the variable we actually want to use.
+Here we need some way to tell `group_mean()` and `summarize()` not to treat `group_var` and `mean_var` as the name of the variables, but instead look inside them for the variable we actually want to use.

 Tidy evaluation includes a solution to this problem called **embracing** 🤗.
 Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
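For concreteness, here is a minimal sketch of the embraced `group_mean()` that this hunk's prose describes, assuming dplyr is attached; the `mtcars` call is purely illustrative:

```{r}
library(dplyr)

# Embracing with {{ }} tells group_by() and summarize() to look inside
# group_var and mean_var for the variables the caller supplied, rather
# than treating them as variable names themselves.
group_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean = mean({{ mean_var }}))
}

# Illustrative call: average mpg by cylinder count
mtcars |> group_mean(cyl, mpg)
```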
@@ -712,7 +712,7 @@ hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
   df |>
     ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
     stat_summary_hex(
-      aes(colour = after_scale(fill)), # make border same colour as fill
+      aes(color = after_scale(fill)), # make border same color as fill
       bins = bins,
       fun = fun,
     )
@@ -808,9 +808,9 @@ For example, the following function makes it particularly easy to interactively

 ```{r}
 # https://twitter.com/yutannihilat_en/status/1574387230025875457
-density <- function(colour, facets, binwidth = 0.1) {
+density <- function(color, facets, binwidth = 0.1) {
   diamonds |>
-    ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
+    ggplot(aes(carat, after_stat(density), color = {{ color }})) +
     geom_freqpoly(binwidth = binwidth) +
     facet_wrap(vars({{ facets }}))
 }
@@ -896,17 +896,17 @@ This makes it easier to see the hierarchy in your code by skimming the left-hand

 ```{r}
 # missing extra two spaces
-density <- function(colour, facets, binwidth = 0.1) {
+density <- function(color, facets, binwidth = 0.1) {
   diamonds |>
-  ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
+  ggplot(aes(carat, after_stat(density), color = {{ color }})) +
   geom_freqpoly(binwidth = binwidth) +
   facet_wrap(vars({{ facets }}))
 }

 # Pipe indented incorrectly
-density <- function(colour, facets, binwidth = 0.1) {
+density <- function(color, facets, binwidth = 0.1) {
   diamonds |>
-    ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
+    ggplot(aes(carat, after_stat(density), color = {{ color }})) +
       geom_freqpoly(binwidth = binwidth) +
       facet_wrap(vars({{ facets }}))
 }
diff --git a/intro.qmd b/intro.qmd
index 423a680..9389f10 100644
--- a/intro.qmd
+++ b/intro.qmd
@@ -46,16 +46,16 @@ Once you have tidy data, a common next step is to **transform** it.
 Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
 Together, tidying and transforming are called **wrangling**, because getting your data in a form that's natural to work with often feels like a fight!

-Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling.
+Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualization and modeling.
 These have complementary strengths and weaknesses so any real analysis will iterate between them many times.

-**Visualisation** is a fundamentally human activity.
-A good visualisation will show you things that you did not expect, or raise new questions about the data.
-A good visualisation might also hint that you're asking the wrong question, or that you need to collect different data.
-Visualisations can surprise you and they don't scale particularly well because they require a human to interpret them.
+**Visualization** is a fundamentally human activity.
+A good visualization will show you things that you did not expect, or raise new questions about the data.
+A good visualization might also hint that you're asking the wrong question, or that you need to collect different data.
+Visualizations can surprise you and they don't scale particularly well because they require a human to interpret them.

 The last step of data science is **communication**, an absolutely critical part of any data analysis project.
-It doesn't matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
+It doesn't matter how well your models and visualization have led you to understand the data unless you can also communicate your results to others.

 Surrounding all these tools is **programming**.
 Programming is a cross-cutting tool that you use in nearly every part of a data science project.
@@ -70,7 +70,7 @@ Throughout this book, we'll point you to resources where you can learn more.
 The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times).
 In our experience, however, learning data ingest and tidying first is sub-optimal, because 80% of the time it's routine and boring, and the other 20% of the time it's weird and frustrating.
 That's a bad place to start learning a new subject!
-Instead, we'll start with visualisation and transformation of data that's already been imported and tidied.
+Instead, we'll start with visualization and transformation of data that's already been imported and tidied.
 That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth the effort.

 Within each chapter, we try and adhere to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
@@ -358,4 +358,3 @@ knitr::kable(df, format = "markdown")
 ```{r}
 cli:::ruler()
 ```
-
diff --git a/joins.qmd b/joins.qmd
index 509b949..daeb4ca 100644
--- a/joins.qmd
+++ b/joins.qmd
@@ -90,7 +90,7 @@ These relationships are summarized visually in @fig-flights-relationships.
 #| out-width: ~
 #| fig-cap: >
 #|   Connections between all five data frames in the nycflights13 package.
-#|   Variables making up a primary key are coloured grey, and are connected
+#|   Variables making up a primary key are colored grey, and are connected
 #|   to their corresponding foreign keys with arrows.
 #| fig-alt: >
 #|   The relationships between airports, planes, flights, weather, and
@@ -379,7 +379,7 @@ flights2 |>
     coord_quickmap()
 ```

-   You might want to use the `size` or `colour` of the points to display the average delay for each airport.
+   You might want to use the `size` or `color` of the points to display the average delay for each airport.

 8. What happened on June 13, 2013?
    Draw a map of the delays, and then use Google to cross-reference with the weather.
@@ -396,7 +396,7 @@ flights2 |>
   inner_join(airports, by = c("dest" = "faa")) |>
   ggplot(aes(lon, lat)) +
   borders("state") +
-  geom_point(aes(size = n, colour = delay)) +
+  geom_point(aes(size = n, color = delay)) +
   coord_quickmap()
 ```

@@ -426,12 +426,12 @@ y <- tribble(
 #| echo: false
 #| out-width: ~
 #| fig-cap: >
-#|   Graphical representation of two simple tables. The coloured `key`
-#|   columns map background colour to key value. The grey columns represent
+#|   Graphical representation of two simple tables. The colored `key`
+#|   columns map background color to key value. The grey columns represent
 #|   the "value" columns that are carried along for the ride.
 #| fig-alt: >
 #|   x and y are two data frames with 2 columns and 3 rows, with contents
-#|   as described in the text. The values of the keys are coloured:
+#|   as described in the text. The values of the keys are colored:
 #|   1 is green, 2 is purple, 3 is orange, and 4 is yellow.
knitr::include_graphics("diagrams/join/setup.png", dpi = 270) diff --git a/layers.qmd b/layers.qmd index 9a5dc88..19bd1b2 100644 --- a/layers.qmd +++ b/layers.qmd @@ -9,7 +9,7 @@ status("complete") ## Introduction -In the @sec-data-visualisation, you learned much more than just how to make scatterplots, bar charts, and boxplots. +In the @sec-data-visualization, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make *any* type of plot with ggplot2. In this chapter, you'll expand on that foundation as you learn about the layered grammar of graphics. @@ -498,7 +498,7 @@ To learn more about any single geom, use the help (e.g. `?geom_smooth`). ## Facets -In @sec-data-visualisation you learned about faceting with `facet_wrap()`, which splits a plot into subplots that each display one subset of the data based on a categorical variable. +In @sec-data-visualization you learned about faceting with `facet_wrap()`, which splits a plot into subplots that each display one subset of the data based on a categorical variable. ```{r} #| fig-alt: > diff --git a/missing-values.qmd b/missing-values.qmd index 3772dc2..065f369 100644 --- a/missing-values.qmd +++ b/missing-values.qmd @@ -10,7 +10,7 @@ status("polishing") ## Introduction You've already learned the basics of missing values earlier in the book. -You first saw them in @sec-data-visualisation where they resulted in a warning when making a plot as well as in @sec-summarize where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in @sec-na-comparison. +You first saw them in @sec-data-visualization where they resulted in a warning when making a plot as well as in @sec-summarize where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in @sec-na-comparison. Now we'll come back to them in more depth, so you can learn more of the details. We'll start by discussing some general tools for working with missing values recorded as `NA`s. diff --git a/prog-strings.qmd b/prog-strings.qmd index 42b3c73..ee36521 100644 --- a/prog-strings.qmd +++ b/prog-strings.qmd @@ -128,19 +128,19 @@ str_extract_all(x, boundary("word")) ### Extract ```{r} -colours <- c("red", "orange", "yellow", "green", "blue", "purple") -colour_match <- str_c(colours, collapse = "|") -colour_match +colors <- c("red", "orange", "yellow", "green", "blue", "purple") +color_match <- str_c(colors, collapse = "|") +color_match -more <- sentences[str_count(sentences, colour_match) > 1] -str_extract_all(more, colour_match) +more <- sentences[str_count(sentences, color_match) > 1] +str_extract_all(more, color_match) ``` If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest: ```{r} -str_extract_all(more, colour_match, simplify = TRUE) +str_extract_all(more, color_match, simplify = TRUE) x <- c("a", "a b", "a b c") str_extract_all(x, "[a-z]", simplify = TRUE) diff --git a/quarto.qmd b/quarto.qmd index 506d23e..a1dc5f4 100644 --- a/quarto.qmd +++ b/quarto.qmd @@ -364,7 +364,7 @@ comma(.12358124331) ### Exercises -1. Add a section that explores how diamond sizes vary by cut, colour, and clarity. +1. Add a section that explores how diamond sizes vary by cut, color, and clarity. 
    Assume you're writing a report for someone who doesn't know R, and instead of setting `echo: false` on each chunk, set a global option.

 2. Download `diamond-sizes.qmd` from .
diff --git a/spreadsheets.qmd b/spreadsheets.qmd
index 985209c..37d2937 100644
--- a/spreadsheets.qmd
+++ b/spreadsheets.qmd
@@ -15,7 +15,7 @@ In this chapter we will introduce you to tools for working with data in Excel sp
 This will build on much of what you've learned in @sec-data-import but we will also discuss additional considerations and complexities when working with data from spreadsheets.

 If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper "Data Organization in Spreadsheets" by Karl Broman and Kara Woo: .
-The best practices presented in this paper will save you much headache down the line when you import the data from a spreadsheet into R to analyse and visualise.
+The best practices presented in this paper will save you much headache down the line when you import the data from a spreadsheet into R to analyze and visualize.

 ## Excel
diff --git a/whole-game.qmd b/whole-game.qmd
index 1c59a38..7dcdef4 100644
--- a/whole-game.qmd
+++ b/whole-game.qmd
@@ -29,7 +29,7 @@ knitr::include_graphics("diagrams/data-science/whole-game.png", dpi = 270)

 Five chapters focus on the tools of data science:

 -   Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
-    In @sec-data-visualisation you'll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
+    In @sec-data-visualization you'll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.

 -   Visualisation alone is typically not enough, so in @sec-data-transform, you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
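To make that preview concrete, here is a minimal sketch of those verbs in action, assuming dplyr and nycflights13 are installed; the particular pipeline is an illustration, not code from the chapter:

```{r}
library(dplyr)
library(nycflights13)

# One example of each kind of verb previewed above:
flights |>
  filter(dest == "IAH") |>                        # filter out key observations
  select(year:day, dep_delay, arr_delay) |>       # select important variables
  mutate(gain = dep_delay - arr_delay) |>         # create a new variable
  summarize(avg_gain = mean(gain, na.rm = TRUE))  # compute a summary
```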