Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you’ll learn how to read plain-text rectangular files into R.
You’ll load these flat files with the readr package, which is part of the core tidyverse:
library(tidyverse)
To begin we’ll focus on the most rectangular data file type: the CSV, short for comma separate values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows give the data.
#> Student ID,Full Name,favourite.food,mealPlan,AGE
#> 1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
#> 2,Barclay Lynn,French fries,Lunch only,5
#> 3,Jayendra Lyne,N/A,Breakfast and lunch,7
#> 4,Leon Rossini,Anchovies,Lunch only,
#> 5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
#> 6,Güvenç Attila,Ice cream,Lunch only,6
The following table shows a representation of the same data.
Student ID | Full Name | favourite.food | mealPlan | AGE |
---|---|---|---|---|
1 | Sunil Huffmann | Strawberry yoghurt | Lunch only | 4 |
2 | Barclay Lynn | French fries | Lunch only | 5 |
3 | Jayendra Lyne | N/A | Breakfast and lunch | 7 |
4 | Leon Rossini | Anchovies | Lunch only | NA |
5 | Chidiegwu Dunkel | Pizza | Breakfast and lunch | five |
6 | Güvenç Attila | Ice cream | Lunch only | 6 |
We can read this file into R using read_csv(). The first argument is the most important: it’s the path to the file.
students <- read_csv("data/students.csv")
#> Rows: 6 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): Full Name, favourite.food, mealPlan, AGE
#> dbl (1): Student ID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When you run read_csv(), it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specification (names of columns organized by the type of data the column contains). It also prints out some information about how to retrieve the full column specification as well as how to quiet this message. This message is an important part of readr, and we’ll come back to it later in this chapter when we discuss column types.
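If you want to act on that message right away, here is a minimal sketch (not part of the original example, assuming students was just created as above) of the two options it mentions:

spec(students)                                          # retrieve the full column specification readr used
read_csv("data/students.csv", show_col_types = FALSE)   # re-read the file, quieting the message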
Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the students data with that in mind.
In the favourite.food column, there are a bunch of food items and then the character string N/A, which should have been a real NA that R will recognize as “not available”. This is something we can address using the na argument.
students <- read_csv("data/students.csv", na = c("N/A", ""))

students
#> # A tibble: 6 × 5
#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
#>          <dbl> <chr>            <chr>              <chr>               <chr>
#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2            2 Barclay Lynn     French fries       Lunch only          5    
#> 3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6            6 Güvenç Attila    Ice cream          Lunch only          6
You might also notice that the Student ID and Full Name columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names. To refer to them, you need to use those backticks:
students |> 
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )
#> # A tibble: 6 × 5
#>   student_id full_name        favourite.food     mealPlan            AGE  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
An alternative approach is to use janitor::clean_names(), which uses some heuristics to turn them all into snake case at once. (The janitor package, http://sfirke.github.io/janitor/, is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use |>.)
students |> janitor::clean_names()
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <chr>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
Another common task after reading in data is to consider variable types. For example, meal_plan is a categorical variable with a known set of possible values, which in R should be represented as a factor:
students |>
  janitor::clean_names() |>
  mutate(
    meal_plan = factor(meal_plan)
  )
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan           age  
#>        <dbl> <chr>            <chr>              <fct>               <chr>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
#> 2          2 Barclay Lynn     French fries       Lunch only          5    
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
Note that the values in the meal_plan variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (<chr>) to factor (<fct>). You’ll learn more about factors in the chapter on factors.
Before you move on to analyzing these data, you’ll probably want to fix the age column as well: currently it’s a character variable because of the one observation that is typed out as five instead of a numeric 5. We discuss the details of fixing this issue in the chapter on spreadsheets.
students <- students |>
  janitor::clean_names() |>
  mutate(
    meal_plan = factor(meal_plan),
    age = parse_number(if_else(age == "five", "5", age))
  )

students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <fct>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: read_csv() can read CSV files that you’ve created in a string:
read_csv( "a,b,c 1,2,3 4,5,6" ) #> # A tibble: 2 × 3 #> a b c #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6
Usually, read_csv() uses the first line of the data for the column names, which is a very common convention. But sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines or use comment = "#" to drop all lines that start with (e.g.) #:
read_csv( "The first line of metadata The second line of metadata x,y,z 1,2,3", skip = 2 ) #> # A tibble: 1 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 read_csv( "# A comment I want to skip x,y,z 1,2,3", comment = "#" ) #> # A tibble: 1 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3
In other cases, the data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings and instead label them sequentially from X1 to Xn:
read_csv( "1,2,3 4,5,6", col_names = FALSE ) #> # A tibble: 2 × 3 #> X1 X2 X3 #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6
Alternatively, you can pass col_names a character vector which will be used as the column names:
read_csv( "1,2,3 4,5,6", col_names = c("x", "y", "z") ) #> # A tibble: 2 × 3 #> x y z #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6
These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your .csv file and carefully read the documentation for read_csv()’s many other arguments.)
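To give a flavor of what’s there, here is a minimal sketch (with made-up inline data, not from the original text) of one such argument, n_max, which caps the number of data rows that get read:

read_csv(
  "a,b,c
  1,2,3
  4,5,6
  7,8,9",
  n_max = 2   # read at most two data rows
)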
Once you’ve mastered read_csv(), using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:
- read_csv2() reads semicolon-separated files. These use ; instead of , to separate fields, and are common in countries that use , as the decimal marker.
- read_tsv() reads tab-delimited files.
- read_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.
- read_fwf() reads fixed-width files. You can specify fields either by their widths with fwf_widths() or by their positions with fwf_positions().
- read_table() reads a common variation of fixed-width files where columns are separated by white space.
- read_log() reads Apache-style log files.
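To make the first three concrete, here is a small sketch (with made-up inline data, not from the original text) reading the same two rows in three delimiter styles:

read_csv2("x;y\n1,5;2\n3;4,5")                 # ";" between fields, "," as the decimal mark
read_tsv("x\ty\n1.5\t2\n3\t4.5")               # tab-separated
read_delim("x|y\n1.5|2\n3|4.5", delim = "|")   # any delimiter you name (or let it guess)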
1.  What function would you use to read a file where fields were separated with “|”?
2.  Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
3.  What are the most important arguments to read_fwf()?
4.  Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". What argument to read_csv() do you need to specify to read the following text into a data frame?

    "x,y\n1,'a,b'"
5.  Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

    read_csv("a,b\n1,2,3\n4,5,6")
    read_csv("a,b,c\n1,2\n1,2,3,4")
    read_csv("a,b\n\"1")
    read_csv("a,b\n1,2\na,b")
    read_csv("a;b\n1;3")
6.  Practice referring to non-syntactic names in the following data frame by:

    a.  Extracting the variable called 1.
    b.  Plotting a scatterplot of 1 vs 2.
    c.  Creating a new column called 3, which is 2 divided by 1.
    d.  Renaming the columns to one, two, and three.

    annoying <- tibble(
      `1` = 1:10,
      `2` = `1` * 2 + rnorm(length(`1`))
    )
A CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a couple of general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.
readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000 rows spaced evenly from the first row to the last, ignoring any missing values. (You can override the default of 1,000 with the guess_max argument.) It then works through the following questions:
1.  Does it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it’s a logical.
2.  Does it contain only numbers (e.g. 1, -4.5, 5e6, Inf)? If so, it’s a number.
3.  Does it match the ISO8601 standard? If so, it’s a date or date-time.
4.  Otherwise, it must be a string.

You can see that behavior in action in this simple example:
read_csv(" logical,numeric,date,string TRUE,1,2021-01-15,abc false,4.5,2021-02-15,def T,Inf,2021-02-16,ghi" ) #> Rows: 3 Columns: 4 #> ── Column specification ──────────────────────────────────────────────────────── #> Delimiter: "," #> chr (1): string #> dbl (1): numeric #> lgl (1): logical #> date (1): date #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. #> # A tibble: 3 × 4 #> logical numeric date string #> <lgl> <dbl> <date> <chr> #> 1 TRUE 1 2021-01-15 abc #> 2 FALSE 4.5 2021-02-15 def #> 3 TRUE Inf 2021-02-16 ghi
This heuristic works well if you have a clean dataset, but in real life you’ll encounter a selection of weird and wonderful failures.
The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value recorded using something other than the NA that readr expects.
Take this simple 1 column CSV file as an example:
csv <- " x 10 . 20 30"
If we read it without any additional arguments, x becomes a character column:
df <- read_csv(csv)
#> Rows: 4 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): x
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In this very small case, you can easily see the missing value .. But what happens if you have thousands of rows with only a few missing values represented by .s speckled amongst them? One approach is to tell readr that x is a numeric column, and then see where it fails. You can do that with the col_types argument, which takes a named list:
df <- read_csv(csv, col_types = list(x = col_double()))
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
Now read_csv() reports that there was a problem, and tells us we can find out more with problems():
problems(df)
#> # A tibble: 1 × 5
#>     row   col expected actual file
#>   <int> <int> <chr>    <chr>  <chr>
#> 1     3     1 a double .      /private/tmp/Rtmp43JYhG/file7cf337a06034
This tells us that there was a problem in row 3, col 1, where readr expected a double but got a .. That suggests this dataset uses . for missing values. So when we set na = ".", the automatic guessing succeeds, giving us the numeric column that we want:
df <- read_csv(csv, na = ".")
#> Rows: 4 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (1): x
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
readr provides a total of nine column types for you to use:
- col_logical() and col_double() read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.
- col_integer() reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
- col_character() reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. a long series of digits that identifies some object, but it doesn’t make sense to (e.g.) divide it in half.
- col_factor(), col_date(), and col_datetime() create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in the chapters on factors and on dates and times.
- col_number() is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in the chapter on numbers.
- col_skip() skips a column so it’s not included in the result.

It’s also possible to override the default column type by switching from list() to cols():
csv <- " x,y,z 1,2,3" read_csv(csv, col_types = cols(.default = col_character())) #> # A tibble: 1 × 3 #> x y z #> <chr> <chr> <chr> #> 1 1 2 3
Another useful helper is cols_only(), which will read in only the columns you specify:
read_csv( "x,y,z 1,2,3", col_types = cols_only(x = col_character()) ) #> # A tibble: 1 × 1 #> x #> <chr> #> 1 1
Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. With read_csv() you can read these data in at once and stack them on top of each other in a single data frame.
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
#> Rows: 19 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): month
#> dbl (4): year, brand, item, n
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 19 × 6
#>   file              month    year brand  item     n
#>   <chr>             <chr>   <dbl> <dbl> <dbl> <dbl>
#> 1 data/01-sales.csv January  2019     1  1234     3
#> 2 data/01-sales.csv January  2019     1  8721     9
#> 3 data/01-sales.csv January  2019     1  1822     2
#> 4 data/01-sales.csv January  2019     2  3333     1
#> 5 data/01-sales.csv January  2019     2  2156     9
#> 6 data/01-sales.csv January  2019     2  3987     6
#> # … with 13 more rows
With the additional id parameter we have added a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.
If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base list.files() function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in the chapter on regular expressions.
sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
sales_files
#> [1] "data/01-sales.csv" "data/02-sales.csv" "data/03-sales.csv"
readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.
The most important arguments are x (the data frame to save) and file (the location to save it). You can also specify how missing values are written with na, and if you want to append to an existing file.
write_csv(students, "students.csv")
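If you also need the na or append arguments mentioned above, a minimal sketch (writing to a hypothetical scratch file) looks like this:

write_csv(students, "students-scratch.csv", na = "")        # write missing values as empty cells
write_csv(students, "students-scratch.csv", append = TRUE)  # add rows to the existing file (no header)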
Now let’s read that CSV file back in. Note that the type information is lost when you save to CSV:
students
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <fct>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6

write_csv(students, "students-2.csv")
read_csv("students-2.csv")
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <chr>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load them in. There are two main options:
write_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS:
write_rds(students, "students.rds")
read_rds("students.rds")
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <fct>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages:
library(arrow)
write_parquet(students, "students.parquet")
read_parquet("students.parquet")
#> # A tibble: 6 × 5
#>   student_id full_name        favourite_food     meal_plan             age
#>        <dbl> <chr>            <chr>              <fct>               <dbl>
#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
#> 2          2 Barclay Lynn     French fries       Lunch only              5
#> 3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
Parquet tends to be much faster than RDS and is usable outside of R, but it does require you to install the arrow package.
Sometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script. There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows. tibble() works by column:
tibble(
  x = c(1, 2, 5), 
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)
#> # A tibble: 3 × 3
#>       x y         z
#>   <dbl> <chr> <dbl>
#> 1     1 h      0.08
#> 2     2 m      0.83
#> 3     5 g      0.6
Note that every column in a tibble must be the same size, so you’ll get an error if they’re not:
tibble(
  x = c(1, 2),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.6)
)
#> Error:
#> ! Tibble columns must have compatible sizes.
#> • Size 2: Existing data.
#> • Size 3: Column `y`.
#> ℹ Only values of size one are recycled.
Laying out the data by column can make it hard to see how the rows are related, so an alternative is tribble(), short for transposed tibble, which lets you lay out your data row by row. tribble() is customized for data entry in code: column headings start with ~ and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form:
tribble(
  ~x, ~y, ~z,
  "h", 1, 0.08,
  "m", 2, 0.83,
  "g", 5, 0.60,
)
#> # A tibble: 3 × 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 h         1  0.08
#> 2 m         2  0.83
#> 3 g         5  0.6
We’ll use tibble() and tribble() later in the book to construct small examples to demonstrate how various functions work.
In this chapter, you’ve learned how to load CSV files with read_csv() and to do your own data entry with tibble() and tribble(). You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll come back to data import a few times in this book: the chapter on databases will show you how to load data from databases, the chapter on spreadsheets from Excel and Google Sheets, the chapter on rectangling from JSON, and the chapter on web scraping from websites.
Now that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.