<p>Working with data provided by R packages is a great way to learn data science tools, but at some point you’ll want to apply what you’ve learned to your own data. In this chapter, you’ll learn the basics of reading data files into R.</p>
<p>Specifically, this chapter will focus on reading plain-text rectangular files. We’ll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you’ll learn how to handcraft data frames in R.</p>
<p>To begin, we’ll focus on the most common rectangular data file type: CSV, which is short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows provide the data.</p>
<p>We can read this file into R using <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>. The first argument is the most important: it’s the path to the file.</p>
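<p>As a minimal sketch of that call, the example below first writes a small CSV to a temporary file and then reads it back. The file contents, the student names, and the <code>students_path</code> variable are made up for illustration; in practice you would pass the path to your own file.</p>

```r
library(readr)

# Hypothetical sample data written to a temporary file; the names and
# values are invented for illustration only.
students_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Student ID,Full Name,favourite.food,AGE",
    "1,Ada Example,Strawberry yoghurt,4",
    "2,Ben Sample,French fries,5"
  ),
  students_path
)

# The first argument is the path to the file
students <- read_csv(students_path)
students
```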
<p>When you run <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message. This message is an integral part of readr, and we’ll return to it in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>.</p>
<p>Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the <code>students</code> data with that in mind.</p>
<p>In the <code>favourite.food</code> column, there are a bunch of food items, and then the character string <code>N/A</code>, which should have been a real <code>NA</code> that R will recognize as “not available”. This is something we can address using the <code>na</code> argument.</p>
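<p>A small sketch of the <code>na</code> argument, using made-up inline data rather than the chapter’s file: both the string <code>N/A</code> and an empty field are treated as missing.</p>

```r
library(readr)

# Hypothetical inline data; I() marks the string as literal data, not a path
csv_text <- I("Student ID,favourite.food\n1,Strawberry yoghurt\n2,N/A\n3,")

# Treat both "N/A" and the empty string as missing values
students <- read_csv(csv_text, na = c("N/A", ""), show_col_types = FALSE)

is.na(students$favourite.food)
```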
<p>You might also notice that the <code>Student ID</code> and <code>Full Name</code> columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names. To refer to them, you need to use those backticks:</p>
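<p>For instance, renaming the two columns might look like the sketch below. It assumes dplyr is available, and the inline data and the new names are illustrative.</p>

```r
library(readr)
library(dplyr)

# Hypothetical inline data with non-syntactic column names
students <- read_csv(
  I("Student ID,Full Name\n1,Ada Example\n2,Ben Sample"),
  show_col_types = FALSE
)

# Backticks are needed on the right-hand side because the names contain spaces
students <- students |>
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`
  )

names(students)
```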
<p>An alternative approach is to use <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code>, which applies some heuristics to turn them all into snake case at once<span data-type="footnote">The <a href="http://sfirke.github.io/janitor/">janitor</a> package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use <code>|></code>.</span>.</p>
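<p>A short sketch of that approach, assuming the janitor package is installed and again using made-up inline data:</p>

```r
library(readr)
library(janitor)

# Hypothetical inline data with inconsistent column names
students <- read_csv(
  I("Student ID,Full Name,favourite.food,AGE\n1,Ada Example,Pizza,4"),
  show_col_types = FALSE
)

# Convert every column name to snake case in one step
students <- students |> clean_names()

names(students)
```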
<p>Another common task after reading in data is to consider variable types. For example, <code>meal_type</code> is a categorical variable with a known set of possible values, which in R should be represented as a factor:</p>
<p>Note that the values in the <code>meal_type</code> variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (<code>&lt;chr&gt;</code>) to factor (<code>&lt;fct&gt;</code>). You’ll learn more about factors in <a href="#chp-factors" data-type="xref">#chp-factors</a>.</p>
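<p>The conversion could be sketched as follows. The tibble is a hand-made stand-in for the <code>students</code> data, and the meal values are assumptions chosen for illustration.</p>

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for the students data
students <- tibble(
  meal_type = c("Lunch only", "Breakfast and lunch", "Lunch only")
)

# Convert the character column to a factor; the values are unchanged
students <- students |>
  mutate(meal_type = factor(meal_type))

levels(students$meal_type)
```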
<p>Before you analyze these data, you’ll probably want to fix the <code>age</code> column. Currently, it’s a character variable because one of the observations is typed out as <code>five</code> instead of a numeric <code>5</code>. We discuss the details of fixing this issue in <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>.</p>
<p>There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> can read CSV files that you’ve created in a string:</p>
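<p>A minimal sketch of the trick: wrapping the string in <code>I()</code> tells readr to treat it as literal data rather than a file path. The variable name <code>inline</code> is arbitrary.</p>

```r
library(readr)

# Literal CSV text: a header row followed by two data rows
inline <- read_csv(
  I("a,b,c\n1,2,3\n4,5,6"),
  show_col_types = FALSE
)

inline
```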
<p>Usually, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> uses the first line of the data for the column names, which is a very common convention. But it’s not uncommon for a few lines of metadata to be included at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p>
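<p>Both options can be sketched with inline data. The metadata lines and the variable names are made up for illustration.</p>

```r
library(readr)

# Skip two leading lines of metadata before the header row
skipped <- read_csv(
  I("The first line of metadata\nThe second line of metadata\nx,y,z\n1,2,3"),
  skip = 2,
  show_col_types = FALSE
)

# Drop every line that starts with the comment character
commented <- read_csv(
  I("# A comment I want to skip\nx,y,z\n1,2,3"),
  comment = "#",
  show_col_types = FALSE
)
```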
<p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> not to treat the first row as headings and instead label them sequentially from <code>X1</code> to <code>Xn</code>:</p>
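<p>A brief sketch with made-up inline data. The second call shows that <code>col_names</code> also accepts a character vector of names, which is part of readr’s documented interface but goes slightly beyond the case described above.</p>

```r
library(readr)

# No header row: readr labels the columns X1, X2, X3
unnamed <- read_csv(
  I("1,2,3\n4,5,6"),
  col_names = FALSE,
  show_col_types = FALSE
)

# Alternatively, supply the column names yourself
named <- read_csv(
  I("1,2,3\n4,5,6"),
  col_names = c("x", "y", "z"),
  show_col_types = FALSE
)
```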
<p>These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your <code>.csv</code> file and read the documentation for <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>’s many other arguments.)</p>
<p>Once you’ve mastered <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:</p>
<ul><li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv2()</a></code> reads semicolon-separated files. These use <code>;</code> instead of <code>,</code> to separate fields and are common in countries that use <code>,</code> as the decimal marker.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_delim()</a></code> reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code> reads fixed-width files. You can specify fields by their widths with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_widths()</a></code> or by their positions with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_positions()</a></code>.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_table.html">read_table()</a></code> reads a common variation of fixed-width files where columns are separated by white space.</p></li>
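<p>Two of the functions listed above can be sketched with inline data. The values and variable names are made up; the point is only that the calls mirror <code>read_csv()</code>.</p>

```r
library(readr)

# Semicolon-separated fields with "," as the decimal marker
semicolon <- read_csv2(
  I("x;y\n1,5;2\n3,25;4"),
  show_col_types = FALSE
)

# Any other delimiter can be given explicitly to read_delim()
piped <- read_delim(
  I("x|y\n1|2\n3|4"),
  delim = "|",
  show_col_types = FALSE
)

semicolon$x
```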
<li><p>Apart from <code>file</code>, <code>skip</code>, and <code>comment</code>, what other arguments do <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> have in common?</p></li>
<li><p>What are the most important arguments to <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code>?</p></li>
<p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> assumes that the quoting character will be <code>"</code>. To read the following text into a data frame, what argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> do you need to specify?</p>
<p>A CSV file doesn’t contain any information about the type of each variable (i.e., whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.</p>
<p>readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000<span data-type="footnote">You can override the default of 1000 with the <code>guess_max</code> argument.</span> rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:</p>
<ul><li>Does it contain only numbers (e.g., <code>1</code>, <code>-4.5</code>, <code>5e6</code>, <code>Inf</code>)? If so, it’s a number.</li>
<li>Does it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll return to date-times in more detail in <a href="#sec-creating-datetimes" data-type="xref">#sec-creating-datetimes</a>).</li></ul>
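<p>You can watch the guessing at work on a small inline example. The column names and values below are illustrative; the sketch also includes a logical column and a plain-text column to show what readr does with values that are neither numbers nor dates.</p>

```r
library(readr)

# Hypothetical inline data with one column per kind of value
guessed <- read_csv(
  I("logical,numeric,date,string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
T,Inf,2021-02-16,ghi"),
  show_col_types = FALSE
)

# Inspect the class readr chose for each column
guessed_types <- sapply(guessed, function(column) class(column)[1])
guessed_types
```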
<p>The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the <code>NA</code> that readr expects.</p>
<p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled among them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list where the names match the column names in the CSV file:</p>
<p>Now <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> reports that there was a problem, and tells us we can find out more with <code><a href="https://readr.tidyverse.org/reference/problems.html">problems()</a></code>:</p>
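<p>Put together, the two steps might look like the sketch below. The inline data is an assumed reconstruction of the small example under discussion: a single column <code>x</code> with a <code>.</code> standing in for a missing value.</p>

```r
library(readr)

# Assumed example data: one numeric column with "." used for a missing value
simple_csv <- I("x\n10\n.\n20\n30")

# Force x to be parsed as a double; readr warns about the value it cannot parse
df <- read_csv(
  simple_csv,
  col_types = list(x = col_double())
)

# List the parsing failures recorded for this data frame
problems(df)
```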
<p>This tells us that there was a problem in row 3, col 1, where readr expected a double but got a <code>.</code>. That suggests this dataset uses <code>.</code> for missing values. Once we set <code>na = "."</code>, the automatic guessing succeeds, giving us the numeric column that we want:</p>
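<p>A sketch of the fix, using the same assumed inline data as a stand-in for the example file:</p>

```r
library(readr)

# Assumed example data: "." marks a missing value
simple_csv <- I("x\n10\n.\n20\n30")

# Declare "." as a missing-value marker and let readr guess the type
df <- read_csv(simple_csv, na = ".", show_col_types = FALSE)

df$x
```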
<li><code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_logical()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_double()</a></code> read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.</li>
<li><code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_integer()</a></code> reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.</li>
<li><code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_character()</a></code> reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e., a long series of digits that identifies some object, but that it doesn’t make sense to (e.g.) divide in half.</li>
<li><code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>, <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code>, and <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in <a href="#chp-factors" data-type="xref">#chp-factors</a> and <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>.</li>
<li><code><a href="https://readr.tidyverse.org/reference/parse_number.html">col_number()</a></code> is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in <a href="#chp-numbers" data-type="xref">#chp-numbers</a>.</li>
<li><code><a href="https://readr.tidyverse.org/reference/col_skip.html">col_skip()</a></code> skips a column so it’s not included in the result.</li>
</ul><p>It’s also possible to override the default column type by switching from <code><a href="https://rdrr.io/r/base/list.html">list()</a></code> to <code><a href="https://readr.tidyverse.org/reference/cols.html">cols()</a></code> and specifying <code>.default</code>:</p>
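<p>For example, the following sketch reads every column as a string. The inline data and the name <code>another_csv</code> are made up for illustration.</p>

```r
library(readr)

# Hypothetical inline data that readr would otherwise guess as numeric
another_csv <- I("x,y,z\n1,2,3")

# Read every column as character unless told otherwise
df <- read_csv(
  another_csv,
  col_types = cols(.default = col_character())
)

df
```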
<p>Another useful helper is <code><a href="https://readr.tidyverse.org/reference/cols.html">cols_only()</a></code> which will read in only the columns you specify:</p>
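<p>A matching sketch with the same made-up inline data, keeping only the <code>x</code> column:</p>

```r
library(readr)

# Hypothetical inline data with three columns
another_csv <- I("x,y,z\n1,2,3")

# Only the columns named in cols_only() are read; y and z are dropped
df <- read_csv(
  another_csv,
  col_types = cols_only(x = col_character())
)

names(df)
```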
<p>Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: <code>01-sales.csv</code> for January, <code>02-sales.csv</code> for February, and <code>03-sales.csv</code> for March. With <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> you can read these data in at once and stack them on top of each other in a single data frame.</p>
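<p>The sketch below creates three small files in a temporary directory and reads them in one call. The directory, the column names (<code>month</code>, <code>n</code>), and the values are assumptions for illustration; the <code>id</code> argument stores each row’s source path in a column named <code>file</code>.</p>

```r
library(readr)

# Create three hypothetical monthly files in a temporary directory
sales_dir <- tempfile("sales")
dir.create(sales_dir)
sales_files <- file.path(
  sales_dir,
  c("01-sales.csv", "02-sales.csv", "03-sales.csv")
)
for (i in seq_along(sales_files)) {
  write_csv(data.frame(month = month.name[i], n = i * 10), sales_files[i])
}

# Pass a vector of paths; rows from all files are stacked together
sales <- read_csv(sales_files, id = "file", show_col_types = FALSE)
sales
```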
<p>Setting <code>id = "file"</code> adds a new column called <code>file</code> to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.</p>
<p>If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>.</p>
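<p>A minimal sketch of that lookup, using empty placeholder files in a temporary directory. The directory and the extra <code>notes.txt</code> file are made up to show that non-matching files are ignored.</p>

```r
# Create a hypothetical directory with two sales files and one unrelated file
sales_dir <- tempfile("sales")
dir.create(sales_dir)
file.create(file.path(sales_dir, c("01-sales.csv", "02-sales.csv", "notes.txt")))

# Match file names ending in "sales.csv" and return their full paths
sales_files <- list.files(
  sales_dir,
  pattern = "sales\\.csv$",
  full.names = TRUE
)

basename(sales_files)
```

<p>The resulting vector of paths can be passed straight to <code>read_csv()</code>.</p>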
<p>readr also comes with two useful functions for writing data back to disk: <code><a href="https://readr.tidyverse.org/reference/write_delim.html">write_csv()</a></code> and <code><a href="https://readr.tidyverse.org/reference/write_delim.html">write_tsv()</a></code>. Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.</p>
<p>The most important arguments are <code>x</code> (the data frame to save), and <code>file</code> (the location to save it). You can also specify how missing values are written with <code>na</code>, and if you want to <code>append</code> to an existing file.</p>
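<p>These arguments can be sketched as follows. The data, the temporary path, and the choice of <code>"missing"</code> as the marker for missing values are all illustrative.</p>

```r
library(readr)
library(tibble)

# Hypothetical data with one missing value
students <- tibble(id = 1:2, food = c("Pizza", NA))
students_path <- tempfile(fileext = ".csv")

# x is the data frame, file is the destination, na controls how NA is written
write_csv(students, students_path, na = "missing")

# append = TRUE adds rows to the existing file without repeating the header
write_csv(tibble(id = 3L, food = "Soup"), students_path, append = TRUE)

readLines(students_path)
```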
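<p>Because a CSV file is plain text, it cannot store column types, so type information such as factors is lost when you write a data frame to CSV and read it back in. This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load in. There are two main alternatives:</p>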
<p><code><a href="https://readr.tidyverse.org/reference/read_rds.html">write_rds()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_rds.html">read_rds()</a></code> are uniform wrappers around the base functions <code><a href="https://rdrr.io/r/base/readRDS.html">saveRDS()</a></code> and <code><a href="https://rdrr.io/r/base/readRDS.html">readRDS()</a></code>. These store data in R’s custom binary format called RDS:</p>
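<p>The difference between the two formats can be sketched with a round trip. The factor column and the temporary paths are made up for illustration.</p>

```r
library(readr)
library(tibble)

# Hypothetical data with a factor column
students <- tibble(
  meal_type = factor(c("Lunch only", "Breakfast and lunch"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(students, csv_path)
write_rds(students, rds_path)

# The CSV round trip returns a character column; RDS keeps the factor
from_csv <- read_csv(csv_path, show_col_types = FALSE)
from_rds <- read_rds(rds_path)

class(from_csv$meal_type)
class(from_rds$meal_type)
```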
<p>The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages. We’ll return to arrow in more depth in <a href="#chp-arrow" data-type="xref">#chp-arrow</a>.</p>
<p>Sometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script. There are two useful functions to help you do this which differ in whether you lay out the tibble by columns or by rows. <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> works by column:</p>
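<p>A small sketch of column-wise entry; the column names and values are arbitrary.</p>

```r
library(tibble)

# Each argument defines one column
by_column <- tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)

by_column
```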
<p>Laying out the data by column can make it hard to see how the rows are related, so an alternative is <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>, short for <strong>tr</strong>ansposed t<strong>ibble</strong>, which lets you lay out your data row by row. <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code> is customized for data entry in code: column headings start with <code>~</code> and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form:</p>
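<p>A matching sketch of row-wise entry, with arbitrary column names and values:</p>

```r
library(tibble)

# Headings start with ~; each following line of code is one row of data
by_row <- tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60
)

by_row
```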
<p>We’ll use <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code> later in the book to construct small examples to demonstrate how various functions work.</p>
<p>In this chapter, you’ve learned how to load CSV files with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and to do your own data entry with <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>. You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll come back to data import a few times in this book: <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> covers Excel and Google Sheets, <a href="#chp-databases" data-type="xref">#chp-databases</a> will show you how to load data from databases, <a href="#chp-arrow" data-type="xref">#chp-arrow</a> covers parquet files, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> covers JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> covers websites.</p>
<p>Now that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.</p>