diff --git a/import.Rmd b/import.Rmd
index 57aed9b..3b3bca0 100644
--- a/import.Rmd
+++ b/import.Rmd
@@ -190,6 +190,9 @@ Using parsers is mostly a matter of understanding what's available and how they
 1. `parse_character()` seems so simple that it shouldn't be necessary. But
    one complication makes it quite important: character encodings.
 
+1. `parse_factor()` creates factors, the data structure that R uses to represent
+   categorical variables with fixed and known values.
+
 1. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to
    parse various date & time specifications. These are the most complicated
    because there are so many different ways of writing dates.
@@ -240,7 +243,7 @@ parse_number("123.456.789", locale = locale(grouping_mark = "."))
 parse_number("123'456'789", locale = locale(grouping_mark = "'"))
 ```
 
-### Character
+### Strings {#readr-strings}
 
 It seems like `parse_character()` should be really simple --- it could just return its input. Unfortunately life isn't so simple, as there are multiple ways to represent the same string. To understand what's going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using `charToRaw()`:
 
@@ -280,6 +283,17 @@ The first argument to `guess_encoding()` can either be a path to a file, or, as
 
 Encodings are a rich and complex topic, and I've only scratched the surface here. If you'd like to learn more I'd recommend reading the detailed explanation at .
 
+### Factors {#readr-factors}
+
+R uses factors to represent categorical variables that have a known set of possible values. Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
+
+```{r}
+fruit <- c("apple", "banana")
+parse_factor(c("apple", "banana", "bananana"), levels = fruit)
+```
+
+If you have problematic entries, it's often easier to read in as strings and then use the tools you'll learn about in [strings] and [factors] to clean them up.
+
 ### Dates, date-times, and times {#readr-datetimes}
 
 You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). When called without any additional arguments: