This typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R.
If you can't get your data into R, you can't do data science on it!
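As a minimal sketch of what importing looks like, the following writes a small CSV file to a temporary location and reads it back into a data frame. It uses only base R so that it runs before any packages are installed; the file contents and the column names (`city`, `distance_km`, `time_h`) are invented for illustration.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The contents and column names are made up for illustration.
path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "city,distance_km,time_h",
    "Auckland,120,1.5",
    "Houston,300,3.0"
  ),
  path
)

# Import: read the file into a data frame.
trips <- read.csv(path)

# Inspect the structure of the imported data.
str(trips)
```

The same idea applies to databases and web APIs: whatever the source, the goal is to end up with a data frame in your R session.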
Once you've imported your data, it is a good idea to **tidy** it.
Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored.
In brief, when your data is tidy, each column is a variable, and each row is an observation.
Tidy data is important because the consistent structure lets you focus your effort on questions about the data, rather than on fighting to get the data into the right form for different functions.
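To make "each column is a variable, and each row is an observation" concrete, here is a small sketch in base R. The table `wide` and its values are hypothetical: it stores one variable (the year) in its column names, and the code rearranges it so that `country`, `year`, and `cases` each get their own column.

```r
# A hypothetical untidy table: the years are spread across column names.
wide <- data.frame(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(15, 25),
  check.names = FALSE
)

# Tidy version: one row per country-year observation,
# one column per variable (country, year, cases).
tidy <- data.frame(
  country = rep(wide$country, times = 2),
  year = rep(c(1999L, 2000L), each = nrow(wide)),
  cases = c(wide[["1999"]], wide[["2000"]])
)

tidy
```

This is only one way to do the reshaping by hand; dedicated tidying tools do the same job with less bookkeeping.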
Once you have tidy data, a common first step is to **transform** it.
Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means).
Together, tidying and transforming are called **wrangling**, because getting your data in a form that's natural to work with often feels like a fight!
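The three kinds of transformation can be sketched in a few lines of base R. The `trips` data frame below is made up for illustration, and base functions (`subset()`, `aggregate()`) are used only so that the example runs without any packages.

```r
# Hypothetical data: four trips in two cities.
trips <- data.frame(
  city = c("Auckland", "Auckland", "Houston", "Houston"),
  distance_km = c(120, 80, 300, 150),
  time_h = c(1.5, 1.0, 3.0, 2.0)
)

# 1. Narrow in on observations of interest: trips of at least 100 km.
long_trips <- subset(trips, distance_km >= 100)

# 2. Create a new variable that is a function of existing variables.
trips$speed_kmh <- trips$distance_km / trips$time_h

# 3. Calculate a summary statistic: mean speed per city.
mean_speed <- aggregate(speed_kmh ~ city, data = trips, FUN = mean)

mean_speed
```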
Programming is a cross-cutting tool that you use in every part of the project.
You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off: becoming a better programmer allows you to automate common tasks and solve new problems with greater ease.
You'll use these tools in every data science project, but for most projects they're not enough.
There's a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you'll learn in this book, but you'll need other tools to tackle the remaining 20%.
Throughout this book we'll point you to resources where you can learn more.
The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you'll iterate through them multiple times).
In our experience, however, this is not the best way to learn them:
- Starting with data ingest and tidying is sub-optimal because 80% of the time it's routine and boring, and the other 20% of the time it's weird and frustrating.
That's a bad place to start learning a new subject!
Instead, we'll start with visualisation and transformation of data that's already been imported and tidied.
That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth it.
- Programming tools are not necessarily interesting in their own right, but do allow you to tackle considerably more challenging problems.
We'll give you a selection of programming tools in the middle of the book, and then you'll see how they can combine with the data science tools to tackle interesting modelling problems.
Within each chapter, we try to stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details.
Each section of the book is paired with exercises to help you practice what you've learned.
While it's tempting to skip the exercises, there's no better way to learn than practicing on real problems.
This book proudly focuses on small, in-memory datasets.
This is the right place to start because you can't tackle big data unless you have experience with small data.
The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 GB of data.
If you're routinely working with larger data (10-100 GB, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table).
This book doesn't teach data.table because its very concise interface offers fewer linguistic cues, which makes it harder to learn.
But if you're working with large data, the performance payoff is worth the extra effort required to learn it.
If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise.
While the complete data might be big, often the data needed to answer a specific question is small.
You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in.
The challenge here is finding the right small data, which often requires a lot of iteration.
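As a rough sketch of the subsampling idea, the code below simulates a larger table and draws a random sample of rows from it. The data are simulated and the sample size of 1,000 is an arbitrary choice for illustration; with real data you would need to check that the sample still answers your question.

```r
set.seed(1)  # make the random draws reproducible

# Stand-in for data that is awkward to work with in full (simulated here).
full <- data.frame(
  id = seq_len(100000),
  value = rnorm(100000, mean = 50, sd = 10)
)

# Draw a random subsample of 1,000 rows without replacement.
idx <- sample(nrow(full), size = 1000)
small <- full[idx, ]

# Compare a summary computed on the full data and on the subsample.
c(full = mean(full$value), sample = mean(small$value))
```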
Another possibility is that your big data problem is actually a large number of small data problems.
Each individual problem might fit in memory, but you have millions of them.
For example, you might want to fit a model to each person in your dataset.
That would be trivial if you had just 10 or 100 people, but instead you have a million.
Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop or Spark) that allows you to send different datasets to different computers for processing.
Once you've figured out how to answer the question for a single subset using the tools described in this book, you learn new tools like sparklyr, rhipe, and ddr to solve it for the full dataset.
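The "many small problems" pattern can be sketched on a single machine with base R: split the data by person, then fit the same model to each piece. The data below are simulated, and `fit_one()` is a hypothetical helper written for this example. A distributed system plays the role of `lapply()` here, sending each piece to a different computer instead of looping over them locally.

```r
set.seed(42)  # make the simulated noise reproducible

# Simulated data: three people, ten observations each, true slope of 2.
people <- data.frame(
  person = rep(c("a", "b", "c"), each = 10),
  x = rep(1:10, times = 3)
)
people$y <- 2 * people$x + rnorm(nrow(people))

# Split into one small data frame per person.
by_person <- split(people, people$person)

# Hypothetical helper: solve the problem for a single subset.
fit_one <- function(d) lm(y ~ x, data = d)

# Each fit is independent of the others, so the work could be spread out.
models <- lapply(by_person, fit_one)

# Collect the estimated slope for each person.
slopes <- vapply(models, function(m) coef(m)[["x"]], numeric(1))
slopes
```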
However, we strongly believe that it's best to master one tool at a time.
You will get better faster if you dive deep, rather than spreading yourself thinly over many topics.
This doesn't mean you should only know one thing, just that you'll generally learn faster if you stick to one thing at a time.
You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next interesting thing.
We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science.
R is not just a programming language; it is also an interactive environment for doing data science.
To support interaction, R is a much more flexible language than many of its peers.
This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process.
These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.
This book focuses exclusively on rectangular data: collections of values that are each associated with a variable and an observation.
There are lots of datasets that do not naturally fit in this paradigm, including images, sounds, trees, and text.
But rectangular data frames are extremely common in science and industry, and we believe that they are a great place to start your data science journey.
It's possible to divide data analysis into two camps: hypothesis generation and hypothesis confirmation (sometimes called confirmatory analysis).
The focus of this book is unabashedly on hypothesis generation, or data exploration.
Here you'll look deeply at the data and, in combination with your subject knowledge, generate many interesting hypotheses to help explain why the data behaves the way it does.
You evaluate the hypotheses informally, using your scepticism to challenge the data in multiple ways.
2. You can only use an observation once to confirm a hypothesis.
As soon as you use it more than once you're back to doing exploratory analysis.
This means that to do hypothesis confirmation you need to "preregister" (write out in advance) your analysis plan, and not deviate from it even when you have seen the data.
It's common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation.
But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation.
The key difference is how often you look at each observation: if you look only once, it's confirmation; if you look more than once, it's exploration.
We've made a few assumptions about what you already know in order to get the most out of this book.
You should be generally numerically literate, and it's helpful if you have some programming experience already.
If you've never programmed before, you might find [Hands-On Programming with R](http://amzn.com/1449359019) by Garrett to be a useful adjunct to this book.
There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the **tidyverse**, and a handful of other packages.
Packages are the fundamental units of reproducible R code.
They include reusable functions, the documentation that describes how to use them, and sample data.
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork.
CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages.
Don't try to pick a mirror that's close to you: instead use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
A new major version of R comes out once a year, and there are 2-3 minor releases each year.
It's a good idea to update regularly.
Upgrading can be a bit of a hassle, especially for major versions, which require you to reinstall all your packages, but putting it off only makes it worse.
On your own computer, type that line of code in the console, and then press enter to run it.
R will download the packages from CRAN and install them on to your computer.
If you have problems installing, make sure that you are connected to the internet, and that <https://cloud.r-project.org/> isn't blocked by your firewall or proxy.
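Installing from CRAN usually comes down to a single call to `install.packages()`. The sketch below wraps that call in a check so that it only contacts CRAN when a package is actually missing. `missing_packages()` is a hypothetical helper written for this example, not part of R or the tidyverse, and passing `repos` explicitly is one way to point at the cloud mirror.

```r
# Hypothetical helper: report which of the requested packages are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages("tidyverse")

# Only download when something is missing and you are at the console.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```

Once a package is installed, you load it in each new session with `library()`, for example `library(tidyverse)`.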
There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles.
This doesn't make them better or worse, just different.
In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages.
As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
In your console, you type after the `>`, called the **prompt**; we don't show the prompt in the book.
In the book, output is commented out with `#>`; in your console it appears directly after your code.
These two differences mean that if you're working with an electronic version of the book, you can easily copy code out of the book and into the console.
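A small illustration of these conventions: the first line is what you would type after the prompt, and the second line shows how the book displays the output.

```r
1 + 2
#> [1] 3
```

Because the output line starts with `#`, R treats it as a comment, so pasting both lines into the console runs only the code.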
Typically adding "R" to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available.
Google is particularly useful for error messages.
If you get an error message and you have no idea what it means, try googling it!
Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web.
(If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
1. **Packages** should be loaded at the top of the script, so it's easy to see which ones the example needs.
This is a good time to check that you're using the latest version of each package; it's possible you've discovered a bug that's been fixed since you installed the package.
For packages in the tidyverse, the easiest way to check is to run `tidyverse_update()`.
2. The easiest way to include **data** in a question is to use `dput()` to generate the R code to recreate it.
For example, to recreate the `mtcars` dataset in R, I'd perform the following steps:
1. Run `dput(mtcars)` in R
2. Copy the output
3. In my reproducible script, type `mtcars <-` then paste.
Try to find the smallest subset of your data that still reveals the problem.
3. Spend a little bit of time ensuring that your **code** is easy for others to read:
- Make sure you've used spaces and your variable names are concise, yet informative.
- Use comments to indicate where your problem lies.
- Do your best to remove everything that is not related to the problem.\
The shorter your code is, the easier it is to understand, and the easier it is to fix.
Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.
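The `dput()` step above can be sketched with a small subset of `mtcars`. The choice of columns and rows is arbitrary, and writing to a temporary file and reading it back with `dget()` is used here only to check that the generated code really does recreate the data.

```r
# Keep the example small: the first three rows of two columns.
small <- head(mtcars[, c("mpg", "cyl")], 3)

# Print R code that recreates `small`; this is what you would paste
# into your reproducible script after typing `small <-`.
dput(small)

# Round-trip check: write the code to a file, read it back, and compare.
path <- tempfile(fileext = ".R")
dput(small, file = path)
recreated <- dget(path)

all.equal(small, recreated)
```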
You should also spend some time preparing yourself to solve problems before they occur.
Investing a little time in learning R each day will pay off handsomely in the long run.
One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org).
This is where we post announcements about new packages, new IDE features, and in-person courses.
You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)) or Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
To keep up with the R community more broadly, we recommend reading <http://www.r-bloggers.com>: it aggregates over 500 blogs about R from around the world.
If you're an active Twitter user, follow the [`#rstats`](https://twitter.com/search?q=%23rstats) hashtag.
Twitter is one of the key tools that Hadley uses to keep up with new developments in the community.
This book isn't just the product of Hadley and Garrett, but is the result of many conversations (in person and online) that we've had with the many people in the R community.
There are a few people we'd like to thank in particular, because they have spent many hours answering our questions and helping us to better think about data science:
- Genevera Allen for discussions about models, modelling, the statistical learning perspective, and the difference between hypothesis generation and hypothesis confirmation.