diff --git a/intro.Rmd b/intro.Rmd
index e292ce2..834188a 100644
--- a/intro.Rmd
+++ b/intro.Rmd
@@ -33,7 +33,7 @@ __Models__ are the complementary tools to visualisation. Models are a fundamenta
 
 It doesn't matter how well models and visualisation have led you to understand the data, unless you can __commmunicate__ your results to other people. Communication is an absolutely critical part of any data analysis project.
 
-There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming often pays off. Becoming a better programmer will allow you automate common tasks, and solve new problems with greater ease.
+There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off. Becoming a better programmer will allow you automate common tasks, and solve new problems with greater ease.
 
 You'll use these tools in every data science project, but for most projects they're not enough. There's a rough 80-20 rule at play: you can probably tackle 80% of every project using the tools we'll teach you, but you'll need more to tackle the remaining 20%. Throughout this book we'll point you to resources where you can learn more.
 
@@ -41,79 +41,47 @@ You'll use these tools in every data science project, but for most projects they
 
 The above description of the tools of data science was organised roughly around the order in which you use them in analysis (although of course you'll iterate through them multiple times). In our experience, however, this is not the best way to learn them:
 
-* Starting with data ingest is boring. 80% of the time it's routine and boring,
-  and the other 20% of the time it's horrendously frustrating. That's not a
-  good place to start learning! Instead, we'll start with visualisation and
-  transformation on on data that's already been imported and tidied. That will
-  keep your motivation high so that when you tackle the frustrating parts of
-  getting your data in R for your own projects, you'll know the payoff and be
-  motivated to stick with it through the pain.
+* Starting with data ingest and tidying is sub-optimal because 80% of the time
+  it's routine and boring, and the other 20% of the time it's horrendously
+  frustrating. Instead, we'll start with visualisation and transformation on
+  data that's already been imported and tidied. That way, when you ingest
+  and tidy your own data, you'll be able to keep your motivation high because
+  you know the pain is worth it because of what you can accomplish once it's
+  done.
 
-* Some topics, like modelling, are best explained with other tools, like
-  visualisation, transformation and tidying.
+* Some topics are best explained with other tools. For example, we believe that
+  it's easier to understand how models work as a tool for data science if you
+  already know about visualisation, data transformation, and tidy data.
 
 * Programming tools are not necessarily interesting in their own right, but do
   allow you to tackle considerably more challenging problems. We'll give you a
   selection of programming tools in the middle of the book, and
-  then show you how they synergise with the key data science tools to tackle
-  intersting problems.
-
-We've polished this order based on our experience teaching live classes, and it's been carefully designed to keep you motivated.
+  then finish off by showing how they combine with the key data science tools
+  to tackle interesting problems.
 
 Within each chapter, we try and stick to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you've learned. It's tempting to skip these, but there's no better way to learn than practicing.
 
 ## What you won't learn
 
-There are some important topics that this book doesn't cover. Here I want to talk about them briefly and tell you why.
+There are some important topics that this book doesn't cover. We believe it's important to stay ruthlessly focussed on the essentials so you can get up and running as quickly as possible. That means this book can't cover every important topic.
 
 ### Big data
 
-This book proudly focussed on in-memory, or small, datasets.
+This book proudly focusses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). We don't teach it here because it has a very concise interface that is harder to learn because it offers fewer linguistic cues. But if you're working with large data, the performance payoff is worth a little extra effort to learn it.
 
-* Small data: data that fits in memory on a laptop, ~10 GB. Note that small
-  data is still big! R is great with small data. Pointer to data.table.
-
-* Medium data: data that fits in memory on a powerful server, ~5 TB. It's
-  possible to use R with this much data, but it's challenging. Dealing
-  effectively with medium data requires effective use of all cores on a
-  computer. It's not that hard to do that from R, but it requires some thought,
-  and many packages do not take advantage of R's tools.
-
-* Big data: data that must be stored on disk or spread across the memory of
-  multiple machines. Writing code that works efficiently with this sort of data
-  is a very challenging. Tools for this sort of data will never be written in
-  R: they'll be written in a language specially designed for high performance
-  computing like C/C++, Fortran or Scala. But R can still talk to these systems.
-
-The other thing to bear in mind, is that while all your data might be big, typically you don't need all of it to answer a specific question:
+Many big data problems are small data problems in disguise. Often your complete dataset is big, but the data needed to answer a specific question is small. It's often possible to find a subset, subsample, or summary that fits in memory and still allows you to answer the question you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [transform](#transform).
 
-* Many questions can be answered with the right small dataset. It's often
-  possible to find a subset, subsample, or summary that fits in memory and
-  still allows you to answer the question you're interested in. The challenge
-  here is finding the right small data, which often requires a lot of iteration.
-
-* Other challenges are because an individual problem might fit in memory,
-  but you have hundreds of thousands or millions of them. For example, you
-  might want to fit a model to each person in your dataset. That would be
-  trivial if you had just 10 or 100 people, but instead you have a million.
-  Fortunately each problem is independent (sometimes called embarassingly
-  parallel), so you just need a system (like hadoop) that allows you to
-  send different datasets to different computers for processing.
+Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
 
 ### Python
 
-In this book, you won't learn anything about Python, or any other programming language. This isn't because we think Python is bad! It's a great tool, and most data science teams use a mix of (at least!) R and Python.
+In this book, you won't learn anything about Python, Julia, or any other programming language useful for data science. This isn't because we think these tools are bad. They're not! And in practice, most data science teams use a mix of languages, often at least R and Python.
 
 However, we strongly believe that it's best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn't mean you should be only know one thing, just that you'll generally learn faster if you stick to one thing at a time.
 
-### Non-data-frame data
+### Non-rectangular data
 
-No trees or graphs, images or sounds.
-
-
-## Talking about data science
-
-Throughout the book, we will discuss the principles of data that will help you become a better scientist. That begins here. We will refer to the terms below throughout the book because they are so useful.
+This book focusses exclusively on rectangular data, data made up of variables, observations, and values:
 
 * A _variable_ is a quantity, quality, or property that you can measure.
 
@@ -124,8 +92,6 @@ Throughout the book, we will discuss the principles of data that will help you b
   (usually all at the same time or on the same object). Observations contain
   values that you measure on different variables.
 
-These terms will help us speak precisely about the different parts of a data set. They will also provide a system for turning data into insights.
-
 This book focuses exclusively on structured data sets: collections of values that are each associated with a variable and an observation. There are lots of data that doesn't naturally fit in this paradigm: images, sounds, trees, text. But data frames are extremely common in science and in industry and we believe that they're a great place to start your data analysis journey.
 
 ## Prerequisites
@@ -134,9 +100,8 @@ We've made few assumptions about what you already know in order to get the most
 
 To run the code in this book, you will need to install both R and the RStudio IDE, an application that makes R easier to use. Both are open source, free and easy to install:
 
-1. Download R and install R from .
-1. Download and install RStudio from .
-1. Open RStudio like you would any operating system.
+1. Download R and install R, .
+1. Download and install RStudio, .
 
 You'll also need to install some R packages. An R _package_ is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. To install all the packages used in this book open RStudio and run:
 
@@ -151,15 +116,15 @@ install.packages(pkgs)
 
 R will download the packages from CRAN and install them in your system library. If you have problems installing, make that you are connected to the internet, and that you haven't blocked in your firewall or proxy.
 
-After you have downloaded the packages, you can load any of the packages into your current R session with the `library()` command, e.g.
+You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. After you have downloaded the packages, you can load any of the packages into your current R session with the `library()` command, e.g.
 
 ```{r, eval = FALSE}
 library(tidyr)
 ```
 
-You will not be able to use the functions, objects, and help files in a package until you load it with `library()`. You will need to reload the package if you start a new R session.
+You will need to reload the package every time you start a new R session.
 
-## RStudio
+### RStudio
 
 Brief RStudio orientation (code, console, and output). Pointers to where to learn more.
diff --git a/toc.rds b/toc.rds
index 13ad3b7..92f93e5 100644
Binary files a/toc.rds and b/toc.rds differ
diff --git a/toc.yaml b/toc.yaml
index 5e672ef..60e9100 100644
--- a/toc.yaml
+++ b/toc.yaml
@@ -1,13 +1,21 @@
 contribute.rmd: []
-expressing-yourself.Rmd: []
+data-structures.Rmd: []
+datetimes.Rmd: []
+eda.Rmd: []
+functions.Rmd: []
 import.Rmd: []
-index.rmd: []
+index.rmd: toc
 intro.Rmd: []
 lists.Rmd:
 - hierarchy
 - walk
+model-assess.Rmd: []
+model-vis.Rmd: []
+model.Rmd: []
+rmarkdown.Rmd: []
+shiny.Rmd: []
 strings.Rmd: []
 temp.Rmd: []
 tidy.Rmd: []
-transform.Rmd: []
+transform.Rmd: transform
 visualize.Rmd: []
diff --git a/transform.Rmd b/transform.Rmd
index cb9f2ea..f5ba7ac 100644
--- a/transform.Rmd
+++ b/transform.Rmd
@@ -4,6 +4,8 @@ title: Data transformation
 output: bookdown::html_chapter
 ---
 
+# Data transformation {#transform}
+
 Copy from dplyr vignettes.
 
 ## Filter