Fix typos

This commit is contained in:
Bill Behrman 2016-07-08 17:03:07 -07:00
parent 37979814ca
commit 02db8dc189
1 changed files with 4 additions and 4 deletions

View File

@ -26,7 +26,7 @@ __Visualisation__ is a fundamentally human activity. A good visualisation will s
__Models__ are the complementary tools to visualisation. Models are a fundamentally mathematical or computational tool, so they generally scale well. Even when they don't, it's usually cheaper to buy more computers than it is to buy more brains. But every model makes assumptions, and by its very nature a model can not question its own assumptions. That means a model cannot fundamentally surprise you.
The last step of data science is __communication__, an absolutely critical part of any data analysis project. It doesn't matter how well models and visualisation have led you to understand the data, unless you can commmunicate your results to other people.
The last step of data science is __communication__, an absolutely critical part of any data analysis project. It doesn't matter how well models and visualisation have led you to understand the data, unless you can communicate your results to other people.
There's one important toolset that's not shown in the diagram: programming. Programming is a cross-cutting tool that you use in every part of the project. You don't need to be an expert programmer to be a data scientist, but learning more about programming pays off. Becoming a better programmer will allow you to automate common tasks, and solve new problems with greater ease.
@ -66,7 +66,7 @@ This book proudly focuses on small, in-memory datasets. This is the right place
Many big data problems are often small data problems in disguise. Often your complete dataset is big, but the data needed to answer a specific question is small. It's often possible to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [transform](#transform).
Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
Another class of big data problem consists of many small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent (sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like SparkR, rhipe, and ddr to solve it for the complete dataset.
### Big p data (many variables)
@ -102,7 +102,7 @@ To run the code in this book, you will need to install both R and the RStudio ID
### RStudio
RStudio is an integated development environment, or IDE, for R programming. There are three key regions:
RStudio is an integrated development environment, or IDE, for R programming. There are three key regions:
```{r, echo = FALSE}
knitr::include_graphics("screenshots/rstudio-layout.png")
@ -128,7 +128,7 @@ We strongly recommend making two changes to the default RStudio options:
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
This ensures that every time you restart RStudio you get a completely clean slate. This is good pratice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
This ensures that every time you restart RStudio you get a completely clean slate. This is good practice because it encourages you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code. During a project, it's good practice to regularly restart R either using the menu Session | Restart R or the keyboard shortcut Cmd + Shift + F10.
### R packages