Flesh out R markdown workflow

hadley 2016-08-23 16:40:35 -05:00
parent 67fa6287e1
commit e29e3a5ec0
1 changed file with 53 additions and 21 deletions

# R Markdown workflow
Earlier we discussed a basic workflow for capturing your R code where you work
interactively in the _console_, then capture what works in the _script editor_.
R Markdown effectively puts the console and the script editor in the same
place, blurring the lines between interactive exploration and long-term code
capture. You can rapidly iterate within a chunk, editing and re-executing with
Cmd/Ctrl + Shift + Enter. When you're happy, you move on and start a new chunk.

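For example, you might iterate on a chunk like the following until the plot
looks right, then start a fresh chunk for the next step (a minimal sketch; the
dataset and aesthetics are just placeholders):

```{r}
library(ggplot2)

# Re-run just this chunk with Cmd/Ctrl + Shift + Enter while you
# experiment; once you're happy with the plot, start a new chunk.
ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 1 / 10)
```
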
R Markdown is also important because it so tightly integrates prose and code.
This makes it a great __analysis notebook__ because it lets you develop code
and record your thoughts in one place. An analysis notebook shares many of the
same goals as a classic lab notebook in the physical sciences:

* Record what you did and why you did it. Regardless of how great your
  memory is, if you don't record what you do, there will come a time when
  you have forgotten important details. Write them down so you don't forget!

* Support rigorous thinking. You are more likely to come up with a strong
  analysis if you record your thoughts as you go, and continue to reflect
  on them. This also saves you time when you eventually write up your
  analysis to share with others.

* Help others understand your work. It is rare to do data analysis by
  yourself, and you'll often be working as part of a team. A lab notebook
  helps you share not only what you've done, but also why you did it, with
  your colleagues or lab mates.

Much of the good advice about using lab notebooks effectively can also be
translated to analysis notebooks. I've drawn on my own experiences and Colin
Purrington's advice on lab notebooks,
<http://colinpurrington.com/tips/lab-notebooks>, to come up with the following
list of tips:

* Ensure each notebook has a descriptive title, an evocative filename, and a
  first paragraph that briefly describes the aims of the analysis.

* Use the YAML header `date` field to record the date you started working on
  the notebook:

  ```yaml
  date: 2016-08-23
  ```

  Use the ISO 8601 YYYY-MM-DD format so that there's no ambiguity. Use it
  even if you don't normally write dates that way!

* If you spend a lot of time on an analysis idea and it turns out to be a
  dead end, don't delete it! Write up a brief note about why it failed and
  leave it in the notebook. That will help you avoid going down the same
  dead end when you come back to the analysis in the future.

* Generally, you're better off doing data entry outside of R. But if you do
  need a small snippet of data, clearly lay it out using `tibble::tribble()`,
  as sketched in the first example after this list.

* If you discover an error in a data file, never modify it directly, but
  instead write code to correct the value (see the second example after this
  list). Explain why you made the fix.

* Before you finish for the day, make sure you can knit the notebook (after
  clearing caches, if you're using them). That will let you fix any problems
  while the code is still fresh in your mind.

* If you want your code to be reproducible in the long run (i.e. so you can
  come back to run it in a year or so), you'll need to track the versions of
  the packages that your code uses. A rigorous approach is to use
  __packrat__, <http://rstudio.github.io/packrat/>, which stores packages in
  your project directory. A quick and dirty hack is to include a chunk that
  runs `sessionInfo()` (see the third example after this list) --- that won't
  let you easily recreate your packages as they are today, but at least
  you'll know what they were.

* You are going to create many, many, many analysis notebooks over the
  course of your career. How are you going to organise them so you can find
  them again in the future? I recommend storing them in individual projects,
  and coming up with a good naming scheme.

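Here's the promised `tibble::tribble()` sketch for laying out a small data
snippet; the site counts are invented, purely for illustration:

```{r}
library(tibble)

# tribble() is transposed: you enter the data row by row, so small
# snippets stay easy to read and check (these values are made up)
sites <- tribble(
  ~site,    ~date,        ~count,
  "forest", "2016-08-01", 12,
  "meadow", "2016-08-01", 29,
  "forest", "2016-08-15", 7
)
```
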
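And here's the kind of code-based correction meant above, as a sketch: it
assumes a hypothetical `sites.csv` file with one value that was mistyped
during data entry:

```{r}
library(readr)

# The raw file stays untouched on disk; the fix lives in code instead.
sites <- read_csv("sites.csv")

# The 2016-08-15 forest count was entered as 70, but the field
# notebook says 7, so correct it here and keep this note as the why.
sites$count[sites$site == "forest" & sites$date == "2016-08-15"] <- 7
```
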
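Finally, the quick and dirty version-tracking hack is just a chunk like this
at the end of the notebook:

```{r}
# Record the R version and the loaded package versions in the knitted
# output, so you at least know what you were running at the time.
sessionInfo()
```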