Polishing style

2022-03-29 08:45:22 -05:00 · 6be44f9a14
parent 9239e50ccc
commit 6be44f9a14
1 changed files with 75 additions and 28 deletions
--- a/workflow-style.Rmd
+++ b/workflow-style.Rmd
@@ -48,30 +48,38 @@ SHORTFLIGHTS  <- flights |> filter(air_time < 60)
 ```

 As a general rule of thumb, it's better to prefer long, descriptive names that are easy to understand, rather than concise names that are fast to type.
-Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but can be expensive when you come back to old need and need to puzzle out what a cryptic abbreviation means.
+Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.

 If you have a bunch of names for related things, do your best to be consistent.
 It's easy for inconsistencies to arise when you forget a previous convention, so don't feel bad if you have to go back and rename things.
-If you have a bunch of variables that are some variation on a theme you're generally better off giving them a common prefix, rather than a common suffix, because autocomplete works best on the start of a variable.
+In general, if you have a bunch of variables that are variations on a theme, you're better off giving them a common prefix rather than a common suffix, because autocomplete works best on the start of a variable name.

 ## Spaces

-Put spaces on either side of mathematical operators apart from `^` (`+`, `-`, `==`, `<`, ...), and around the assignment operator (`<-`).
+Put spaces on either side of mathematical operators apart from `^` (e.g., `+`, `-`, `==`, `<`, ...), and around the assignment operator (`<-`).
+
+```{r, eval = FALSE}
+# Strive for
+z <- (a + b)^2 / d
+
+# Avoid
+z<-( a + b ) ^ 2/d
+```
+
 Don't put spaces inside or outside parentheses for regular function calls.
 Always put a space after a comma, just like in regular English.

 ```{r, eval = FALSE}
 # Strive for
-(a + b)^2 / d
 mean(x, na.rm = TRUE)

 # Avoid
-( a + b ) ^ 2/d
 mean (x ,na.rm=TRUE)
 ```

 It's OK to add extra spaces if it improves alignment.
 For example, if you're creating multiple variables in `mutate()`, you might want to add spaces so that all the `=` line up.
+This makes it easier to skim the code.

 ```{r, eval = FALSE}
 flights |> 
@@ -84,18 +92,25 @@ flights |>

 ## Pipes

-`|>` should always have a space before it and should typically be followed by a newline.
-If the function has named arguments (like `mutate()` or `summarise()`), always put each argument on a new line.
-If the function doesn't have named arguments (like `select()` or `filter()` keep everything on one line unless it doesn't fit, in which case you should put each argument on its own line.
-
-After the first step of the pipeline, indent each line by two spaces.
-If you're putting arguments on their own line, indent each argument by an extra two spaces.
-Make sure `)` is on its own line, and un-indented to match the horizontal position of the function name.
+`|>` should always have a space before it and should typically be the last thing on a line.
+This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 50,000 ft view by skimming the verbs on the left-hand side.

 ```{r, eval = FALSE}
 # Strive for 
 flights |>  
  filter(!is.na(arr_delay), !is.na(tailnum)) |> 
+  count(dest)
+
+# Avoid
+flights|>filter(!is.na(arr_delay), !is.na(tailnum))|>count(dest)
+```
+
+If the function you're piping into has named arguments (like `mutate()` or `summarise()`), put each argument on a new line.
+If the function doesn't have named arguments (like `select()` or `filter()`), keep everything on one line unless it doesn't fit, in which case you should put each argument on its own line.
+
+```{r, eval = FALSE}
+# Strive for
+flights |>  
  group_by(tailnum) |> 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
@@ -103,14 +118,44 @@ flights |>
  )

 # Avoid
-flights|> filter(!is.na(arr_delay), !is.na(tailnum)) |> 
-  group_by(tailnum) |> summarise(delay = mean(arr_delay, na.rm = TRUE), 
-                                 n = n())
+flights |>
+  group_by(
+    tailnum
+  ) |> 
+  summarise(delay = mean(arr_delay, na.rm = TRUE), n = n())
+```
+
+After the first step of the pipeline, indent each line by two spaces.
+If you're putting each argument on its own line, indent by an extra two spaces.
+Make sure `)` is on its own line, and un-indented to match the horizontal position of the function name.
+
+```{r, eval = FALSE}
+# Strive for 
+flights |>  
+  group_by(tailnum) |> 
+  summarise(
+    delay = mean(arr_delay, na.rm = TRUE),
+    n = n()
+  )
+
+# Avoid
+flights|>
+  group_by(tailnum) |> 
+  summarise(
+             delay = mean(arr_delay, na.rm = TRUE), 
+             n = n()
+           )
+
+flights|>
+  group_by(tailnum) |> 
+  summarise(
+  delay = mean(arr_delay, na.rm = TRUE), 
+  n = n()
+  )
 ```

-This structure makes it easier to add new steps, rearrange existing steps, modify elements within a step, and to get a 50,000 view by skimming the left-hand side.
 It's OK to shirk some of these rules if your pipeline fits easily on one line.
-But it's common for short snippets to grow longer, so you'll usually save time in the long run by starting with all the vertical space you need.
+But in our collective experience, it's common for short snippets to grow longer, so you'll usually save time in the long run by starting with all the vertical space you need.

 ```{r, eval = FALSE}
 # This fits compactly on one line
@@ -124,7 +169,16 @@ df |>
  )
 ```

-The same basic rules apply to ggplot2, just treat `+` the same way as `|>`.
+Finally, be wary of writing very long pipes, say longer than 10-15 lines.
+Try to break them up into smaller sub-tasks, giving each task an informative name.
+The names will help cue the reader in to what's happening and make it easier to check that intermediate results are as expected.
+Whenever you can give something an informative name, you should give it an informative name.
+Don't expect to get it right the first time!
+This means breaking up long pipelines if there are intermediate states that can get good names.
+
+## ggplot2
+
+The same basic rules that apply to the pipe also apply to ggplot2; just treat `+` the same way as `|>`.

 ```{r, eval = FALSE}
 flights |> 
@@ -157,19 +211,12 @@ flights |>
  geom_point()
 ```

-Be wary of writing very long pipes, say longer than 10-15 lines.
-Try to break them up into smaller sub-tasks, giving each task an informative name.
-The names will help cue the reader into what's happening and makes it easier to check that intermediate results are as expected.
-Whenever you can give something an informative name, you should give it an informative name.
-Don't expect to get it right the first time!
-This means breaking up long pipelines if there are intermediate states that can get good names.
-
 ## Organisation

 Use comments to explain the "why" of your code, not the "how" or the "what".
 If you simply describe what your code is doing in prose, you'll have to be careful to update the comment and code in tandem: if you change the code and forget to update the comment, they'll be inconsistent which will lead to confusion when you come back to your code in the future.
 For data analysis code, use comments to explain your overall plan of attack and record important insight as you encounter them.
-There's way to re-capture this knowledge from the code itself.
+There's no way to re-capture this knowledge from the code itself.

 As your scripts get longer, use **sectioning** comments to break up your file into manageable pieces:

@@ -193,10 +240,10 @@ knitr::include_graphics("screenshots/rstudio-nav.png")

 ## Exercises

-1.  Restyle each of the following pipelines following the guidelines above.
+1.  Restyle the following pipelines following the guidelines above.

    ```{r, eval = FALSE}
    flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarise(n=n(),delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)

-
+    flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>0900,sched_arr_time<2000)|>group_by(flight)|>summarise(delay=mean(arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)
    ```