Do It Again, R!

class: center, middle, inverse, title-slide

# Do It Again, R!
## Intro to Loops, Functions and Groups
### Jeff Newmiller
### 7 May 2019

---

# Outline

- For Loops
    + Where is my output?
    + Keeping results
- Vectors of inputs and vectors of outputs<sup>1</sup>
    + Vectorized functions, not "Vectorize my function"
    + Appending columns
    + Plotting results: ggplot2
- Expanding options
    + Cartesian join: expand.grid

.footnote[
[1] Inspired by _R for Data Science_ (https://r4ds.had.co.nz)
]

---

# Packages Used

```r
library(dplyr)          # tidyverse
library(ggplot2)        # tidyverse
library(purrr)          # tidyverse
library(microbenchmark) # speed check
library(tidyr)          # tidyverse
```

---

## Sample Problem

Given a relationship between `$x$` and `$y$`:

$$
y = a x^2 + b x + c
$$

we can find which values of `$x$` cause `$y=0$` using the quadratic formula:

$$
x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}
$$

---

# Sample Solution

For now, we will focus on the most positive answer:

```r
A <- 1 ; B <- -2 ; C <- -3
DF1 <- tibble( x = seq( -3, 5, 0.1 )
             , y = A*x^2 + B*x + C )
head( DF1, 4 )
```

```
## # A tibble: 4 x 2
##       x     y
##   <dbl> <dbl>
## 1  -3   12   
## 2  -2.9 11.2 
## 3  -2.8 10.4 
## 4  -2.7  9.69
```

```r
ans <- ( -B + sqrt( B^2 - 4*A*C ) ) / ( 2 * A )
DF1b <- tibble( x = ans, y = 0 )
ans
```

```
## [1] 3
```

---

# Solution Plot

![](DoItAgainR_files/figure-html/unnamed-chunk-2-1.png)

---

# Do it more than once?

```r
input <- read.csv( "data/MyABC1.csv" )
input
```

```
##   Trial A  B  C
## 1     1 1 -2 -3
## 2     2 1 -3 -4
## 3     3 1 -4 -5
```

---

# For Loop - Printing

A common approach used in other languages is to print results as you go along:

```r
for ( i in seq_along( input$A ) ) {
    A <- input$A[ i ]
    B <- input$B[ i ]
    C <- input$C[ i ]
    print( sprintf( "A = %8.4f, B = %8.4f, C = %8.4f, x = %8.4f"
                  , A, B, C
                  , ( -B + sqrt( B^2 - 4*A*C ) )
                    / ( 2 * A ) ) )
}
```

```
## [1] "A =   1.0000, B =  -2.0000, C =  -3.0000, x =   3.0000"
## [1] "A =   1.0000, B =  -3.0000, C =  -4.0000, x =   4.0000"
## [1] "A =   1.0000, B =  -4.0000, C =  -5.0000, x =   5.0000"
```

---

# For Loop - Concatenating

Another common recommendation is to concatenate:

```r
ans <- numeric( 0 )
for ( i in seq_along( input$A ) ) {
    A <- input$A[ i ]
    B <- input$B[ i ]
    C <- input$C[ i ]
    ans <- c( ans, ( -B + sqrt( B^2 - 4*A*C ) ) / ( 2 * A ) )
}
ans
```

```
## [1] 3 4 5
```

Concatenating can be _very_ inefficient if there are many elements of `ans` because there are `$n-1$` shorter versions of `ans` created and discarded along the way.

---

# For Loop - Element Replacement

Allocating a vector large enough to hold all the answers and then replacing one element at a time is noticeably faster.

```r
ans <- numeric( nrow( input ) )
ans
```

```
## [1] 0 0 0
```

```r
for ( i in seq_along( input$A ) ) {
    A <- input$A[ i ] ; B <- input$B[ i ] ; C <- input$C[ i ]
    ans[ i ] <- ( -B + sqrt( B^2 - 4*A*C ) ) / ( 2 * A )
}
ans
```

```
## [1] 3 4 5
```

---

# Combined Presentation

Having your results in a vector of their own is useful, but showing that vector next to the inputs is even more useful:

```r
cbind( input, tibble( ans = ans ) )
```

```
##   Trial A  B  C ans
## 1     1 1 -2 -3   3
## 2     2 1 -3 -4   4
## 3     3 1 -4 -5   5
```

---

# Make a Solution Function

To make a function, write code that uses inputs mentioned in the parentheses, and the last expression in the function will be the answer given by the function:

```r
my_solution <- function( A, B, C ) {
  ( -B + sqrt( B^2 - 4*A*C ) ) / ( 2 * A )
}
my_solution( 1, -2, -3 ) # try it out
```

```
## [1] 3
```

You can also draw on variables from the working environment where you define the function, but it is intentionally hard to make changes to such variables inside a function.

---

# Make a Many Solutions Function

If your solutions generally only worked for one value each for `A`, `B`, and `C` then you might make this function:

```r
my_many_solutions <- function( A, B, C, solution ) {
  ans <- numeric( length( A ) )
  for ( i in seq_along( A ) ) {
    ans[ i ] <- solution( A[ i ], B[ i ], C[ i ] )
  }
  ans
}
my_many_solutions( input$A, input$B, input$C, my_solution )
```

```
## [1] 3 4 5
```

---

# But why Re-Invent the Wheel?

The `pmap_dbl` function takes a data frame and applies a function to each row. The function arguments have to be the same as the column names in the data frame, so we exclude the `Trial` column before giving it to `pmap_dbl`:

```r
library(purrr)
pmap_dbl( input[ , -1 ], my_solution )
```

```
## [1] 3 4 5
```

---

# Vectorized Calculations

In this case, because R arithmetic operations are internally vectorized, the `my_solution` function can directly be given vectors instead of one value at a time, which is much faster:

```r
my_solution( input$A, input$B, input$C )
```

```
## [1] 3 4 5
```

---

# Why Not Always Vectorize?

Some algorithms are hard to implement in vectorized fashion... you may reach your goal sooner if you use a slower but simpler algorithm that is imperfect but gets the job done:

```r
my_slow_solution <- function( A, B, C ) {
  Y <- function( X, A, B, C ) { A * X^2 + B * X + C }
  # only one answer at a time
  uniroot( Y, c( 0, 1e6 ), A = A, B = B, C = C, tol = 1e-8 )$root
}
my_slow_solution( 1, -2, -3 )
```

```
## [1] 3
```

but if you plan to solve many thousands or millions of problems, it may pay off to spend time making a complex algorithm vectorized.

---

# Speed Review

The price of flexibility when using `pmap_dbl` is speed.

```r
options( digits = 1 ) # compact results
library(microbenchmark)
ans <- microbenchmark( 
  loop_slow = my_many_solutions( input$A, input$B, input$C
                               , my_slow_solution )
, pmap_slow = pmap_dbl( input[ , -1 ], my_slow_solution )
, loop_single = my_many_solutions( input$A, input$B, input$C
                                 , my_solution )
, pmap_single = pmap_dbl( input[ , -1 ], my_solution )
, vectorized = my_solution( input$A, input$B, input$C )
)
summary( ans )
```

```
##          expr min  lq mean median  uq  max neval
## 1   loop_slow 221 231  398    294 463 4322   100
## 2   pmap_slow 255 263  401    283 421 3486   100
## 3 loop_single  17  21   30     24  35   77   100
## 4 pmap_single  42  52   90     71 113  570   100
## 5  vectorized  14  17   26     22  32   86   100
```

---

# Expanding Options

What if you want to try out a bunch of possible input combinations without listing all of them in a file? `expand.grid` to the rescue!

```r
options( digits = 3 )
Bvals <- seq( -2, 0, 0.1 )   # 21 values
Cvals <- seq( -4, -2, 0.5 )  # 5 values
DF2 <- expand.grid( A = 1    # all combinations A/B/C
                  , B = Bvals
                  , C = Cvals
                  )
nrow( DF2 ) # 1 * 5 * 21
```

```
## [1] 105
```

```r
head( DF2 )
```

```
##   A    B  C
## 1 1 -2.0 -4
## 2 1 -1.9 -4
## 3 1 -1.8 -4
## 4 1 -1.7 -4
## 5 1 -1.6 -4
## 6 1 -1.5 -4
```

---

# Computing on Combinations

With so many ways to compute a vector of answers given an input data frame, we can choose the one most appropriate:

```r
DF2$ans <- my_solution( DF2$A, DF2$B, DF2$C ) # 105 answers at once!
head( DF2 )
```

```
##   A    B  C  ans
## 1 1 -2.0 -4 3.24
## 2 1 -1.9 -4 3.16
## 3 1 -1.8 -4 3.09
## 4 1 -1.7 -4 3.02
## 5 1 -1.6 -4 2.95
## 6 1 -1.5 -4 2.89
```

Each row now has an answer associated with it. How should we present these results?

---

# Grouping Rows

One way is to treat each separate value of `C` distinctly, but plot `ans` vs. `B`. The usual way to mark a group of rows is to give them all the same value of a factor:

```r
DF2$Cgroup <- factor( DF2$C # there are 105 elements...
                    , levels = Cvals # but only 5 discrete values
                    )
head( DF2 )
```

```
##   A    B  C  ans Cgroup
## 1 1 -2.0 -4 3.24     -4
## 2 1 -1.9 -4 3.16     -4
## 3 1 -1.8 -4 3.09     -4
## 4 1 -1.7 -4 3.02     -4
## 5 1 -1.6 -4 2.95     -4
## 6 1 -1.5 -4 2.89     -4
```

---

# Use `ggplot2` Color

Ggplot will separate the data into groups of records where the `Cgroup` column values are all the same. Note that because `C` is numeric, `ggplot` would assume if we used `C` for color there could be values like 3.7 even though we purposely skipped over that value.

```r
library(ggplot2)
ggplot( DF2, aes( x = B, y = ans, colour = Cgroup ) ) +
  geom_line( size = 1 ) +
  scale_color_viridis_d( name = "C" ) +
  theme_minimal()
```

![](DoItAgainR_files/figure-html/unnamed-chunk-13-1.png)

---

# Output

Sometimes others don't want a "long"-form data frame... the `spread` function is useful for "pivoting" your calculations to a more human-digestible layout:

```r
library(tidyr)
options( digits = 3 )
DF3 <- (   DF2
       %>% select( A, B, Cgroup, ans )
       %>% spread( Cgroup, ans )
       )
head( DF3 )
```

```
##   A    B   -4 -3.5   -3 -2.5   -2
## 1 1 -2.0 3.24 3.12 3.00 2.87 2.73
## 2 1 -1.9 3.16 3.05 2.93 2.79 2.65
## 3 1 -1.8 3.09 2.98 2.85 2.72 2.58
## 4 1 -1.7 3.02 2.90 2.78 2.65 2.50
## 5 1 -1.6 2.95 2.83 2.71 2.57 2.42
## 6 1 -1.5 2.89 2.77 2.64 2.50 2.35
```

The factor version of `C` is discrete, which makes it easier to know which `ans` values will end up in each column.

---

# Conclusion

- For loops are not always bad
    + Most useful for repeating `solution`s that involve a lot of calculations anyway
    + `apply`-type functions are just wrappers for `for` loops with results pre-allocation
- Vectorized code uses all vector-based primitive operations
    + `apply`-type code _looks_ vectorized so easy to read, but not as fast
    + See `?Arithmetic`, and `?cumsum`, and `?rowSums` for example
- Grouping rows by discrete values helps you sort out your results
    + Numeric values are not discrete, but you can fake it by making a factor
- Do check out R4DS (https://r4ds.had.co.nz) for more useful analysis strategies

.footnote[
[1] HTML slides built with the `xaringan` package, available via `install.packages`.
]