Annoucement

Acknowledgement

Dr. Hua Zhou’s slides

Advanced R

To gain a deep understanding of how R works, the book Advanced R by Hadley Wickham is a must read. Read now to save numerous hours you might waste in future.

We cover select topics on coding style, benchmarking, profiling, debugging, parallel computing, byte code compiling, Rcpp, and package development.

Style

Benchmark

Sources:

In order to identify performance issue, we need to measure runtime accurately.

system.time

set.seed(280)
x <- runif(1e6)

system.time({sqrt(x)})
##    user  system elapsed 
##   0.002   0.001   0.003
system.time({x ^ 0.5})
##    user  system elapsed 
##   0.011   0.000   0.011
system.time({exp(log(x) / 2)})
##    user  system elapsed 
##   0.008   0.001   0.008

From William Dunlap:

“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.

microbenchmark

library("microbenchmark")
library("ggplot2")

mbm <- microbenchmark(
  sqrt(x),
  x ^ 0.5,
  exp(log(x) / 2),
  times = 100
)
## Warning in microbenchmark(sqrt(x), x^0.5, exp(log(x)/2), times = 100): less
## accurate nanosecond times to avoid potential integer overflows
mbm
## Unit: milliseconds
##           expr      min       lq      mean   median        uq       max neval
##        sqrt(x) 1.387727 1.435061  1.853112 1.462244  1.628950  5.965582   100
##          x^0.5 9.263909 9.319935 10.182796 9.429426 11.173033 17.317211   100
##  exp(log(x)/2) 7.741046 7.789713  8.456798 7.869909  8.403668 22.016508   100

Results from microbenchmark can be nicely plotted in base R or ggplot2.

boxplot(mbm)

autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

bench

The bench package is another tool for microbenchmarking. The output is a tibble:

library(bench)

(lb <- bench::mark(
  sqrt(x),
  x ^ 0.5,
  exp(log(x) / 2)
))
## # A tibble: 3 × 6
##   expression         min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 sqrt(x)         1.08ms   1.46ms      678.    7.63MB    195. 
## 2 x^0.5              9ms   9.38ms      106.    7.63MB     21.3
## 3 exp(log(x)/2)   7.75ms   7.84ms      126.    7.63MB     25.2

To visualize the result

plot(lb)
## Loading required namespace: tidyr

It is colored according to GC (garbage collection) levels.

Profiling

Premature optimization is the root of all evil (or at least most of it) in programming.
-Don Knuth

Sources:

First example

library(profvis)

profvis({
  data(diamonds, package = "ggplot2")

  plot(price ~ carat, data = diamonds)
  m <- lm(price ~ carat, data = diamonds)
  abline(m, col = "red")
})

Example: profiling time

First generate test data:

times <- 4e5
cols <- 150
data <- as.data.frame(x = matrix(rnorm(times * cols, mean = 5), ncol = cols))
data <- cbind(id = paste0("g", seq_len(times)), data)
head(data)
##   id       V1       V2       V3       V4       V5       V6       V7       V8
## 1 g1 5.969774 3.386956 4.358790 6.237727 5.771275 5.174312 5.037446 4.430052
## 2 g2 4.921927 5.633818 5.694171 4.865483 4.153967 5.061355 4.371293 5.463183
## 3 g3 5.800428 5.549560 3.976693 4.171676 4.666550 7.123001 5.098244 5.293764
## 4 g4 7.898909 4.615669 4.377937 4.701138 5.101872 5.295811 4.312929 4.085740
## 5 g5 6.452353 5.880938 5.926650 4.594801 4.482622 5.954435 4.586131 3.444331
## 6 g6 4.469903 6.975048 4.217334 6.039186 3.682259 4.181690 5.154283 5.046682
##         V9      V10      V11      V12      V13      V14      V15      V16
## 1 3.945362 6.991074 6.534173 6.317850 5.068947 5.795827 5.498428 4.983893
## 2 5.313754 5.346877 5.352962 4.297411 3.143680 5.640137 5.002181 2.896327
## 3 4.865347 4.281242 4.907693 4.644739 4.526487 5.236702 2.669363 4.749898
## 4 3.162529 3.530056 3.844779 5.075276 4.674053 4.944213 5.056897 6.646479
## 5 5.776090 3.128314 5.455446 5.906633 6.148552 4.399970 4.660836 3.235290
## 6 6.352984 6.289363 5.674194 3.793744 3.325485 5.174891 4.695926 4.282914
##        V17      V18      V19      V20      V21      V22      V23      V24
## 1 4.021139 5.065696 6.310364 5.490599 5.055548 4.238561 5.797920 4.650697
## 2 4.800560 4.762682 3.607615 4.842496 6.059028 6.391931 7.143750 4.092829
## 3 5.418325 4.843147 4.703208 5.158282 4.913269 5.997505 6.404710 5.901887
## 4 4.596020 7.193654 5.363815 5.828282 4.825783 4.223200 4.941217 5.421718
## 5 5.168906 3.363993 3.267161 6.299007 3.990615 4.938966 6.021251 4.856988
## 6 3.766488 5.115333 3.881455 5.711343 5.260267 6.811890 6.109790 5.321400
##        V25      V26      V27      V28      V29      V30      V31      V32
## 1 6.232738 5.156774 3.934398 3.779090 4.736236 3.174204 6.293351 5.757692
## 2 5.430874 4.537045 6.011686 4.729784 4.512812 5.574587 5.457530 4.388429
## 3 5.027204 4.896893 4.960644 4.795403 4.603998 4.743484 4.563350 3.839632
## 4 6.068461 4.881969 5.554265 4.336586 6.011450 5.118163 4.219831 5.042094
## 5 5.456461 6.307981 4.052145 5.367455 5.202699 4.090536 4.852753 3.555741
## 6 6.915472 5.672714 3.100835 7.315498 5.420842 5.900181 4.055118 4.149567
##        V33      V34      V35      V36      V37      V38      V39      V40
## 1 4.955279 6.137107 3.573192 4.722608 6.093101 5.095926 4.211594 5.593647
## 2 4.955159 4.581849 5.322105 5.565669 5.911121 5.071331 3.647852 6.422367
## 3 4.702261 5.063408 5.118905 2.957961 4.564707 2.932893 4.673558 4.913145
## 4 4.108503 3.184510 4.287867 5.577840 5.290784 6.163332 5.269505 3.863132
## 5 6.238919 4.414152 4.087591 4.255676 4.117047 5.126822 5.187829 5.426037
## 6 5.278965 4.367095 2.899642 4.237844 6.699517 5.530437 5.831208 3.738705
##        V41      V42      V43      V44      V45      V46      V47      V48
## 1 4.753439 5.169350 4.883482 3.657573 5.669892 4.241260 4.851962 5.456667
## 2 6.187101 4.716270 5.597624 4.413653 4.230248 4.871157 5.293433 6.435590
## 3 5.438515 4.850280 4.626902 4.459188 5.827907 5.663024 3.457956 5.721593
## 4 5.892379 5.762037 5.179667 3.942267 5.778172 5.932535 5.524665 4.771691
## 5 4.901255 3.280638 5.652811 4.206320 5.858897 6.658522 5.633509 4.963152
## 6 3.961880 4.871338 5.252205 4.892354 5.132825 5.681368 7.427848 3.656597
##        V49      V50      V51      V52      V53      V54      V55      V56
## 1 3.641526 5.823936 6.202179 4.682491 6.020682 2.888328 4.884376 5.080382
## 2 5.709755 5.460538 4.667820 4.270379 4.517615 5.204095 4.635607 5.231288
## 3 4.349273 5.278583 3.925600 5.579634 4.434822 5.277278 5.749743 5.783613
## 4 6.172233 6.314465 6.106169 5.754318 5.706827 6.369271 5.991514 3.357682
## 5 4.258391 6.554396 5.381009 4.826608 5.104414 5.925495 6.495626 3.528083
## 6 5.196717 5.721946 6.507628 4.720609 4.800197 4.981737 4.804950 4.029324
##        V57      V58      V59      V60      V61      V62      V63      V64
## 1 5.731241 3.716672 3.728804 4.225844 5.893676 5.236128 3.998496 5.052559
## 2 3.800630 5.785667 3.602729 5.613489 4.841687 4.655119 5.548548 3.668189
## 3 4.409653 4.344321 4.891546 4.541498 5.192576 5.162769 5.449993 4.251714
## 4 5.511138 4.634418 4.294338 5.769028 5.738561 4.282375 5.683305 3.788938
## 5 5.250808 6.254993 7.384261 5.711009 4.899243 4.683286 5.427239 4.776874
## 6 4.003427 4.317158 5.344575 7.102476 3.520562 3.071539 5.494856 6.203147
##        V65      V66      V67      V68      V69      V70      V71      V72
## 1 2.297626 5.044528 4.422164 4.317439 5.039205 5.306640 4.291307 5.752981
## 2 5.042682 5.209917 6.233088 3.673400 5.550484 4.341351 5.621920 5.107727
## 3 4.954122 3.964389 3.632149 5.902254 5.180800 5.905438 6.001834 4.165093
## 4 4.699785 5.317041 4.387484 2.815468 5.945561 4.348289 4.949502 3.373685
## 5 5.688282 4.922704 5.282254 5.433178 7.002321 5.236936 6.255325 3.910518
## 6 6.325418 4.847684 7.013564 5.339205 3.909243 5.400927 5.143125 3.502081
##        V73      V74      V75      V76      V77      V78      V79      V80
## 1 4.452937 4.556165 4.878821 5.494998 4.525805 5.598838 5.510052 4.335159
## 2 5.582111 5.070413 3.553271 3.302500 4.883199 5.669165 6.934187 5.756974
## 3 7.069034 5.077702 4.823577 3.950463 2.968174 2.981413 5.337635 5.745397
## 4 3.325205 5.391997 5.576931 4.869578 5.218397 3.661650 4.783694 7.297424
## 5 4.194588 5.936640 5.947325 6.494725 6.002182 4.566777 3.771062 5.382190
## 6 3.909446 4.450924 4.379779 4.845526 5.499161 2.473238 4.043085 3.965896
##        V81      V82      V83      V84      V85      V86      V87      V88
## 1 3.841143 4.854982 5.862762 6.055590 5.570398 5.605576 5.207664 4.721803
## 2 6.290887 3.717678 5.352763 4.009332 4.243106 6.890147 6.300223 6.049974
## 3 4.865970 3.963592 4.502974 4.874401 4.185517 4.027433 4.543304 4.727077
## 4 5.910201 6.051057 3.789773 4.745911 4.428981 3.917806 4.718456 4.085395
## 5 4.958616 2.225116 5.862221 4.738276 5.119915 3.244853 5.478629 5.415818
## 6 3.419392 5.939352 4.620255 3.502446 7.679182 7.079275 4.400633 5.172185
##        V89      V90      V91      V92      V93      V94      V95      V96
## 1 4.638047 4.648935 5.864618 4.781209 4.581879 4.869569 4.365018 3.933875
## 2 5.293047 4.015187 6.509665 4.338551 4.430226 4.618964 4.688742 5.389455
## 3 4.302911 4.715384 7.099666 4.696381 4.224483 3.484758 4.979978 5.162419
## 4 6.146197 5.816736 2.364329 5.978111 3.916872 5.119176 6.135774 6.309547
## 5 7.350988 5.664298 5.541353 6.000726 5.227135 5.361953 3.361382 4.935839
## 6 5.316849 5.509925 4.510909 3.316033 5.467789 5.780469 4.307894 5.820533
##        V97      V98      V99     V100     V101     V102     V103     V104
## 1 4.926531 5.451521 5.017248 5.636656 4.776153 6.359240 5.665790 6.934336
## 2 4.288900 5.201093 5.817945 5.068234 5.083228 6.793019 5.144909 3.751889
## 3 4.980340 3.123095 4.738060 4.542765 4.638650 5.814557 7.978221 5.774194
## 4 4.931209 4.736936 3.554819 3.171603 4.795017 5.165080 4.827155 4.330540
## 5 5.301755 4.509777 5.518396 4.667021 5.465201 5.009373 5.611512 5.683406
## 6 3.036929 5.952751 3.954832 5.889513 5.194754 4.391003 5.226084 4.291935
##       V105     V106     V107     V108     V109     V110     V111     V112
## 1 4.916882 3.682590 3.553191 6.178398 6.588264 3.722993 4.291036 6.209619
## 2 5.029695 4.816148 3.061924 4.118335 5.784075 5.031129 4.164354 6.112392
## 3 4.389153 4.837424 4.265490 5.130126 5.268602 5.803634 5.390342 6.347929
## 4 4.128308 5.855140 4.941786 4.834643 4.263134 3.589032 5.405492 4.955846
## 5 6.508588 4.252722 4.999025 4.962796 5.694972 4.879281 4.167013 4.673221
## 6 5.436822 4.295753 3.682477 4.373195 5.968806 4.794256 4.186473 4.809896
##       V113     V114     V115     V116     V117     V118     V119     V120
## 1 3.578261 6.869507 2.238355 4.383842 4.417523 6.048996 4.675138 5.842443
## 2 5.941705 4.701242 3.920104 7.204458 4.485306 7.334596 5.865684 4.203705
## 3 7.672582 3.462561 5.254552 4.463835 4.016373 6.311201 4.517438 4.700522
## 4 4.916321 4.609474 5.251307 4.959063 3.014013 6.210320 4.482665 5.731177
## 5 4.497916 6.566364 3.930197 4.736816 5.845934 6.098909 4.571074 6.489460
## 6 4.757196 5.410635 5.014765 5.003775 4.754222 4.567107 3.474825 6.064451
##       V121     V122     V123     V124     V125     V126     V127     V128
## 1 5.167029 5.750582 4.907758 5.865223 5.261798 4.956968 5.643643 5.812596
## 2 3.606904 5.627711 5.587762 5.654150 5.269067 4.922916 4.823450 5.121979
## 3 4.729896 4.529697 5.358662 5.112413 4.884113 6.012762 6.285833 4.477839
## 4 3.996006 6.642559 4.243635 6.820607 6.025009 4.956143 5.423347 5.900742
## 5 4.710165 5.496734 4.419287 4.525932 4.282508 2.885029 5.337152 4.868927
## 6 5.758978 3.747149 4.523980 5.090233 4.284195 5.217521 5.638891 3.792649
##       V129     V130     V131     V132     V133     V134     V135     V136
## 1 5.078079 6.353728 4.885178 5.647268 5.396727 4.887594 3.792632 5.802780
## 2 5.395293 3.354223 5.184184 3.709929 4.930617 4.181895 5.545752 6.747582
## 3 3.045543 4.332591 3.238270 5.711071 5.764195 4.717790 4.795234 5.768002
## 4 6.472146 4.903896 3.245853 5.172849 5.101201 4.803509 6.542996 5.211635
## 5 5.371391 4.888848 4.789275 4.892961 7.136214 6.075162 5.259250 4.575743
## 6 7.184338 5.835156 4.187135 5.513385 2.575642 5.078832 4.306050 5.178530
##       V137     V138     V139     V140     V141     V142     V143     V144
## 1 3.985640 5.488433 5.278389 6.098580 4.490176 4.915071 4.637860 6.020174
## 2 5.242106 6.356912 4.750170 5.434490 5.296510 5.959623 5.673556 2.154634
## 3 5.869586 5.333952 4.018724 7.537137 2.985009 6.095209 4.210550 4.409794
## 4 6.129498 4.581313 5.013140 4.977602 5.929854 5.276177 4.564731 4.240755
## 5 4.704665 4.872247 5.141363 6.538962 3.871012 5.762745 5.553001 5.418704
## 6 4.919545 6.425867 3.329013 4.154823 3.671133 6.976441 5.135457 4.695289
##       V145     V146     V147     V148     V149     V150
## 1 3.860709 4.360477 4.848465 5.005363 6.678806 3.734484
## 2 6.513392 3.879315 4.890934 5.383120 5.062598 6.181172
## 3 5.687756 7.131847 6.448625 3.967368 5.851488 5.865080
## 4 3.590409 5.770194 5.654670 6.992913 6.333515 5.449932
## 5 4.207660 5.257442 4.447403 3.551037 4.092570 5.093482
## 6 5.825259 4.921798 3.497042 3.861398 4.491844 5.183752

Original code for centering columns of a dataframe:

profvis({
  # Store in another variable for this run
  data1 <- data
  
  # Get column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  
  # Subtract mean from each column
  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <-
      data1[, names(data1) != "id"][, i] - means[i]
  }
})

Profile apply vs colMeans vs lapply vs vapply:

profvis({
  data1 <- data
  # Four different ways of getting column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  means <- colMeans(data1[, names(data1) != "id"])
  means <- lapply(data1[, names(data1) != "id"], mean)
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
})

We decide to use vapply:

profvis({
  data1 <- data
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))

  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
  }
})

Calculate mean and center in one pass:

profvis({
 data1 <- data
 
 # Given a column, normalize values and return them
 col_norm <- function(col) {
   col - mean(col)
 }
 
 # Apply the normalizer function over all columns except id
 data1[, names(data1) != "id"] <-
   lapply(data1[, names(data1) != "id"], col_norm)
})

Example: profiling memory

Original code for cumulative sums:

data <- data.frame(value = runif(1e5))
profvis({
  data$sum[1] <- data$value[1]
  for (i in seq(2, nrow(data))) {
    data$sum[i] <- data$sum[i-1] + data$value[i]
  }
})

Write a function to avoid expensive indexing by $:

profvis({
  csum <- function(x) {
    if (length(x) < 2) return(x)

    sum <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i-1] + x[i]
    }
    sum
  }
  data$sum <- csum(data$value)
})

Pre-allocate vector:

data <- data.frame(value = runif(1e6))
profvis({
  csum2 <- function(x) {
    if (length(x) < 2) return(x)

    sum <- numeric(length(x))  # Preallocate
    sum[1] <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i-1] + x[i]
    }
    sum
  }
  data$sum <- csum2(data$value)
})

Advice

Modularize big projects into small functions. Profile functions as early and as frequently as possible.

Debugging

Learning sources: