George Ostrouchov (Oak Ridge National Laboratory and University of Tennessee)1
This course covers the use of medium to large computing systems with the R language and other software tools for statistical workflows on large data. It includes an overview of hardware and R-related software for such systems. Statistical topics exercised on these systems include parallel random number generation, bootstrap and cross-validation, and matrix computation for statistical methods. The class will work on IT4I systems throughout the semester. Concepts will include strategies for fast and efficient R code and for parallel implementations using multicore and multinode approaches.
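As a taste of the topics named above, the following is a minimal sketch of a multicore bootstrap with reproducible parallel random number streams. It uses only the `parallel` package that ships with R. The data, the number of replicates, the core count, and the helper name `boot_mean` are assumptions chosen for illustration, not course material.

```r
library(parallel)  # ships with base R

# L'Ecuyer-CMRG lets each forked worker draw from its own
# independent random number stream, so results are reproducible.
RNGkind("L'Ecuyer-CMRG")
set.seed(123)

# Synthetic data, for illustration only.
x <- rnorm(1000, mean = 5, sd = 2)

# Hypothetical helper: one bootstrap replicate of the sample mean.
# The index i is unused; it only drives the iteration.
boot_mean <- function(i, data) mean(sample(data, replace = TRUE))

# mclapply relies on unix fork; on Windows only mc.cores = 1 is supported.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
reps <- unlist(mclapply(seq_len(500), boot_mean, data = x, mc.cores = n_cores))

# Bootstrap estimate of the standard error of the mean.
boot_se <- sd(reps)
```

On a cluster node, `n_cores` would normally be set from the cores allocated by the scheduler rather than hard-coded.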
Many technologies are useful for statistical computing and data science. In this class, we take a narrow and high-level path through these technologies. We learn tools that specifically target working with R on a supercomputer or a generic cluster computer for the purpose of developing code to analyze large data. While the high-level path appears narrow, most other popular technologies are based on the same or similar lower-level concepts that will be discussed.
Supercomputers are unix systems, accessed remotely. Consequently, the first enabler is knowledge of a few unix commands and familiarity with remote access software. This gives access to a uniform platform experience (unix) for the whole class (whether accessing from Windows, Mac, or Linux laptops), except for the initial login access, which can differ.
A second enabler of uniformity is the R language together with git version control, which enable a workflow of editing code locally and synchronizing it with a remote supercomputer. RStudio is an easy way to use both. Git synchronization is also key to software collaboration.
There will be one 90-minute lecture per week, probably on Wednesdays starting at 17:20. There will be weekly exercises to complete on IT4I systems. I plan another scheduled hour to answer questions regarding the exercises. This can be adjusted as the lectures proceed.
The order of concepts and exercises is subject to change as the lectures proceed. I hope to make the lectures as interactive as possible. I would like to work with one or two large data sets in a way that intersects with many lectures and exercises. Some potential data sets are listed at the bottom and I welcome other suggestions.
Introductions, expectations, IT4I accounts, workflow (laptop via git to cluster)
1.1 Exercise: IT4I accounts setup, simple unix, ssh concepts
Overview of parallel hardware
2.1 Exercise: Working over ssh with a single node
Overview of parallel software, interactive vs. batch, scaling concepts
3.1 Exercise: Running interactive or batch, PBS scheduler at IT4I (comment on SLURM)
Timing, benchmarking, profiling R code, and git version control overview
4.1 Exercise: Benchmark R code on your laptop and on a cluster node. Code on laptop, transfer to cluster via git.
Speeding up your serial R code, converting sections to C/C++
5.1 Exercise: Benchmark code examples
Using multicore parallelism, unix fork, multithreaded BLAS, hyperthreading
6.1 Exercise: PBS (SLURM), managing multicore and multinode work in R
MPI, Regression case study: parallelizing random forest multicore and multinode
7.1 Exercise: Generate timings and make scaling graphs of code speedup
Reading data in parallel, CSV, HDF5, ADIOS2
8.1 Exercise: Work with a large data set
Parallel matrix computation via OpenBLAS and ScaLAPACK libraries from R
9.1 Exercise:
Distributed PCA case study: Parallel data ingestion to randomized PCA
10.1 Exercise:
Projects and selected further concepts
11.1 Exercise:
Projects and selected further concepts
12.1 Exercise:
Syllabus text produced from a .Rmd file and rendered to html via Knit in RStudio.