Cluster computing (at UBC)

Stat 550

Daniel J. McDonald

Last modified – 03 April 2024


UBC HPC

3 potentially useful systems:

  1. Department VM
  2. UBC ARC Sockeye
  3. Digital Research Alliance of Canada

I’ve only used 1 and 3. I mainly use 3.

Accessing

As far as I know, access for students requires “faculty” support.

  1. Email The/Binh.
  2. It’s possible you can get access without a faculty PI.
  3. Email your advisor to ask for an account.

The rest of this will focus on option 3, DRAC.

Prerequisites

  1. A command-line interface (Terminal on a Mac)

  2. (Optional) An FTP client is helpful (e.g., Cyberduck)

  3. Globus Connect: the file-transfer tool approved by DRAC

Useful CL commands

cd ~/path/to/directory # change directory

cp file/to/copy.txt copied/to/copy1.txt # copy a file

rm file/to/delete.txt # delete a file

rm -r dir/to/delete/ # delete a directory and its contents (recursive)

ls -a # list all files (including hidden ones)

How to connect

Login to a system:

ssh dajmcdon@cedar.alliancecan.ca
  • Upon login, you’re on a “head” or “login” node.
  • Jobs > 30min will be killed.
  • You can continuously run short interactive jobs.
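Not specific to DRAC, but a convenience: an entry in your SSH config file lets you type ssh cedar instead of the full address (standard OpenSSH behaviour; adjust the username):

# ~/.ssh/config on your laptop
Host cedar
    HostName cedar.alliancecan.ca
    User dajmcdon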

Rule 1

Tip

If you’re doing work for school: run it on one of these machines.

  • Yes, there is overhead to push data over and pull results back (an example of transferring files follows this list).
  • But DRAC/Sockeye is much faster than your machine.
  • And this won’t lock up your laptop for 4 hours while you run the job.
  • It’s also a good experience.
  • You can log out and leave the job running. Just log back in to see if it’s done (you should always have some idea how long it will take).
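For pushing data over and pulling results back, scp or rsync work from the command line (the paths here are hypothetical; Globus is the approved tool for very large transfers):

scp data/mydata.rds dajmcdon@cedar.alliancecan.ca:scratch/project/ # copy one file over

rsync -avz dajmcdon@cedar.alliancecan.ca:scratch/project/results/ results/ # pull a results directory back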

Modules

  • Once you connect with ssh:

  • There are no applications loaded.

  • You must tell the system what you want.

  • The command is module load r or module load sas

  • If you find yourself using the same modules all the time:

module load StdEnv/2023 r gurobi python # stuff I use

module save my_modules # save loaded modules

module restore my_modules # on login, load the usual set
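To find out what’s available to load in the first place, the standard Lmod commands work (module names and versions vary by cluster):

module spider r # search for available R modules and versions

module avail # list modules compatible with the current environment

module list # show what is currently loaded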

Running something interactively

  1. Login
  2. Load modules
  3. Request interactive compute
salloc --time=1:0:0 --ntasks=1 --account=def-dajmcdon --mem-per-cpu=4096M
# allocate 1 hour on 1 cpu with 4GB of RAM
  • --account=def-dajmcdon names the allocation to charge (that’s mine; default accounts start with def-)

Then I would start R

R

And run whatever I want. If it takes more than an hour or needs more than 4GB of memory, it’ll quit.

Interactive jobs

  • Once started they’ll just go
  • You can do whatever else you want on your machine
  • But you can’t close the connection (if you do, the job dies)
  • So don’t close your laptop and walk away
  • This is not typically the best use of this resource.
  • For light interactive work, syzygy is likely better.

Although syzygy has little memory and little storage, so it won’t handle intensive tasks.

I think your home directory there is limited to 1GB.

Big memory jobs

  • It’s possible to do this interactively, but it’s discouraged.

Example

  • Neuroscience project
  • Dataset is about 10GB
  • Peak memory usage during analysis is about 24GB
  • Can’t do this on my computer
  • Want to offload onto DRAC
  1. Write an R / Python script that does the whole analysis and saves the output (a minimal sketch follows below).

  2. You need to ask DRAC to run the script for you.
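A sketch of what such a script might look like (file names and the analysis function are hypothetical; the point is that it runs start to finish with no interaction and saves everything you will need later):

# analysis.R -- hypothetical batch version of the neuroscience analysis
library(tidyverse) # plus whatever the analysis actually needs
dat <- readRDS("data/neuro-data.rds") # the ~10GB dataset
res <- run_analysis(dat) # placeholder for the real computation
saveRDS(res, "output/results.rds") # save output to collect later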

The scheduler

  • You can log in to DRAC and “do stuff”
  • But resources are limited.
  • There’s a process that determines who gets resources when.
  • Technically the salloc command we used before requested some resources.
  • It may “sit” until the resources you want are available, but probably not long.
  • Anything else has to go through the scheduler.
  • DRAC uses the Slurm scheduler.

Example script

#!/bin/bash

#SBATCH --account=def-dajmcdon
#SBATCH --job-name=dlbcl-suffpcr
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.out
#SBATCH --time=10:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=32G

Rscript -e 'source("dlbcl-nocv.R")'
  • This asks for 10 hours of compute time with 32GB of memory.
  • The job-name / output / error fields are for convenience (%x expands to the job name and %j to the job ID, so the log here would be named like dlbcl-suffpcr-60650934.out).
  • If unspecified, I’ll end up with files named things like jobid60607-60650934.out

Submitting and other useful commands

  • Suppose that slurm script is saved as dlbcl-slurm.sh
sbatch dlbcl-slurm.sh # submit the job to the scheduler

squeue -u $USER # show status of your jobs ($USER is an env variable)

scancel -u $USER # cancel all your jobs

scancel -t PENDING -u $USER # cancel all your pending jobs
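Two more I find useful after a job finishes (standard Slurm tools, available on the Alliance clusters; the job ID below is made up):

seff 60650934 # summary of CPU and memory efficiency for a completed job

sacct -j 60650934 # accounting record: state, elapsed time, memory used, etc.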

Important

  1. Jobs inherit environment variables. So if you load modules, then submit, your modules are available to run.

  2. On Cedar, jobs cannot run from ~/. They must be run from ~/scratch/ or ~/projects/.

Really big jobs

Types of jobs

  1. Big jobs (need lots of RAM)

  2. GPU jobs (you want deep learning, I don’t know how to do this)

  3. Other jobs with internal parallelism (I almost never do this)

  4. Embarrassingly parallel jobs (I do this all the time)

Simple parallelization

  • Most of my major computing needs are “embarrassingly parallel”
  • I want to run a few algorithms on a bunch of different simulated datasets under different parameter configurations.
  • Perhaps run the algos on some real data too.
  • R has packages which are good for parallelization (snow, snowfall, Rmpi, parallel)
  • This is how I originally learned to do parallel computing. But these packages are not good for the cluster.
  • They’re fine for your machine (a quick sketch follows this list), but we’ve already decided we’re not going to do that anymore.
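For reference, local parallelism with the built-in parallel package looks roughly like this (a minimal sketch; mclapply forks processes, so on Windows you would use parLapply instead):

library(parallel)
# run the same toy simulation at 100 parameter values on 4 local cores
results <- mclapply(1:100, function(i) {
  set.seed(i)
  mean(rnorm(1e6, mean = i / 10)) # stand-in for a real simulation
}, mc.cores = 4)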

Example of the bad parallelism

Torque script

#!/bin/bash  
#PBS -l nodes=8:ppn=8,walltime=200:00:00
#PBS -m abe
#PBS -N ClusterPermute 
#PBS -j oe 
mpirun -np 64 -machinefile $PBS_NODEFILE R CMD BATCH ClusterPermute.R
  • Torque is a different scheduler (UBC ARC Sockeye uses it). It looks much like Slurm.

  • Here, ClusterPermute.R uses Rmpi to do a “parallel lapply”.

  • So I asked for 8 processors on each of 8 nodes.

Problem

  • The scheduler has to find 8 nodes with 8 available processors before this job will start.

  • This often takes a while, sometimes days.

  • But the 64 tasks don’t actually need to run at the same time.

{batchtools}

  • Using R (or python) to parallelize is inefficient when there’s a scheduler in the middle.

  • Better is to actually submit 64 different jobs each requiring 1 node

  • Then each can get out of the queue whenever a processor becomes available.

  • But that would seem to require writing 64 different Slurm scripts

  • {batchtools} does this for you, all in R

    1. It automates writing and submitting Slurm / Torque scripts.
    2. It automatically stores output, and makes it easy to collect.
    3. It generates lots of jobs.
    4. All this from R directly.

It’s easy to port across machines / schedulers.

I can test parts of it (or even run it) on my machine without making changes for the cluster (see the local configuration example below).

Setup {batchtools}

  1. Create a directory where all your jobs will live (in subdirectories). Mine is ~/

  2. In that directory, you need a template file. Mine is ~/.batchtools.slurm.tmpl (shown below).

  3. Create a configuration file which lives in your home directory. You must name it ~/.batchtools.conf.R.

# ~/.batchtools.conf.R
cluster.functions <- makeClusterFunctionsSlurm()
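On my own machine, the same registry code runs unchanged; only the configuration file differs (makeClusterFunctionsMulticore is one of batchtools’ built-in local backends; the ncpus value is just an example):

# ~/.batchtools.conf.R on a laptop: run jobs on local cores instead of Slurm
cluster.functions <- makeClusterFunctionsMulticore(ncpus = 4)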

~/.batchtools.slurm.tmpl

#!/bin/bash

## Job Resource Interface Definition
##
## ntasks [integer(1)]:       Number of required tasks,
##                            Set larger than 1 if you want to further parallelize
##                            with MPI within your job.
## ncpus [integer(1)]:        Number of required cpus per task,
##                            Set larger than 1 if you want to further parallelize
##                            with multicore/parallel within each task.
## walltime [integer(1)]:     Walltime for this job, in seconds.
##                            Must be at least 60 seconds for Slurm to work properly.
## memory   [integer(1)]:     Memory in megabytes for each cpu.
##                            Must be at least 100 (when I tried lower values my
##                            jobs did not start at all).
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.

<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>

#SBATCH --account=def-dajmcdon
#SBATCH --mail-user=daniel@stat.ubc.ca
#SBATCH --mail-type=ALL
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= resources$walltime %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>

## Run R:
## we merge R output with stdout from SLURM, which gets then logged via --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

When I’m ready to run, I’ll call something like:

batchtools::submitJobs(
  job.ids, 
  resources = list(ncpus=1, walltime="24:00:00", memory="32G")
)

Workflow

See the vignette (vignette("batchtools")) or the package website.

  1. Create a folder to hold your code. Mine usually contains 2 files, one to set up/run the experiment, one to collect results. Code needed to run the experiment lives in an R package.

  2. Write a script to setup the experiment and submit.

  3. Wait.

  4. Collect your results. Copy back to your machine, etc. (a sketch of the collection step follows below).
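The collection step usually amounts to something like this (a sketch using standard batchtools functions; the registry path is hypothetical and the reduce step depends on what each job returns):

library(batchtools)
reg <- loadRegistry("~/scratch/spcr-genes", writeable = FALSE) # hypothetical registry path
done <- findDone() # jobs that finished successfully
pars <- unwrap(getJobPars(done)) # their parameter settings
results <- reduceResultsList(done) # whatever each job returned, as a list
# ...then join pars with results, summarise, and copy the summary back home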

Do it

Example 1: Use genetics data to predict viral load

  • An “extra” example in a methods paper to appease reviewers

  • Method is:

    1. Apply a special version of PCA to a big (wide) data set.
    2. Do OLS using the top few PCs.
  • This is “principal components regression” with sparse principal components.

  • Got 413 COVID patients, measured “viral load” and gene expression

  • 9435 differentially expressed genes.

  • The method needs to form a 10K x 10K matrix multiple times and do an approximate SVD. Requires 32GB memory. Compute time is ~6 hours.

  • Two tuning parameters: \(\lambda\) and number of PCs

  • Want to do CV to choose them, then use those values on the whole data and describe the selected genes.

Example 1: Use genetics data to predict viral load

library(batchtools)

reg <- makeExperimentRegistry("spcr-genes", packages = c("tidyverse", "suffpcr"))
x <- readRDS(here::here("suffpcr-covid", "covid_x.rds"))
y <- readRDS(here::here("suffpcr-covid", "covid_y.rds"))

subsample <- function(data, job, ratio, ...) {
  n <- nrow(data$x)
  train <- sample(n, floor(n * ratio))
  test <- setdiff(seq_len(n), train)
  list(test = test, train = train)
}

addProblem("cv", data = list(x = x, y = y), fun = subsample)
addProblem("full", data = list(x = x, y = y))

addAlgorithm(
  "spcr_cv",
  fun = function(job, data, instance, ...) { # args are required
    fit <- suffpcr(
      data$x[instance$train, ], data$y[instance$train], 
      lambda_min = 0, lambda_max = 1, ...
    )
    valid_err <- colMeans(
      (
        data$y[instance$test] - 
         as.matrix(predict(fit, newdata = data$x[instance$test, ]))
      )^2,
      na.rm = TRUE
    )
    return(list(fit = fit, valid_err = valid_err))
  }
)

addAlgorithm(
  "spcr_full",
  fun = function(job, data, instance, ...) {
    suffpcr(data$x, data$y, lambda_max = 1, lambda_min = 0, ...)
  }
)

## Experimental design
pdes_cv <- list(cv = data.frame(ratio = .75))
pdes_full <- list(full = data.frame())
ades_cv <- list(spcr_cv = data.frame(d = c(3, 5, 15)))
ades_full <- list(spcr_full = data.frame(d = c(3, 5, 15)))

addExperiments(pdes_cv, ades_cv, repls = 5L)
addExperiments(pdes_full, ades_full)

submitJobs(
  findJobs(), 
  resources = list(ncpus = 1, walltime = "8:00:00", memory = "32G")
)

End up with 18 jobs (3 values of d × 5 replications for CV, plus 3 full-data fits).

Example 2: Predicting future COVID cases

  • Take a few very simple models and demonstrate that some choices make a big difference in accuracy.

  • At each time \(t\), download COVID cases as observed on day \(t\) for a bunch of locations

  • Estimate a few different models for predicting days \(t+1,\ldots,t+k\)

  • Store point and interval forecasts.

  • Do this for \(t\) every week over a year.

Example 2: Predicting future COVID cases

fcasters <- list.files(here::here("code", "forecasters"))
for (fcaster in fcasters) source(here::here("code", "forecasters", fcaster))
registry_path <- here::here("data", "forecast-experiments")
source(here::here("code", "common-pars.R"))

# Setup the data ----------------------------------------------------
reg <- makeExperimentRegistry(
  registry_path,
  packages = c("tidyverse", "covidcast"),
  source = c(
    here::here("code", "forecasters", fcasters), 
    here::here("code", "common-pars.R")
  )
)

grab_data <- function(data, job, forecast_date, ...) {
  dat <- covidcast_signals(
    data_sources, signals, as_of = forecast_date, 
    geo_type = geo_type, start_day = "2020-04-15") %>% 
    aggregate_signals(format = "wide") 
  names(dat)[3:5] <- c("value", "num", "covariate") # assumes 2 signals
  dat %>% 
    filter(!(geo_value %in% drop_geos)) %>% 
    group_by(geo_value) %>% 
    arrange(time_value)
}
addProblem("covidcast_proper", fun = grab_data, cache = TRUE)

# Algorithm wrappers -----------------------------------------------------
baseline <- function(data, job, instance, ...) {
  instance %>% 
    dplyr::select(geo_value, value) %>% 
    group_modify(prob_baseline, ...)
}
ar <- function(data, job, instance, ...) {
  instance %>% 
    dplyr::select(geo_value, time_value, value) %>% 
    group_modify(prob_ar, ...)
}
qar <- function(data, job, instance, ...) {
  instance %>% 
    dplyr::select(geo_value, time_value, value) %>% 
    group_modify(quant_ar, ...)
}
gam <- function(data, job, instance, ...) {
  instance %>% 
    dplyr::select(geo_value, time_value, value) %>%
    group_modify(safe_prob_gam_ar, ...)
}
ar_cov <- function(data, job, instance, ...) {
  instance %>% 
    group_modify(prob_ar_cov, ...)
}
joint <- function(data, job, instance, ...) {
  instance %>% 
    dplyr::select(geo_value, time_value, value) %>% 
    joint_ar(...)
}
corrected_ar <- function(data, job, instance, ...) {
  instance %>% 
    dplyr::select(geo_value, time_value, num) %>% 
    rename(value = num) %>% 
    corrections_single_signal(cparams) %>% 
    group_modify(prob_ar, ...)
}

addAlgorithm("baseline", baseline)
addAlgorithm("ar", ar)
addAlgorithm("qar", qar)
addAlgorithm("gam", gam)
addAlgorithm("ar_cov", ar_cov)
addAlgorithm("joint_ar", joint)
addAlgorithm("corrections", corrected_ar)

# Experimental design -----------------------------------------------------
problem_design <- list(covidcast_proper = data.frame(forecast_date = forecast_dates))
algorithm_design <- list(
  baseline = CJ(
    train_window = train_windows, min_train_window = min(train_windows), ahead = aheads
  ),
  ar = CJ(
    train_window = train_windows, min_train_window = min(train_windows), 
    lags = lags_list, ahead = aheads
  ),
  qar = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  ),
  gam = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads, df = gam_df
  ),
  ar_cov = CJ(
    train_window = train_windows, min_train_window = min(train_windows), 
    lags = lags_list, ahead = aheads
  ),
  joint_ar = CJ(
    train_window = joint_train_windows, min_train_window = min(joint_train_windows), 
    lags = lags_list, ahead = aheads
  ),
  corrections = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  )
)

addExperiments(problem_design, algorithm_design)
ids <- unwrap(getJobPars()) %>% 
  select(job.id, forecast_date) %>% 
  mutate(chunk = as.integer(as.factor(forecast_date))) %>% 
  select(-forecast_date)

## ~13000 jobs, we don't want to submit that many since they run fast
## Chunk them into groups by forecast_date (to download once for the group)
## Results in 68 chunks

submitJobs(
  ids, 
  resources = list(ncpus = 1, walltime = "4:00:00", memory = "16G")
)

Takeaways

Benefits of this workflow:

  • Don’t lock up your computer
  • Stuff runs much faster
  • Can easily scale up to many jobs
  • Logs are stored for debugging
  • Forces you to think about the design
  • No extra work to store results (batchtools handles it)
  • Easy to add more experiments later, adjust parameters, etc.

Costs:

  • I only know how to do this in R
  • Overhead of moving between machines
  • Some headaches to understand the syntax