Stat 550
Daniel J. McDonald
Last modified – 03 April 2024
I’ve only used 1 and 3. I mainly use 3.
As far as I know, access for students requires “faculty” support
Command line interface (Terminal on Mac)
(optional) An SFTP client is helpful (e.g., Cyberduck).
Globus Connect. File transfer approved by DRAC
Log in to a system:
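For example, connecting over ssh looks something like this (the username is a placeholder, and the hostname depends on which cluster you were granted access to):
# connect to a DRAC login node; you'll be prompted for your password
ssh yourname@cedar.alliancecan.ca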
Tip
If you’re doing work for school: run it on one of these machines.
Once you connect with ssh:
There are no Applications loaded.
You must tell the system what you want.
The command is module load r or module load sas.
If you find yourself using the same modules all the time:
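The slides don't spell out the command here; one option (an assumption on my part — these clusters use Lmod, which supports saved collections, and the collection name below is made up) is:
module load r sas            # the modules you always want
module save my_defaults      # save them as a named collection
module restore my_defaults   # restore the whole collection at your next login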
My account is def-dajmcdon (that's me; accounts start with def-). Then I would start R and run whatever I want. If it takes more than an hour or needs more than 4GB of memory, it'll quit.
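A sketch of what that interactive request might look like (the time and memory flags are assumptions inferred from the limits just described):
salloc --time=1:00:00 --mem=4G --account=def-dajmcdon
module load r
R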
However, syzygy has little memory and little storage, so it won't handle intensive tasks.
I think your home dir is limited to 1GB
Example
Write an R / python script that does the whole analysis and saves the output.
You need to ask DRAC to run the script for you.
The salloc command we used before requested some resources from the slurm scheduler.
The job-name / output / error fields are for convenience; they produce an output file like jobid60607-60650934.out.
The slurm script is saved as dlbcl-slurm.sh.
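The script itself isn't reproduced here; a minimal sketch of what a submission script like dlbcl-slurm.sh could contain (the resource values and the analysis script name are assumptions, not the original):
#!/bin/bash
#SBATCH --account=def-dajmcdon
#SBATCH --job-name=dlbcl
#SBATCH --output=%x-%j.out   # %x = job name, %j = job id
#SBATCH --time=10:00:00
#SBATCH --mem=32G
module load r
Rscript dlbcl-analysis.R     # hypothetical script that does the whole analysis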
Important
Jobs inherit environment variables. So if you load modules, then submit, your modules are available to run.
On Cedar, jobs cannot run from ~/. They must be run from ~/scratch/ or ~/projects/.
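Putting those two notes together, a typical submission might look like this (the directory and script names are placeholders):
module load r               # loaded modules are inherited by the job
cd ~/scratch/my-project     # on Cedar, submit from scratch or projects, not home
sbatch dlbcl-slurm.sh       # hand the script to the slurm scheduler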
Big jobs (need lots of RAM)
GPU jobs (you want deep learning, I don’t know how to do this)
Other jobs with internal parallelism (I almost never do this)
Embarrassingly parallel jobs (I do this all the time)
R has packages which are good for parallelization (snow, snowfall, Rmpi, parallel).
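For instance, a tiny sketch of in-R parallelization with parallel (the task and core count are made up for illustration):
library(parallel)
# an embarrassingly parallel toy task: repeat a computation 100 times across 4 cores
fits <- mclapply(1:100, function(i) mean(rnorm(1000)), mc.cores = 4)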
Torque script
Torque is a different scheduler. UBC ARC Sockeye uses Torque. Looks much like Slurm.
Here, ClusterPermute.R uses Rmpi to do "parallel lapply".
So I asked for 8 processors on each of 8 nodes.
Torque script
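The original script isn't shown here; a sketch of a Torque/PBS script requesting 8 processors on each of 8 nodes might look like this (the directives and the mpiexec call are assumptions, not the original file):
#!/bin/bash
#PBS -l nodes=8:ppn=8
#PBS -l walltime=10:00:00
#PBS -N cluster-permute
cd $PBS_O_WORKDIR
# Rmpi spawns its own workers, so launch a single R process under MPI
mpiexec -n 1 R CMD BATCH ClusterPermute.R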
Problem
The scheduler has to find 8 nodes with 8 available processors before this job will start.
This often takes a while, sometimes days.
But the computations don't actually need all of those processors at the same time.
{batchtools}
Using R (or python) to parallelize is inefficient when there's a scheduler in the middle.
Better is to actually submit 64 different jobs each requiring 1 node
Then each can get out of the queue whenever a processor becomes available.
But that would seem to require writing 64 different slurm scripts.
{batchtools} does this for you, all in R.
It writes the slurm / torque scripts.
You work in R directly.
It's easy to port across machines / schedulers.
I can test parts of it (or even run it) on my machine without making changes for the cluster.
{batchtools}
Create a directory where all your jobs will live (in subdirectories). Mine is ~/
In that directory, you need a template file. Mine is ~/.batchtools.slurm.tmpl
(next slide)
Create a configuration file which lives in your home directory. You must name it ~/.batchtools.conf.R.
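A minimal ~/.batchtools.conf.R might look like the following (the default resource values are examples; the template path matches the file shown next):
# ~/.batchtools.conf.R
cluster.functions <- batchtools::makeClusterFunctionsSlurm(
  template = "~/.batchtools.slurm.tmpl"
)
# defaults used when submitJobs() doesn't specify resources
default.resources <- list(ncpus = 1, memory = "4G", walltime = "1:00:00")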
~/.batchtools.slurm.tmpl
#!/bin/bash
## Job Resource Interface Definition
##
## ntasks [integer(1)]: Number of required tasks,
## Set larger than 1 if you want to further parallelize
## with MPI within your job.
## ncpus [integer(1)]: Number of required cpus per task,
## Set larger than 1 if you want to further parallelize
## with multicore/parallel within each task.
## walltime [integer(1)]: Walltime for this job, in seconds.
## Must be at least 60 seconds for Slurm to work properly.
## memory [integer(1)]: Memory in megabytes for each cpu.
## Must be at least 100 (when I tried lower values my
## jobs did not start at all).
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.
<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>
#SBATCH --account=def-dajmcdon
#SBATCH --mail-user=daniel@stat.ubc.ca
#SBATCH --mail-type=ALL
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= resources$walltime %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>
## Run R:
## we merge R output with stdout from SLURM, which gets then logged via --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
See the vignette: vignette("batchtools") or the package documentation online.
Create a folder to hold your code. Mine usually contains 2 files: one to set up/run the experiment, one to collect results. Code needed to run the experiment lives in an R package.
Write a script to set up the experiment and submit.
Wait.
Collect your results (sketch below); copy them back to your machine, etc.
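For the collection step, something like this works (these are standard batchtools functions; the registry path is a placeholder):
library(batchtools)
# re-attach the registry created by the setup script
reg <- loadRegistry("path/to/registry", writeable = FALSE)
getStatus()                          # queued / running / done / errored counts
done <- findDone()
results <- reduceResultsList(done)   # the object each job returned
pars <- unwrap(getJobPars(done))     # the experiment parameters for each job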
An “extra” example in a methods paper to appease reviewers
The method is "principal components regression" with sparse principal components.
Got 413 COVID patients; measured "viral load" and gene expression.
9435 differentially expressed genes.
The method needs to form a 10K x 10K matrix multiple times and do an approximate SVD. Requires 32GB memory. Compute time is ~6 hours.
Two tuning parameters: \(\lambda\) and number of PCs
Want to do CV to choose them, then use those values on the whole data and describe the selected genes.
library(batchtools)
reg <- makeExperimentRegistry("spcr-genes", packages = c("tidyverse", "suffpcr"))
x <- readRDS(here::here("suffpcr-covid", "covid_x.rds"))
y <- readRDS(here::here("suffpcr-covid", "covid_y.rds"))
subsample <- function(data, job, ratio, ...) {
  n <- nrow(data$x)
  train <- sample(n, floor(n * ratio))
  test <- setdiff(seq_len(n), train)
  list(test = test, train = train)
}
addProblem("cv", data = list(x = x, y = y), fun = subsample)
addProblem("full", data = list(x = x, y = y))
addAlgorithm(
  "spcr_cv",
  fun = function(job, data, instance, ...) { # args are required
    fit <- suffpcr(
      data$x[instance$train, ], data$y[instance$train],
      lambda_min = 0, lambda_max = 1, ...
    )
    valid_err <- colMeans(
      (
        data$y[instance$test] -
          as.matrix(predict(fit, newdata = data$x[instance$test, ]))
      )^2,
      na.rm = TRUE
    )
    return(list(fit = fit, valid_err = valid_err))
  }
)
addAlgorithm(
  "spcr_full",
  fun = function(job, data, instance, ...) {
    suffpcr(data$x, data$y, lambda_max = 1, lambda_min = 0, ...)
  }
)
## Experimental design
pdes_cv <- list(cv = data.frame(ratio = .75))
pdes_full <- list(full = data.frame())
ades_cv <- list(spcr_cv = data.frame(d = c(3, 5, 15)))
ades_full <- list(spcr_full = data.frame(d = c(3, 5, 15)))
addExperiments(pdes_cv, ades_cv, repls = 5L)
addExperiments(pdes_full, ades_full)
submitJobs(
  findJobs(),
  resources = list(ncpus = 1, walltime = "8:00:00", memory = "32G")
)
End up with 18 jobs.
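Before waiting on results, the design can be sanity-checked from R (standard batchtools helpers; output not shown):
summarizeExperiments(by = c("problem", "algorithm"))  # counts: 15 cv jobs (3 d values x 5 repls) + 3 full fits
waitForJobs()                                         # optionally block until everything finishes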
Take a few very simple models and demonstrate that some choices make a big difference in accuracy.
At each time \(t\), download COVID cases as observed on day \(t\) for a bunch of locations
Estimate a few different models for predicting days \(t+1,\ldots,t+k\)
Store point and interval forecasts.
Do this for \(t\) every week over a year.
fcasters <- list.files(here::here("code", "forecasters"))
for (fcaster in fcasters) source(here::here("code", "forecasters", fcaster))
registry_path <- here::here("data", "forecast-experiments")
source(here::here("code", "common-pars.R"))
# Setup the data ----------------------------------------------------
reg <- makeExperimentRegistry(
  registry_path,
  packages = c("tidyverse", "covidcast"),
  source = c(
    here::here("code", "forecasters", fcasters),
    here::here("code", "common-pars.R")
  )
)
grab_data <- function(data, job, forecast_date, ...) {
  dat <- covidcast_signals(
    data_sources, signals, as_of = forecast_date,
    geo_type = geo_type, start_day = "2020-04-15") %>%
    aggregate_signals(format = "wide")
  names(dat)[3:5] <- c("value", "num", "covariate") # assumes 2 signals
  dat %>%
    filter(!(geo_value %in% drop_geos)) %>%
    group_by(geo_value) %>%
    arrange(time_value)
}
addProblem("covidcast_proper", fun = grab_data, cache = TRUE)
# Algorithm wrappers -----------------------------------------------------
baseline <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, value) %>%
    group_modify(prob_baseline, ...)
}
ar <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, value) %>%
    group_modify(prob_ar, ...)
}
qar <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, value) %>%
    group_modify(quant_ar, ...)
}
gam <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, value) %>%
    group_modify(safe_prob_gam_ar, ...)
}
ar_cov <- function(data, job, instance, ...) {
  instance %>%
    group_modify(prob_ar_cov, ...)
}
joint <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, value) %>%
    joint_ar(...)
}
corrected_ar <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, num) %>%
    rename(value = num) %>%
    corrections_single_signal(cparams) %>%
    group_modify(prob_ar, ...)
}
addAlgorithm("baseline", baseline)
addAlgorithm("ar", ar)
addAlgorithm("qar", qar)
addAlgorithm("gam", gam)
addAlgorithm("ar_cov", ar_cov)
addAlgorithm("joint_ar", joint)
addAlgorithm("corrections", corrected_ar)
# Experimental design -----------------------------------------------------
problem_design <- list(covidcast_proper = data.frame(forecast_date = forecast_dates))
algorithm_design <- list(
  baseline = CJ(
    train_window = train_windows, min_train_window = min(train_windows), ahead = aheads
  ),
  ar = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  ),
  qar = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  ),
  gam = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads, df = gam_df
  ),
  ar_cov = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  ),
  joint_ar = CJ(
    train_window = joint_train_windows, min_train_window = min(joint_train_windows),
    lags = lags_list, ahead = aheads
  ),
  corrections = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  )
)
addExperiments(problem_design, algorithm_design)
ids <- unwrap(getJobPars()) %>%
  select(job.id, forecast_date) %>%
  mutate(chunk = as.integer(as.factor(forecast_date))) %>%
  select(-forecast_date)
## ~13000 jobs, we don't want to submit that many since they run fast
## Chunk them into groups by forecast_date (to download once for the group)
## Results in 68 chunks
submitJobs(
  ids,
  resources = list(ncpus = 1, walltime = "4:00:00", memory = "16G")
)