Stat 550
Daniel J. McDonald
Last modified – 03 April 2024
I’ve only used 1 and 3. I mainly use 3.
As far as I know, access for students requires “faculty” support
Command line interface (Terminal on Mac)
(optional) An SFTP client is helpful (e.g., Cyberduck).
Globus Connect. File transfer approved by DRAC
Log in to a system:
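For example, connecting over ssh looks something like this (the username is a placeholder, and the hostname depends on which cluster you were granted access to):
# connect to a DRAC login node; you'll be prompted for your password
ssh yourname@cedar.alliancecan.ca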
Tip
If you’re doing work for school: run it on one of these machines.
Once you connect with ssh:
There are no Applications loaded.
You must tell the system what you want.
The command is module load r or module load sas.
If you find yourself using the same modules all the time:
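The slides don't spell out the command here; one option (an assumption on my part — these clusters use Lmod, which supports saved collections, and the collection name below is made up) is:
module load r sas            # the modules you always want
module save my_defaults      # save them as a named collection
module restore my_defaults   # restore the whole collection at your next login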
My account is def-dajmcdon (that's me; accounts start with def-). Then I would start R and run whatever I want. If it takes more than an hour or needs more than 4GB of memory, it'll quit.
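A sketch of what that interactive request might look like (the time and memory flags are assumptions inferred from the limits just described):
salloc --time=1:00:00 --mem=4G --account=def-dajmcdon
module load r
R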
However, syzygy has little memory and little storage, so it won't handle intensive tasks.
I think your home dir is limited to 1GB
Example
Write an R / python script that does the whole analysis and saves the output.
You need to ask DRAC to run the script for you.
The salloc command we used before requested some resources from the slurm scheduler.
The job-name / output / error fields are for convenience; they produce an output file like jobid60607-60650934.out.
The slurm script is saved as dlbcl-slurm.sh.
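The script itself isn't reproduced here; a minimal sketch of what a submission script like dlbcl-slurm.sh could contain (the resource values and the analysis script name are assumptions, not the original):
#!/bin/bash
#SBATCH --account=def-dajmcdon
#SBATCH --job-name=dlbcl
#SBATCH --output=%x-%j.out   # %x = job name, %j = job id
#SBATCH --time=10:00:00
#SBATCH --mem=32G
module load r
Rscript dlbcl-analysis.R     # hypothetical script that does the whole analysis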
Important
Jobs inherit environment variables. So if you load modules, then submit, your modules are available to run.
On Cedar, jobs cannot run from ~/. They must be run from ~/scratch/ or ~/projects/.
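Putting those two notes together, a typical submission might look like this (the directory and script names are placeholders):
module load r               # loaded modules are inherited by the job
cd ~/scratch/my-project     # on Cedar, submit from scratch or projects, not home
sbatch dlbcl-slurm.sh       # hand the script to the slurm scheduler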
Big jobs (need lots of RAM)
GPU jobs (you want deep learning, I don’t know how to do this)
Other jobs with internal parallelism (I almost never do this)
Embarrassingly parallel jobs (I do this all the time)
R has packages which are good for parallelization (snow, snowfall, Rmpi, parallel).
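For instance, a tiny sketch of in-R parallelization with parallel (the task and core count are made up for illustration):
library(parallel)
# an embarrassingly parallel toy task: repeat a computation 100 times across 4 cores
fits <- mclapply(1:100, function(i) mean(rnorm(1000)), mc.cores = 4)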
Torque script
Torque is a different scheduler. UBC ARC Sockeye uses Torque. Looks much like Slurm.
Here, ClusterPermute.R uses Rmpi to do "parallel lapply".
So I asked for 8 processors on each of 8 nodes.
Torque script
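The original script isn't shown here; a sketch of a Torque/PBS script requesting 8 processors on each of 8 nodes might look like this (the directives and the mpiexec call are assumptions, not the original file):
#!/bin/bash
#PBS -l nodes=8:ppn=8
#PBS -l walltime=10:00:00
#PBS -N cluster-permute
cd $PBS_O_WORKDIR
# Rmpi spawns its own workers, so launch a single R process under MPI
mpiexec -n 1 R CMD BATCH ClusterPermute.R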
Problem
The scheduler has to find 8 nodes with 8 available processors before this job will start.
This often takes a while, sometimes days.
But the computations don't actually need all of those processors at the same time.
{batchtools}
Using R (or python) to parallelize is inefficient when there's a scheduler in the middle.
Better is to actually submit 64 different jobs each requiring 1 node
Then each can get out of the queue whenever a processor becomes available.
But that would seem to require writing 64 different slurm scripts.
{batchtools} does this for you, all in R.
It writes the slurm / torque scripts.
You work in R directly.
It's easy to port across machines / schedulers.
I can test parts of it (or even run it) on my machine without making changes for the cluster.
{batchtools}
Create a directory where all your jobs will live (in subdirectories). Mine is ~/
In that directory, you need a template file. Mine is ~/.batchtools.slurm.tmpl
(next slide)
Create a configuration file which lives in your home directory. You must name it ~/.batchtools.conf.R.
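A minimal ~/.batchtools.conf.R might look like the following (the default resource values are examples; the template path matches the file shown next):
# ~/.batchtools.conf.R
cluster.functions <- batchtools::makeClusterFunctionsSlurm(
  template = "~/.batchtools.slurm.tmpl"
)
# defaults used when submitJobs() doesn't specify resources
default.resources <- list(ncpus = 1, memory = "4G", walltime = "1:00:00")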
~/.batchtools.slurm.tmpl
#!/bin/bash
## Job Resource Interface Definition
##
## ntasks [integer(1)]: Number of required tasks,
## Set larger than 1 if you want to further parallelize
## with MPI within your job.
## ncpus [integer(1)]: Number of required cpus per task,
## Set larger than 1 if you want to further parallelize
## with multicore/parallel within each task.
## walltime [integer(1)]: Walltime for this job, in seconds.
## Must be at least 60 seconds for Slurm to work properly.
## memory [integer(1)]: Memory in megabytes for each cpu.
## Must be at least 100 (when I tried lower values my
## jobs did not start at all).
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.
<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>
#SBATCH --account=def-dajmcdon
#SBATCH --mail-user=daniel@stat.ubc.ca
#SBATCH --mail-type=ALL
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= resources$walltime %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>
## Run R:
## we merge R output with stdout from SLURM, which gets then logged via --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
See the vignette: vignette("batchtools") or the package documentation online.
Create a folder to hold your code. Mine usually contains 2 files: one to set up/run the experiment, one to collect results. Code needed to run the experiment lives in an R package.
Write a script to set up the experiment and submit.
Wait.
Collect your results (sketch below); copy them back to your machine, etc.
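For the collection step, something like this works (these are standard batchtools functions; the registry path is a placeholder):
library(batchtools)
# re-attach the registry created by the setup script
reg <- loadRegistry("path/to/registry", writeable = FALSE)
getStatus()                          # queued / running / done / errored counts
done <- findDone()
results <- reduceResultsList(done)   # the object each job returned
pars <- unwrap(getJobPars(done))     # the experiment parameters for each job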
An “extra” example in a methods paper to appease reviewers
The method is "principal components regression" with sparse principal components.
Got 413 COVID patients; measured "viral load" and gene expression.
9435 differentially expressed genes.
The method needs to form a 10K x 10K matrix multiple times and do an approximate SVD. Requires 32GB memory. Compute time is ~6 hours.
Two tuning parameters: \(\lambda\) and number of PCs
Want to do CV to choose them, then use those values on the whole data and describe the selected genes.
library(batchtools)
reg <- makeExperimentRegistry("spcr-genes", packages = c("tidyverse", "suffpcr"))
x <- readRDS(here::here("suffpcr-covid", "covid_x.rds"))
y <- readRDS(here::here("suffpcr-covid", "covid_y.rds"))
subsample <- function(data, job, ratio, ...) {
  n <- nrow(data$x)
  train <- sample(n, floor(n * ratio))
  test <- setdiff(seq_len(n), train)
  list(test = test, train = train)
}
addProblem("cv", data = list(x = x, y = y), fun = subsample)
addProblem("full", data = list(x = x, y = y))
addAlgorithm(
  "spcr_cv",
  fun = function(job, data, instance, ...) { # args are required
    fit <- suffpcr(
      data$x[instance$train, ], data$y[instance$train],
      lambda_min = 0, lambda_max = 1, ...
    )
    valid_err <- colMeans(
      (
        data$y[instance$test] -
          as.matrix(predict(fit, newdata = data$x[instance$test, ]))
      )^2,
      na.rm = TRUE
    )
    return(list(fit = fit, valid_err = valid_err))
  }
)
addAlgorithm(
  "spcr_full",
  fun = function(job, data, instance, ...) {
    suffpcr(data$x, data$y, lambda_max = 1, lambda_min = 0, ...)
  }
)
## Experimental design
pdes_cv <- list(cv = data.frame(ratio = .75))
pdes_full <- list(full = data.frame())
ades_cv <- list(spcr_cv = data.frame(d = c(3, 5, 15)))
ades_full <- list(spcr_full = data.frame(d = c(3, 5, 15)))
addExperiments(pdes_cv, ades_cv, repls = 5L)
addExperiments(pdes_full, ades_full)
submitJobs(
  findJobs(),
  resources = list(ncpus = 1, walltime = "8:00:00", memory = "32G")
)
End up with 18 jobs.
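Before waiting on results, the design can be sanity-checked from R (standard batchtools helpers; output not shown):
summarizeExperiments(by = c("problem", "algorithm"))  # counts: 15 cv jobs (3 d values x 5 repls) + 3 full fits
waitForJobs()                                         # optionally block until everything finishes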
Take a few very simple models and demonstrate that some choices make a big difference in accuracy.
At each time \(t\), download COVID cases as observed on day \(t\) for a bunch of locations
Estimate a few different models for predicting days \(t+1,\ldots,t+k\)
Store point and interval forecasts.
Do this for \(t\) every week over a year.
fcasters <- list.files(here::here("code", "forecasters"))
for (fcaster in fcasters) source(here::here("code", "forecasters", fcaster))
registry_path <- here::here("data", "forecast-experiments")
source(here::here("code", "common-pars.R"))
# Setup the data ----------------------------------------------------
reg <- makeExperimentRegistry(
  registry_path,
  packages = c("tidyverse", "covidcast"),
  source = c(
    here::here("code", "forecasters", fcasters),
    here::here("code", "common-pars.R")
  )
)
grab_data <- function(data, job, forecast_date, ...) {
  dat <- covidcast_signals(
    data_sources, signals, as_of = forecast_date,
    geo_type = geo_type, start_day = "2020-04-15") %>%
    aggregate_signals(format = "wide")
  names(dat)[3:5] <- c("value", "num", "covariate") # assumes 2 signals
  dat %>%
    filter(!(geo_value %in% drop_geos)) %>%
    group_by(geo_value) %>%
    arrange(time_value)
}
addProblem("covidcast_proper", fun = grab_data, cache = TRUE)
# Algorithm wrappers -----------------------------------------------------
baseline <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, value) %>%
    group_modify(prob_baseline, ...)
}
ar <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, value) %>%
    group_modify(prob_ar, ...)
}
qar <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, value) %>%
    group_modify(quant_ar, ...)
}
gam <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, value) %>%
    group_modify(safe_prob_gam_ar, ...)
}
ar_cov <- function(data, job, instance, ...) {
  instance %>%
    group_modify(prob_ar_cov, ...)
}
joint <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, value) %>%
    joint_ar(...)
}
corrected_ar <- function(data, job, instance, ...) {
  instance %>%
    dplyr::select(geo_value, time_value, num) %>%
    rename(value = num) %>%
    corrections_single_signal(cparams) %>%
    group_modify(prob_ar, ...)
}
addAlgorithm("baseline", baseline)
addAlgorithm("ar", ar)
addAlgorithm("qar", qar)
addAlgorithm("gam", gam)
addAlgorithm("ar_cov", ar_cov)
addAlgorithm("joint_ar", joint)
addAlgorithm("corrections", corrected_ar)
# Experimental design -----------------------------------------------------
problem_design <- list(covidcast_proper = data.frame(forecast_date = forecast_dates))
algorithm_design <- list(
  baseline = CJ(
    train_window = train_windows, min_train_window = min(train_windows), ahead = aheads
  ),
  ar = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  ),
  qar = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  ),
  gam = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads, df = gam_df
  ),
  ar_cov = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  ),
  joint_ar = CJ(
    train_window = joint_train_windows, min_train_window = min(joint_train_windows),
    lags = lags_list, ahead = aheads
  ),
  corrections = CJ(
    train_window = train_windows, min_train_window = min(train_windows),
    lags = lags_list, ahead = aheads
  )
)
addExperiments(problem_design, algorithm_design)
ids <- unwrap(getJobPars()) %>%
  select(job.id, forecast_date) %>%
  mutate(chunk = as.integer(as.factor(forecast_date))) %>%
  select(-forecast_date)
## ~13000 jobs, we don't want to submit that many since they run fast
## Chunk them into groups by forecast_date (to download once for the group)
## Results in 68 chunks
submitJobs(
  ids,
  resources = list(ncpus = 1, walltime = "4:00:00", memory = "16G")
)