Overview of using Slurm

From UCSC Genomics Institute Computing Infrastructure Information

Latest revision as of 21:01, 17 May 2024

To use Slurm, you will need to log into one of the interactive compute servers in the PRISM area (such as emerald, mustard, crimson or razzmatazz). Once you have ssh'd in there, you can run Slurm batch or interactive commands.

You might also want to consult the Quick Reference Guide.

Submit a Slurm Batch Job

To submit a Slurm batch job, you will need a directory that you have both read and write access to on all the nodes (this will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my group's area:

% mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1
% cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1
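
Before going further, it can be worth confirming that the directory really is writable. Here is a minimal sketch, assuming bash; the function name check_job_dir is made up for illustration, and it only checks the server you are logged into, not every compute node:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the directory exists and a file
# can actually be created (and then removed) inside it.
check_job_dir() {
    local dir="$1"
    [ -d "$dir" ] || { echo "missing: $dir" >&2; return 1; }
    local probe
    probe="$(mktemp "$dir/.write-test.XXXXXX")" || {
        echo "not writable: $dir" >&2
        return 1
    }
    rm -f "$probe"
    echo "ok: $dir"
}

# Usage:
# check_job_dir /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1
```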

Then you will need to create your job submission batch file. My file is called 'slurm-test.sh':

% vim slurm-test.sh

Then populate the file as necessary:

#!/bin/bash
# Job name:
#SBATCH --job-name=weiler_test
#
# Partition - This is the queue it goes in:
#SBATCH --partition=short
#
# Where to send email (optional; Slurm only sends mail if --mail-type is also set)
#SBATCH --mail-user=weiler@ucsc.edu
#
# Number of nodes you need per job:
#SBATCH --nodes=1
#
# Memory needed for the jobs.  Try very hard to make this accurate.  DEFAULT = 4gb
#SBATCH --mem=4gb
#
# Number of tasks (one for each CPU desired for use case) (example):
#SBATCH --ntasks=1
#
# Processors per task:
# At least eight times the number of GPUs needed for NVIDIA RTX A5500
#SBATCH --cpus-per-task=1
#
# Number of GPUs, this can be in the format of "--gres=gpu:[1-8]", or "--gres=gpu:A5500:[1-8]" with the type included (optional)
#SBATCH --gres=gpu:1
#
# Standard output and error log
#SBATCH --output=serial_test_%j.log
#
# Wall clock limit in hrs:min:sec:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
pwd; hostname; date
echo "Running test script on a single CPU core"
sleep 5
echo "Test done!"
date

Keep the "#SBATCH" lines commented out with the leading "#"; the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file.

To submit the batch job:

% sbatch slurm-test.sh
Submitted batch job 7
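
If you want to use the job ID in a script, you can pull it out of that output line. This is a minimal sketch, assuming bash and the "Submitted batch job N" output format shown above; the function name job_id_from_sbatch is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: read sbatch's output on standard input and print
# only the numeric job ID from the "Submitted batch job N" line.
job_id_from_sbatch() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Usage on the cluster (left commented so the snippet has no side effects):
# jobid="$(sbatch slurm-test.sh | job_id_from_sbatch)"
# echo "Submitted as job $jobid"
```

sbatch also has a "--parsable" option that prints just the job ID (plus the cluster name in multi-cluster setups), which avoids the parsing step.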

The job(s) will then be scheduled. You can check the state of the queue with squeue:

 % squeue
            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                7     batch weiler_t   weiler  R       0:07      1 phoenix-01
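
On a busy cluster the queue gets long, so you may want to filter it. Here is a minimal sketch, assuming bash, awk, and squeue's default column layout as shown above (the state is the fifth column, which breaks if a job name contains spaces); the function name jobs_in_state is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: read squeue's default output on standard input and
# print the JOBID of every job in the given state (R = running,
# PD = pending). The header line is skipped.
jobs_in_state() {
    local state="$1"
    awk -v st="$state" 'NR > 1 && $5 == st { print $1 }'
}

# Usage on the cluster (show only my own pending jobs):
# squeue -u "$USER" | jobs_in_state PD
```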

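The job will write any STDOUT or STDERR to the directory you launched the job from. In the example above, both go to the file serial_test_<jobid>.log, because "%j" in the "--output" option is replaced by the job ID. Other than that, it will do whatever the job does, even if there is no STDOUT.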

Launching Several Jobs at Once

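You can launch many jobs at once as a job array. The "--array" option sets the range of task IDs, and each task can read its own ID from the $SLURM_ARRAY_TASK_ID variable. In the output file names, %A is replaced by the array's job ID and %a by the task ID. Add something like the following to your batch submission file: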

#SBATCH --array=0-31
#SBATCH --output=array_job_%A_task_%a.out
#SBATCH --error=array_job_%A_task_%a.err
## Command(s) to run:
echo "I am task $SLURM_ARRAY_TASK_ID"
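
A common use of the task ID is to pick a different input for each task. Here is a minimal sketch, assuming bash and a plain text file with one input per line; the function name input_for_task and the file name inputs.txt are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: print line (task ID + 1) of a list file, so that
# task 0 gets the first line, matching an --array range that starts at 0.
input_for_task() {
    local list_file="$1" task_id="$2"
    sed -n "$((task_id + 1))p" "$list_file"
}

# Usage inside the batch file (inputs.txt is an assumed file name):
# input="$(input_for_task inputs.txt "$SLURM_ARRAY_TASK_ID")"
# echo "Task $SLURM_ARRAY_TASK_ID is processing $input"
```

With "--array=0-31" as above, the list file would need 32 lines, one for each task.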

CGROUPS and Resource Management

Our installation of Slurm uses Linux CGROUPS, which put a hard resource cap on jobs. If you declare that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM (out of memory) error. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. The same goes for the "--time" batch file option: your job will fail if it takes longer than what you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will.

So... TEST YOUR JOBS! Find out how many resources a single job needs before you launch 100 of them.

TEST YOUR JOBS!

Let me say that one more time. Test your jobs before launching a bunch of them! If a job fails, you don't want it to fail 100 or more times. Testing also gives you a good idea of how much RAM and CPU a job will need, so you can better define your batch files. It's critical to know how many resources each of your jobs will use and to define your job file appropriately.
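
One way to turn a test run into a memory request is to look at the peak memory of the finished job and add some headroom. Here is a minimal sketch, assuming bash, that job accounting is enabled on the cluster so that sacct can report MaxRSS (I have not confirmed that here), and that the value is printed in kilobytes with a trailing "K"; the function name suggest_mem_mb and the 20% default headroom are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: turn a MaxRSS value such as "3145728K" into a
# memory request in megabytes, rounded up, with percentage headroom.
# Assumes the value is given in kilobytes with a trailing "K".
suggest_mem_mb() {
    local maxrss="$1" headroom_pct="${2:-20}"
    local kib="${maxrss%K}"                 # strip the trailing K
    local mb=$(( (kib + 1023) / 1024 ))     # round up to whole megabytes
    echo $(( (mb * (100 + headroom_pct) + 99) / 100 ))
}

# Usage on the cluster, after a single test job (job 7 here) has finished:
# sacct -j 7 --format=JobID,MaxRSS,Elapsed,State
# suggest_mem_mb 3145728K
```

For a job that peaked at 3145728K (3GB), this suggests 3687, which you would put in the batch file as "#SBATCH --mem=3687M". The Elapsed column is a similar starting point for the "--time" option.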