Slurm Best Practices
PLEASE READ THIS CAREFULLY! It will guide you on how to best use Slurm to get your work done as efficiently as possible without causing issues for other users.
Test Your Jobs!
This is the number one thing to remember. Test your job batch by running one or two jobs by themselves first, and check how long they take to run, how much memory and CPU they need, and how much disk I/O they generate. Ask for more resources than you think you need, but when the job ends, use seff <job_id> to see how much CPU, RAM, time, etc. the job actually used. Then, when you schedule more jobs of that same type, you will know how much to ask for in terms of resources. THIS IS VERY IMPORTANT. When you ask for resources, Slurm pins those resources and no one else can use them, so you need to know how much your job will use before launching a bunch of them. Don't just take a script a co-worker gave you and run it blindly. UNDERSTAND what the script does and how it works! Failure to understand a script someone gave you can crash the cluster and grind operations to a halt.
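For example, a minimal test submission might look like the following sketch (the script name, resource numbers, and the my_pipeline.sh step are placeholders; substitute your own workload):

#!/bin/bash
#SBATCH --job-name=test-run
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=test-run_%j.log

# Run one representative job first to measure real usage.
./my_pipeline.sh input_001.dat

Submit it with sbatch, and when it finishes run seff <job_id> to compare the CPU efficiency, memory used, and wall time against what you requested before scaling up to the full batch.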
Disk I/O
Also of critical importance is making sure your jobs are not overrunning disk I/O on our shared filesystems, especially /private/groups. If your jobs read the same data in more than once, it is often a good idea to copy your data, as part of your job script, to /data/tmp/ first; your job can then read that data locally on each node very fast. Be sure to delete the data at the end of your job script so others can use the space.
To clarify, /data/tmp exists on each cluster node. Do not copy your data to /data/tmp on the login nodes (mustard, emerald, crimson and razzmatazz). Have your Slurm batch script copy the data to /data/tmp as part of the job workflow, and it will be copied to /data/tmp on whatever node the job is scheduled to run on.
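A sketch of that copy-in/copy-out pattern inside a batch script might look like this (the group path, input file, and analysis command are placeholders for illustration):

#!/bin/bash
#SBATCH --job-name=local-scratch-example
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=04:00:00

# Stage the input data onto the node-local scratch space.
SCRATCH=/data/tmp/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH"
cp /private/groups/mygroup/reference.fa "$SCRATCH/"

# Run the analysis against the local copy.
./my_analysis.sh "$SCRATCH/reference.fa"

# Clean up so others can use the space.
rm -rf "$SCRATCH"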
You can also check whether your jobs create a massive spike in filesystem bandwidth by watching the Grafana filesystem bandwidth page:
username: guest password: MoreStats4me
If you see a big spike in read and/or write bandwidth right as your job starts, then you are probably using a very large amount of filesystem I/O and may want to consider running fewer concurrent jobs.
Limit Number of Concurrent Jobs
It is almost always good to limit the number of concurrent jobs that Slurm will run at once from your batch. If your batch has 100 jobs and you launch it without limits, Slurm will try to run them all at once, or, if the queue is full, it will run them as soon as it can. This can often lead to:
1: Pushing other jobs aside
2: Creating excessive disk and network I/O
3: Forcing other jobs in the queue to wait for long periods of time to start
In the case of 100 jobs, you can use the --array parameter to limit the maximum number of jobs that run concurrently, for example by putting this in your batch script:
#SBATCH --array=0-99%10
That indicates that your 100 jobs will be assigned task ID numbers 0-99 and that only 10 of them will be allowed to run at once. You can also pass --array=0-99%10 directly as an argument to sbatch if launching on the command line.
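Putting it together, a sketch of an array batch script might look like the following (the input naming scheme and the process_sample.sh command are placeholders):

#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --array=0-99%10
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=01:00:00

# Each task receives its own index in SLURM_ARRAY_TASK_ID (0 through 99 here),
# and at most 10 tasks run at the same time because of the %10 limit.
./process_sample.sh input_${SLURM_ARRAY_TASK_ID}.dat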
Remember that the cluster is a shared resource, and everyone needs to have relatively equal time on it - as much as possible.
Job Prioritization
Each queued job has a priority score. Jobs start when sufficient resources (CPUs, GPUs, memory, licenses) are available and not already reserved for jobs with a higher priority.
To see the priorities of your currently pending jobs you can use the command sprio -u $USER.
Job Age
Job priority slowly rises as a pending job gets older: 1 point per hour, for up to 3 weeks (so a job that has waited the full 3 weeks has accrued 504 age points).
Job Size or "TRES" (Trackable RESources)
This slightly favors jobs which request a larger count of CPUs (or memory or GPUs) as a means of countering their otherwise inherently longer wait times.
Whole-node jobs and others with a similarly high count of cores-per-node will get a priority boost (visible in the “site factor” of sprio). This is to help whole-node jobs get ahead of large distributed jobs with many tasks spread over many nodes.
Nice values
It is possible to give a job a "nice" value which is subtracted from its priority. You can do that with the --nice option of sbatch or the scontrol update command. The command scontrol top <jobid> adjusts nice values to increase the priority of one of your jobs at the expense of any others you have in the same partition.
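For example (the nice value and <jobid> are illustrative; pick values appropriate for your work):

# Submit a job at lower priority so it defers to your other work.
sbatch --nice=100 my_batch.sh

# Lower the priority of a job that is already queued.
scontrol update JobId=<jobid> Nice=100

# Push one pending job ahead of your other jobs in the same partition.
scontrol top <jobid>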
Holds
Jobs with a priority of 0 are in a "held" state and will never start without further intervention. You can hold jobs with the command scontrol hold <jobid> and release them with scontrol release <jobid>. Jobs can also end up in this state when they get requeued after a node failure.
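As a quick illustration (sbatch's --hold flag is another way to put a job into the held state at submission time):

sbatch --hold my_batch.sh     # submit the job already held
scontrol hold <jobid>         # hold an existing pending job
scontrol release <jobid>      # release it so it can be scheduled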
Slurm Interactive Sessions
A SLURM interactive session reserves resources on compute nodes allowing you to use them interactively as you would the login nodes.
WARNING: Once it starts, an interactive session consumes the entire requested block of CPU time and other resources until you exit, even if they sit idle. Don't forget to exit an interactive session once you are finished. Also, and this is VERY important to remember: don't start an interactive session with the maximum available walltime just to pin the node for yourself so your jobs don't have to wait in the queue. This is extremely bad etiquette, and people often notice when idle interactive sessions exist and complain about it. Just don't do it. If you need an interactive session, launch it, run your code, and exit as soon as you are done.
There are two main commands that can be used to start a session, srun and salloc, both of which accept most of the same options as sbatch.
Using srun --pty bash
srun will add your resource request to the queue. When the allocation starts, a new bash session will start up on one of the granted nodes.
For example:
srun --job-name "InteractiveJob" --cpus-per-task 8 --mem-per-cpu 1500 --time 24:00:00 --pty bash
You will receive a message:
srun: job 10256812 queued and waiting for resources
And when the job starts:
srun: job 10256812 has been allocated resources
For a full description of srun and its options, see the schedmd documentation.
Using salloc
salloc functions similarly to srun --pty bash in that it will add your resource request to the queue. However, when the allocation starts, the new bash session starts on the login node rather than a compute node. This is useful for running a GUI on the login node while your processes run on the compute nodes.
For example:
salloc --job-name "InteractiveJob" --cpus-per-task 8 --mem-per-cpu 1500 --time 24:00:00
You will receive a message:
salloc: Pending job allocation 10256925
salloc: job 10256925 queued and waiting for resources
And when the job starts:
salloc: job 10256925 has been allocated resources
salloc: Granted job allocation 10256925
Note that you are still on the login node (mustard or emerald); however, you will now have permission to ssh to any node on which you have a session.
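For example, after the allocation is granted you might do something like the following (the node name reported by squeue will vary):

squeue -u $USER     # find which node(s) your allocation is running on
ssh <nodename>      # hop onto one of your allocated nodes

Alternatively, from within the salloc shell, srun --pty bash will start a shell on the allocated compute node.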