Slurm Queues (Partitions) and Resource Management

Partitions

Because workloads on the cluster are heterogeneous and have different batch requirements, we have implemented partitions in Slurm, which are similar to queues.

Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run.

Partition Name   Default Walltime Limit   Maximum Walltime Limit   Default Partition?   Job Priority   Maximum Nodes Utilized
short            10 minutes               1 hour                   Yes                  Normal         All
medium           1 hour                   12 hours                 No                   Normal         15
long             12 hours                 7 days                   No                   Normal         10
high_priority    10 minutes               7 days                   No                   High           All
gpu              10 minutes               7 days                   No                   Normal         6
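
You can also cross-check these limits on the cluster itself with the standard sinfo command; the partition names and limits it prints reflect whatever is currently configured, so treat the chart above as a summary and sinfo as the live view:

    # Show each partition's time limit and node count
    # (%P = partition name, %l = time limit, %D = number of nodes)
    $ sinfo -o "%P %l %D"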

If you do not specify a partition for your job (with e.g. --partition=medium), it will automatically be assigned to the "short" partition by default. If you do not specify a walltime in your job submission script (with e.g. --time=00:30:00), it will inherit the "Default Walltime Limit" of the partition it is assigned to. It is therefore a very good idea to specify both a partition and a walltime limit explicitly; otherwise your jobs get the defaults in the chart above.
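
For example, a minimal sketch of a batch script header that requests the medium partition and a 30-minute walltime explicitly might look like this (the job name, resource values, and my_analysis.sh are placeholders for your own job):

    #!/bin/bash
    #SBATCH --job-name=my_test_job   # placeholder name
    #SBATCH --partition=medium       # 12 hour maximum walltime
    #SBATCH --time=00:30:00          # explicit 30 minute walltime request
    #SBATCH --cpus-per-task=1        # one CPU core
    #SBATCH --mem=1G                 # 1 GB of RAM

    ./my_analysis.sh                 # placeholder for the actual work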

This all means that it is very important to TEST your jobs before running many of them! Submit one job and note how many resources it uses (RAM, CPU) and how long it takes to run. Then, when you submit many of those jobs, you can correctly specify the number of CPU cores each job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad that by about 40% to account for variable conditions like disk I/O load and CPU context-switching overhead).

You can test your jobs by running one job via srun with fairly high CPU, RAM, and walltime limits (so it isn't killed by the default limits), then noting how many resources it consumed once it finishes.
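
As a rough sketch, such a test run might look like the following, where my_analysis.sh is a placeholder for your actual command and the limits are deliberately generous:

    # Run a single test job with generous limits so it is not killed by the defaults
    $ srun --partition=medium --time=02:00:00 --cpus-per-task=4 --mem=8G ./my_analysis.sh

    # After it completes, check what it actually used (see the example below)
    $ seff <jobid>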

Example

seff 769059

Output

Job ID: 769059
Cluster: phoenix
User/Group: <user-name>/<group-name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:00:01
CPU Efficiency: 0.11% of 00:15:28 core-walltime
Job Wall-clock time: 00:00:58
Memory Utilized: 4.79 MB
Memory Efficiency: 4.79% of 100.00 MB

So if I needed to run around 1,000 of these jobs, and they were all similar, I would select the "short" partition, 1 CPU core, about 8 MB of RAM, and a 90-second walltime limit. Note how I padded the RAM and walltime a bit to account for unexpectedly variable cluster conditions.
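
Based on those numbers, a submission script header for one of these hypothetical jobs might look roughly like this (the command itself is a placeholder):

    #!/bin/bash
    #SBATCH --partition=short   # each job finishes well under the 1 hour maximum
    #SBATCH --cpus-per-task=1   # observed CPU usage was far below one core
    #SBATCH --mem=8M            # ~4.8 MB observed, padded to 8 MB
    #SBATCH --time=00:01:30     # 58 seconds observed, padded to 90 seconds

    ./my_analysis.sh            # placeholder for the actual per-job command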

high_priority Partition Notes

The "high_priority" partition is special in that it will have the highest priority of all jobs on the cluster and will push all other jobs aside in an effort to finish jobs in that partition as fast as possible. This is only available for emergency or mission critical batches that need to be completed in an unexpectedly critically fast way. Access to this partition is only granted on a per request basis, and is temporary until your batch finishes. Email cluster-admin@soe.ucsc.edu if you need to access the high_priority queue and make your case why it is necessary.

My job is not running but I want it to be running

Even if your job is in the high_priority partition, that doesn't mean the cluster will drop everything and run it immediately. Because we don't have preemption set up, high-priority jobs still have to wait for currently running jobs to finish, as well as for other high-priority jobs ahead of them. And since, as noted above, jobs can be allowed to run for up to 7 days each, it is entirely possible for even the highest-priority job in the whole cluster to not start for a whole week.

Here is a good resource from Berkeley about understanding and debugging Slurm job scheduling. Basically, Slurm uses the wall-clock limits of running jobs, and of jobs in the queue, to make a plan to start each job on some node at some time in the future. If jobs finish early, other jobs can start sooner than scheduled, and if there is space around higher-priority jobs, lower-priority jobs can be backfilled into the gaps.

If you want to know when Slurm plans to run your job, and why that is not right now, you can use the --start option for the squeue command:

   $ squeue -j 1719584 --start
                JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)
              1719584     short snakemak flastnam PD 2024-01-22T10:20:00      1 phoenix-00           (Priority)

The START_TIME column is the time by which Slurm is sure it will be able to start your job if no higher-priority jobs come in first, and the NODELIST(REASON) column shows the nodes the job is running on, or the reason it is not running now, in parentheses. In this case, the job is not running because higher-priority jobs are in the way.