Slurm Queues (Partitions) and Resource Management
Due to heterogeneous workloads and differing batch requirements, we have implemented partitions in Slurm, which are similar to queues.
Each partition has different default and maximum walltime (runtime) limits. You will need to select a partition for your jobs based on what kind of jobs they are and how long they are expected to run.
| Partition Name | Default Walltime Limit | Maximum Walltime Limit | Default Partition? | Job Priority | Maximum Nodes Utilized |
|---|---|---|---|---|---|
| short | 10 minutes | 1 hour | Yes | Normal | All |
| medium | 1 hour | 12 hours | No | Normal | 15 |
| long | 12 hours | 7 days | No | Normal | 10 |
| high_priority | 10 minutes | 7 days | No | High | All |
| gpu | 10 minutes | 7 days | No | Normal | 6 |
If you do not specify a partition, your job will automatically be assigned to the "short" partition by default. If you do not specify a walltime in your job submission script, the job inherits the "Default Walltime Limit" of the partition it is assigned to. It is therefore a very good idea to specify both the partition your job should go in and a walltime limit; otherwise your jobs will inherit the defaults in the table above.
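For example (a minimal sketch; the job name, resource amounts, and program are illustrative placeholders, and "medium" is one of the partitions in the table above), a batch script that explicitly requests a partition and a walltime could look like this:

#!/bin/bash
#SBATCH --job-name=test_job       # illustrative job name
#SBATCH --partition=medium        # run in the "medium" partition (12-hour maximum)
#SBATCH --time=02:00:00           # request 2 hours instead of the 1-hour partition default
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4         # illustrative CPU request
#SBATCH --mem=8G                  # illustrative memory request

./my_program                      # replace with your actual command

Submit the script with sbatch <script-name>.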
This all means that it is very important to TEST your jobs before running many of them! Submit one job and note how many resources it uses (RAM, CPU) and how long it takes to run. Then, when you submit many of those jobs, you can correctly specify the number of CPU cores each job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by about 40% to account for environmental factors such as disk I/O load and CPU context-switching load).
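For instance, with illustrative numbers: if a single test job peaks at about 8 GB of RAM and finishes in 5 hours, request roughly 10 GB (8 GB + 20%) and a 7-hour walltime (5 hours + 40%) for the production runs; 7 hours still fits within the "medium" partition's 12-hour maximum.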
You can test your jobs by running one job via srun and then, once it finishes, noting how many resources it consumed while running.
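As a sketch (the partition, limits, and program name are illustrative), a single test job could be launched like this:

srun --partition=short --time=00:30:00 --cpus-per-task=4 --mem=4G ./my_program

After the job finishes, seff reports how much of the allocated CPU and memory it actually used, as in the example below.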
Example
seff 769059
Output
Job ID: 769059
Cluster: discovery
User/Group: <user-name>/<group-name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:00:01
CPU Efficiency: 0.11% of 00:15:28 core-walltime
Job Wall-clock time: 00:00:58
Memory Utilized: 4.79 MB
Memory Efficiency: 4.79% of 100.00 MB
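If you want a per-step breakdown instead of a summary, sacct can report similar information for a finished job (the format fields shown here are one reasonable subset; adjust them as needed):

sacct -j 769059 --format=JobID,JobName,Partition,Elapsed,TotalCPU,MaxRSS,State

Use the measured values, plus the padding described above, when setting --cpus-per-task, --mem, and --time for your production jobs.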