Slurm Queues (Partitions) and Resource Management
Due to heterogeneous workloads and differing batch requirements, we have implemented partitions in Slurm, which are similar to queues.
Each partition has different default and maximum walltime (runtime) limits. You will need to select a partition for your jobs based on what kind of jobs they are and how long they are expected to run.
| Partition Name | Default Walltime Limit | Maximum Walltime Limit | Default Partition? | Job Priority | Maximum Nodes Utilized |
|---|---|---|---|---|---|
| short | 10 minutes | 1 hour | Yes | Normal | All |
| medium | 1 hour | 12 hours | No | Normal | 15 |
| long | 12 hours | 7 days | No | Normal | 10 |
| high_priority | 10 minutes | 7 days | No | High | All |
| gpu | 10 minutes | 7 days | No | Normal | 6 |
If you do not specify a partition, your job will automatically be assigned to the "short" partition by default. If you do not specify a walltime in your job submission script, the job inherits the "Default Walltime Limit" of the partition it is assigned to. It is therefore a very good idea to specify both the partition your job should go in and a walltime limit; otherwise your jobs will inherit the defaults in the table above.
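For example (a minimal sketch; the job name, resource amounts, and program are illustrative placeholders, and "medium" is one of the partitions in the table above), a batch script that explicitly requests a partition and a walltime could look like this:

#!/bin/bash
#SBATCH --job-name=test_job       # illustrative job name
#SBATCH --partition=medium        # run in the "medium" partition (12-hour maximum)
#SBATCH --time=02:00:00           # request 2 hours instead of the 1-hour partition default
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4         # illustrative CPU request
#SBATCH --mem=8G                  # illustrative memory request

./my_program                      # replace with your actual command

Submit the script with sbatch <script-name>.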
This all means that it is very important to TEST your jobs before running many of them! Submit one job and note how many resources it uses (RAM, CPU) and how long it takes to run. Then, when you submit many of those jobs, you can correctly specify the number of CPU cores each job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by about 40% to account for environmental factors such as disk I/O load and CPU context-switching load).
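For instance, with illustrative numbers: if a single test job peaks at about 8 GB of RAM and finishes in 5 hours, request roughly 10 GB (8 GB + 20%) and a 7-hour walltime (5 hours + 40%) for the production runs; 7 hours still fits within the "medium" partition's 12-hour maximum.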
You can test your jobs by running one job via srun and then, once it finishes, noting how many resources it consumed while running.
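As a sketch (the partition, limits, and program name are illustrative), a single test job could be launched like this:

srun --partition=short --time=00:30:00 --cpus-per-task=4 --mem=4G ./my_program

After the job finishes, seff reports how much of the allocated CPU and memory it actually used, as in the example below.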
Example
seff 769059
Output
Job ID: 769059
Cluster: discovery
User/Group: <user-name>/<group-name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:00:01
CPU Efficiency: 0.11% of 00:15:28 core-walltime
Job Wall-clock time: 00:00:58
Memory Utilized: 4.79 MB
Memory Efficiency: 4.79% of 100.00 MB
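If you want a per-step breakdown instead of a summary, sacct can report similar information for a finished job (the format fields shown here are one reasonable subset; adjust them as needed):

sacct -j 769059 --format=JobID,JobName,Partition,Elapsed,TotalCPU,MaxRSS,State

Use the measured values, plus the padding described above, when setting --cpus-per-task, --mem, and --time for your production jobs.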