Slurm Queues (Partitions) and Resource Management
Latest revision as of 16:02, 29 June 2024
Partitions
Because workloads are heterogeneous and batch requirements differ, we have implemented partitions in Slurm, which are similar to queues.
Each partition has different default and maximum walltime limits (aka "runtime" limits). You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run.
Partition Name | Default Walltime Limit | Maximum Walltime Limit | Default Partition? | Job Priority | Maximum Nodes Utilized |
---|---|---|---|---|---|
short | 10 minutes | 1 hour | Yes | Normal | All |
medium | 1 hour | 12 hours | No | Normal | 15 |
long | 12 hours | 14 days | No | Normal | 10 |
high_priority | 10 minutes | 7 days | No | High | All |
gpu | 10 minutes | 7 days | No | Normal | 6 |
If you do not specify a partition to run your job in (with e.g. --partition=medium), it will automatically be assigned to the "short" partition. If you do not specify a walltime value in your job submission script (with e.g. --time=00:30:00), it will inherit the "Default Walltime Limit" of the partition it is assigned to, as listed in the table above. It is therefore a very good idea to specify both the partition and a walltime limit for every job.
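As a minimal sketch of the advice above, a submission script can set both values with #SBATCH directives. The job name and the echo line are placeholders for illustration only; the partition and walltime values are the ones used as examples in the text.

```shell
#!/bin/bash
#SBATCH --job-name=example     # hypothetical job name
#SBATCH --partition=medium     # choose the partition explicitly
#SBATCH --time=00:30:00        # walltime limit as HH:MM:SS

# Placeholder workload; replace with your real commands.
# SLURM_JOB_ID is set by Slurm when the script runs as a job.
msg="Job ${SLURM_JOB_ID:-unset} starting"
echo "$msg"
```

The same options can also be given on the sbatch command line, where they take precedence over the directives in the script.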
This all means that it is very important to TEST your jobs before running many of them! Submit one job and note what resources it uses (RAM, CPU) and how long it takes to run. Then, when you submit many of those jobs, you can correctly specify the number of CPU cores your job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by 40% to account for environmental factors such as disk I/O load and CPU context-switching load).
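The padding arithmetic can be sketched with a small helper. The function name pad is hypothetical, and the inputs are the 4.79 MB and 58 seconds reported by the seff example later in this section; the result is rounded up to a whole unit.

```shell
# pad VALUE PERCENT: print VALUE increased by PERCENT percent, rounded up.
# "pad" is a hypothetical helper name, not a Slurm command.
pad() {
    awk -v v="$1" -v p="$2" 'BEGIN {
        x = v * (100 + p) / 100   # apply the padding
        c = int(x)                # truncate toward zero
        if (c < x) c++            # round up if anything was cut off
        print c
    }'
}

pad 4.79 20   # memory in MB padded by 20%: prints 6
pad 58 40     # runtime in seconds padded by 40%: prints 82
```

These are lower bounds for the request; rounding up further to a convenient figure does no harm.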
You can test your jobs by running one job via srun with fairly high CPU, RAM and walltime limits (just so it isn't killed for exceeding the default limits), then checking how many resources it consumed once it finishes.
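A test run along these lines might look like the sketch below. The limits (4 cores, 8 GB, 2 hours in the "medium" partition) and the script name ./my_job.sh are assumptions chosen for illustration; pick values that comfortably exceed what you expect the job to need. The wrapper function is hypothetical and only prints the command when srun is not available.

```shell
# run_test_job COMMAND [ARGS...]: run one command under srun with generous limits.
# The limits below are illustrative assumptions, not site recommendations.
run_test_job() {
    set -- srun --partition=medium --cpus-per-task=4 --mem=8G --time=02:00:00 "$@"
    if command -v srun >/dev/null 2>&1; then
        "$@"                      # on the cluster: actually launch the job
    else
        echo "would run: $*"      # elsewhere: just show the command
    fi
}

run_test_job ./my_job.sh          # hypothetical job script
```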
Example
seff 769059
Output
Job ID: 769059
Cluster: phoenix
User/Group: <user-name>/<group-name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:00:01
CPU Efficiency: 0.11% of 00:15:28 core-walltime
Job Wall-clock time: 00:00:58
Memory Utilized: 4.79 MB
Memory Efficiency: 4.79% of 100.00 MB
So if I needed to run, say, 1000 of these jobs, and they were all similar, I would select the "short" partition and request 1 CPU core, perhaps 8 MB of RAM, and perhaps a 90-second walltime limit. Note how I padded the RAM and walltime a bit to account for unexpected, variable cluster conditions.
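One way to submit such a batch is a Slurm job array, sketched below with the limits chosen above. Using an array is an assumption here (the text does not say how the 1000 jobs are submitted), and the input file naming is hypothetical. The maximum array size is a site setting, so a batch this large may need to be split.

```shell
#!/bin/bash
#SBATCH --partition=short
#SBATCH --array=1-1000          # one array task per job
#SBATCH --cpus-per-task=1
#SBATCH --mem=8M                # padded from 4.79 MB used
#SBATCH --time=00:01:30         # padded from 58 seconds of wall-clock time

# SLURM_ARRAY_TASK_ID is set by Slurm for each array task; default to 1 for local testing.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
input_file="input_${task_id}.txt"   # hypothetical naming scheme

# Placeholder workload; replace with the real command.
echo "Task ${task_id} would process ${input_file}"
```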
high_priority Partition Notes
The "high_priority" partition is special in that its jobs have the highest priority of all jobs on the cluster and are scheduled ahead of all other waiting jobs, so that they finish as fast as possible. It is only available for emergency or mission-critical batches that unexpectedly need to be completed very quickly. Access to this partition is granted only on a per-request basis, and is temporary until your batch finishes. Email cluster-admin@soe.ucsc.edu if you need access to the high_priority partition, and make your case for why it is necessary.
My job is not running but I want it to be running
Even if your job is in the high_priority partition, that doesn't mean that the cluster will drop everything and run it immediately. Because we don't have pre-emption set up, high-priority jobs still have to wait for currently running jobs to finish, as well as for other high-priority jobs. And since, per the table above, jobs can be allowed to run for up to 7 days each (14 days in the "long" partition), it is entirely possible for even the highest-priority job in the whole cluster to wait a week or more before it starts.
Here is a good resource from Berkeley about understanding and debugging Slurm job scheduling: https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/why-job-not-run/. Basically, Slurm uses the wall-clock limits of running jobs, and of jobs in the queue, to make a plan to start each job on some node at some time in the future. If jobs finish early, other jobs can start sooner than scheduled, and if there is space around higher-priority jobs, lower-priority jobs can be filled in.
If you want to know when Slurm plans to run your job, and why that is not right now, you can use the --start option for the squeue command:
$ squeue -j 1719584 --start
  JOBID PARTITION     NAME     USER ST          START_TIME NODES SCHEDNODES NODELIST(REASON)
1719584     short snakemak flastnam PD 2024-01-22T10:20:00     1 phoenix-00 (Priority)
The START_TIME column is the time by which Slurm is sure it will be able to start your job if no higher-priority jobs come in first, and the NODELIST(REASON) column shows the nodes the job is running on, or the reason it is not running now, in parentheses. In this case, the job is not running because higher-priority jobs are in the way.
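If you check this often, the two columns of interest can be pulled out with a short filter. This is a sketch that assumes the default column layout shown in the sample output and job names without spaces; the function name start_and_reason is hypothetical.

```shell
# Read `squeue --start` output on stdin and print, for each job,
# the START_TIME column (6th field) and the NODELIST(REASON) column (last field).
start_and_reason() {
    awk 'NR > 1 { print $6, $NF }'   # NR > 1 skips the header line
}

# Sample output copied from the example in this section. On the cluster, use:
#   squeue -j 1719584 --start | start_and_reason
start_and_reason <<'EOF'
  JOBID PARTITION     NAME     USER ST          START_TIME NODES SCHEDNODES NODELIST(REASON)
1719584     short snakemak flastnam PD 2024-01-22T10:20:00     1 phoenix-00 (Priority)
EOF
# prints: 2024-01-22T10:20:00 (Priority)
```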