Slurm Best Practices
PLEASE READ THIS CAREFULLY! It will guide you on how to best use Slurm to get your work done as efficiently as possible without causing issues for other users.
Job Prioritization
Each queued job has a priority score. Jobs start when sufficient resources (CPUs, GPUs, memory, licenses) are available and not already reserved for jobs with a higher priority.
To see the priorities of your currently pending jobs you can use the command sprio -u $USER.
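For example, to break the priority of a pending job down into its individual factors, and to see the weights the scheduler applies to each factor, you could run the following (the exact columns shown depend on the Slurm version and site configuration):
sprio -u $USER -l
sprio -w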
Job Age
Job priority slowly rises as a pending job gets older, gaining 1 point per hour for up to 3 weeks (a maximum of about 504 points).
Job Size or "TRES" (Trackable RESources)
This slightly favors jobs which request a larger count of CPUs (or memory or GPUs) as a means of countering their otherwise inherently longer wait times.
Whole-node jobs and others with a similarly high count of cores-per-node will get a priority boost (visible in the “site factor” of sprio). This is to help whole-node jobs get ahead of large distributed jobs with many tasks spread over many nodes.
Nice values
It is possible to give a job a "nice" value which is subtracted from its priority. You can do that with the --nice option of sbatch or the scontrol update command. The command scontrol top <jobid> adjusts nice values to increase the priority of one of your jobs at the expense of any others you have in the same partition.
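As a sketch, assuming a placeholder batch script my_job.sh and a placeholder job ID, the three approaches look like this:
sbatch --nice=100 my_job.sh
scontrol update JobId=<jobid> Nice=100
scontrol top <jobid>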
Holds
Jobs with a priority of 0 are in a "held" state and will never start without further intervention. You can hold jobs with the command scontrol hold <jobid> and release them with scontrol release <jobid>. Jobs can also end up in this state when they get requeued after a node failure.
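To check whether any of your pending jobs are held, the reason column of squeue is usually the quickest place to look; held jobs typically show a reason such as JobHeldUser or JobHeldAdmin:
squeue -u $USER -t PENDING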
Slurm Interactive Sessions
A Slurm interactive session reserves resources on compute nodes, allowing you to use them interactively as you would the login node.
There are two main commands that can be used to start a session, srun and salloc, both of which accept most of the same options as sbatch.
WARNING: Once it starts, an interactive session will consume the entire requested block of CPU time and other resources, even if they go unused, unless you exit it early. To avoid unnecessary charges to your project, don't forget to exit an interactive session once you are finished.
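For example, you can end a session by leaving the interactive shell, or cancel the job from anywhere using its job ID (shown here as a placeholder):
exit
scancel <jobid>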
Using srun --pty bash
srun will add your resource request to the queue. When the allocation starts, a new bash session will start up on one of the granted nodes.
For example:
srun --job-name "InteractiveJob" --cpus-per-task 8 --mem-per-cpu 1500 --time 24:00:00 --pty bash
You will receive a message:
srun: job 10256812 queued and waiting for resources
And when the job starts:
srun: job 10256812 has been allocated resources
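At this point your shell is running on one of the allocated compute nodes. A quick way to confirm is to print the node's hostname and list your jobs (output will of course vary by system):
hostname
squeue -u $USER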
For a full description of srun and its options, see the SchedMD documentation.
Using salloc
salloc functions similarly to srun --pty bash in that it will add your resource request to the queue. However, when the allocation starts, the new bash session starts on the login node rather than a compute node. This is useful for running a GUI on the login node while running your processes on the compute nodes.
For example:
salloc --job-name "InteractiveJob" --cpus-per-task 8 --mem-per-cpu 1500 --time 24:00:00
You will receive a message:
salloc: Pending job allocation 10256925
salloc: job 10256925 queued and waiting for resources
And when the job starts:
salloc: job 10256925 has been allocated resources
salloc: Granted job allocation 10256925
Note that you are still on the login node (mustard or emerald); however, you will now have permission to ssh to any node on which you have a session.
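For example, assuming the allocation above, you could look up the node(s) granted to the job and then connect to one of them (the job ID and node name are placeholders and will differ on your system):
squeue -j <jobid> -o "%N"
ssh <nodename>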