Slurm Best Practices
PLEASE READ THIS CAREFULLY! It will guide you on how to best use Slurm to get your work done as efficiently as possible without causing issues for other users.
Job Prioritization
Each queued job has a priority score. Jobs start when sufficient resources (CPUs, GPUs, memory, licenses) are available and not already reserved for jobs with a higher priority.
To see the priorities of your currently pending jobs, use the command sprio -u $USER.
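For example, to review both the priority breakdown and the resulting queue order (a minimal sketch; the --sort=-p option asks squeue to list the highest-priority jobs first):

 sprio -u $USER                         # per-job priority components for your pending jobs
 squeue -u $USER -t PENDING --sort=-p   # your pending jobs, highest priority first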
Job Age
Job priority slowly rises as a pending job gets older, at a rate of 1 point per hour for up to 3 weeks.
Job Size or "TRES" (Trackable RESources)
This slightly favors jobs which request a larger count of CPUs (or memory or GPUs) as a means of countering their otherwise inherently longer wait times.
Whole-node jobs and others with a similarly high count of cores-per-node will get a priority boost (visible in the "site factor" of sprio). This is to help whole-node jobs get ahead of large distributed jobs with many tasks spread over many nodes.
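If your job fits on a single node, requesting it as a whole-node job lets it benefit from this boost. A minimal sbatch sketch, assuming a 64-core node (the core count, time limit, and program name are placeholders to adapt to your cluster):

 #!/bin/bash
 #SBATCH --nodes=1
 #SBATCH --exclusive              # claim the entire node
 #SBATCH --cpus-per-task=64       # placeholder: match the node's actual core count
 #SBATCH --time=04:00:00          # placeholder time limit
 srun ./my_program                # my_program stands in for your own executable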
Nice values
It is possible to give a job a "nice" value which is subtracted from its priority. You can do that with the --nice option of sbatch or the scontrol update command. The command scontrol top <jobid> adjusts nice values to increase the priority of one of your jobs at the expense of any others you have in the same partition.
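For example (a sketch; the job ID 12345, the nice values, and the script name job.sh are placeholders):

 sbatch --nice=100 job.sh               # submit with the priority lowered by 100
 scontrol update JobId=12345 Nice=50    # lower the priority of an already-queued job
 scontrol top 12345                     # favor job 12345 over your other jobs in the same partition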
Holds
Jobs with a priority of 0 are in a "held" state and will never start without further intervention. You can hold jobs with the command scontrol hold <jobid> and release them with scontrol release <jobid>. Jobs can also end up in this state when they get requeued after a node failure.
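For example (a sketch; 12345 is a placeholder job ID):

 scontrol hold 12345                    # sets the job's priority to 0 so it will not start
 scontrol release 12345                 # clears the hold so the scheduler considers it again
 squeue -u $USER -t PENDING -o "%i %r"  # the reason column shows holds, e.g. JobHeldUser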