Slurm Best Practices
PLEASE READ THIS CAREFULLY! It will guide you on how to best use Slurm to get your work done as efficiently as possible without causing issues for other users.
Job Prioritization
Each queued job has a priority score. Jobs start when sufficient resources (CPUs, GPUs, memory, licenses) are available and not already reserved for jobs with a higher priority.
To see the priorities of your currently pending jobs you can use the command sprio -u $USER.
Job Age
Job priority slowly rises with time as a pending job gets older -1 point per hour for up to 3 weeks.
Job Size or "TRES" (Trackable RESources)
This slightly favors jobs which request a larger count of CPUs (or memory or GPUs) as a means of countering their otherwise inherently longer wait times.
Whole-node jobs and others with a similarly high count of cores-per-node will get a priority boost (visible in the “site factor” of sprio). This is to help whole-node jobs get ahead of large distributed jobs with many tasks spread over many nodes.
Nice values
It is possible to give a job a "nice" value which is subtracted from its priority. You can do that with the --nice option of sbatch or the scontrol update command. The command scontrol top <jobid> adjusts nice values to increase the priority of one of your jobs at the expense of any others you have in the same partition.
Holds
Jobs with a priority of 0 are in a "held" state and will never start without further intervention. You can hold jobs with the command scontrol hold <jobid> and release them with scontrol release <jobid>. Jobs can also end up in this state when they get requeued after a node failure.