Cluster Etiquette
When running jobs on the cluster, you must be very aware of how those jobs will affect other users.
1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores and how much time it takes to run. Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long each job may run, how much RAM it may use and how many CPU cores it needs. That way, Slurm can stop jobs that inadvertently run too long or use too many resources.
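For example, if your test run needed about 6 GB of RAM, 4 cores and 90 minutes, the batch file header might declare limits a little above that. The numbers below are placeholders, not recommendations; substitute whatever your own test run showed:
#!/bin/bash
#SBATCH --job-name=my_test_job
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00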
2: Don't run too many jobs at once if they use a lot of disk I/O. If every job reads in a 100GB file and you launch 20 of them at the same time, you could bring down the file server serving /private/groups. In that case, run only around 5 at once, or introduce a random delay at the start of your jobs. You can limit your concurrent jobs by specifying something like this in your job batch file:
#!/bin/bash
#SBATCH --array=[1-279]%10
# 279 array tasks, but the %10 means at most 10 run at the same time
inputList=$1                                            # file with one input per line
input=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$inputList")  # pick the line for this array task
some_command "$input"
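If you go the random-delay route instead, one minimal sketch (assuming a bash batch script) is to sleep for a random number of seconds before touching the shared filesystem, so array tasks don't all hit /private/groups at once:
sleep $(( RANDOM % 300 ))   # wait 0-299 seconds before starting the real work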