Cluster Etiquette

From UCSC Genomics Institute Computing Infrastructure Information

When running jobs on the cluster, you must be very aware of how those jobs will affect other users.

1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores, and how much time it takes to run. Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long each job may run, how much RAM it may use, and how many CPU cores it needs. That way, Slurm can stop jobs that inadvertently run too long or use too many resources.
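For example, if your test run needed roughly 4 GB of RAM, one core, and under two hours (hypothetical numbers; substitute what you actually measured), the top of the batch file might look like this:

#!/bin/bash
# Limits based on the measured test run; Slurm stops the job if it exceeds them.
#SBATCH --time=2:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1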

2: Don't run too many jobs at once if they use a lot of disk I/O. If every job reads in a 100 GB file and you launch 20 of them at the same time, you could bring down the /private/groups filesystem. In that case, run only about 5 at once, or introduce a random delay at the start of your jobs (see the sketch after the example below). You can limit your concurrent jobs by specifying something like this in your job batch file:

# 279 array tasks in total, but at most 10 allowed to run at once
#SBATCH --array=1-279%10

inputList=$1

# Pick the input line that corresponds to this array task's index.
input=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$inputList")

some_command "$input"
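If you go the random-delay route instead, one minimal sketch is to sleep for a random number of seconds (here up to 60, an arbitrary spread) at the top of the script, before any heavy reads begin:

# Stagger startup so tasks don't all hit /private/groups at the same instant.
sleep $(( RANDOM % 60 ))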

3: Don't use too much storage. Use http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi to look at how your storage use is divided among your directories, and clean up large chunks of data that you do not need.
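If you prefer the command line, one way to spot the largest subdirectories is with du; the path /private/groups/yourlab below is a placeholder for your own group directory:

# Summarize each top-level subdirectory and list the largest first.
du -h --max-depth=1 /private/groups/yourlab | sort -rh | head -n 20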