Cluster Etiquette: Difference between revisions

From UCSC Genomics Institute Computing Infrastructure Information

Revision as of 17:24, 6 March 2024

When running jobs on the cluster, you must be very aware of how those jobs will affect other users.

1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores, and how much time it takes to run. Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long each job may run, how much RAM it may use, and how many cores it may use. That way, Slurm can stop jobs that inadvertently run too long or use too many resources.
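For example, the limits you measured in your test run can go at the top of the batch file as #SBATCH directives. A minimal sketch; the time, memory, and core values below are placeholders, and you should substitute what your own test job actually used:

```shell
#!/usr/bin/env bash
# Resource limits measured from a single test run (placeholder values).
#SBATCH --time=02:00:00      # wall-clock limit; Slurm kills the job after this
#SBATCH --mem=8G             # RAM limit; the job is stopped if it exceeds this
#SBATCH --cpus-per-task=4    # number of CPU cores to allocate

some_command "$input"
```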

2: Don't run too many jobs at once if they use a lot of disk I/O. If every job reads in a 100GB file and you launch 20 of them at the same time, you could bring down the file server serving /private/groups. In that case, run only about five at once, or introduce a random delay at the start of each job. You can limit your concurrent jobs by specifying something like this in your job batch file:

#SBATCH --array=1-279%10

inputList=$1
input=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$inputList")
some_command "$input"
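The `%10` after the array range caps the number of array tasks running simultaneously at ten. The random-delay alternative mentioned above can be sketched like this; the 60-second window is an assumption, so tune it to your workload:

```shell
#!/usr/bin/env bash
# Stagger job start times so many array tasks don't all hit the
# file server in the same instant. 60 seconds is a placeholder window.
max_delay=60
delay=$(( RANDOM % max_delay ))
echo "Sleeping ${delay}s before starting to spread out disk I/O"
sleep "$delay"

# ...then read the input file and run the real work as usual.
```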

3: Don't use too much storage. Use http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi to look at how your storage use is divided among your directories, and clean up large chunks of data that you do not need.
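The page above shows how the group's usage breaks down; you can also check a directory yourself from a login node with standard tools. A minimal sketch using GNU `du` (the path is a placeholder for your own directory):

```shell
# Summarize disk usage of each top-level subdirectory, largest first,
# to find big chunks of data worth cleaning up.
# Replace /private/groups/mylab with your own directory.
du -h --max-depth=1 /private/groups/mylab 2>/dev/null | sort -rh | head -n 20
```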