Cluster Etiquette
Latest revision as of 17:28, 2 August 2024
When running jobs on the cluster, you must be very aware of how those jobs will affect other users.
1: Always test your job by running one first. Just one. Note how much RAM, how many CPU cores and how much time it takes to run. Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long the job may run, how much RAM it may use and how many CPU cores it may use. Slurm can then stop jobs that inadvertently run too long or use too many resources.
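As a rough illustration of turning the numbers from a single test run into limits, here is a minimal sketch. The helper name `suggest_limits` and the 25% headroom are assumptions made for this example, not site policy or part of Slurm. The values for a finished test job can usually be read with something like `sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,AllocCPUS`.

```shell
#!/bin/bash
# Sketch: turn the measurements from one test run into #SBATCH limits.
# suggest_limits is a hypothetical helper, not part of Slurm.
# Usage: suggest_limits <peak_ram_mb> <runtime_minutes> <cpu_cores>
suggest_limits() {
    local ram_mb=$1 minutes=$2 cores=$3
    # Add 25% headroom (an arbitrary illustrative margin), rounding up.
    local ram_limit=$(( (ram_mb * 5 + 3) / 4 ))
    local time_limit=$(( (minutes * 5 + 3) / 4 ))
    echo "#SBATCH --mem=${ram_limit}M"           # memory limit in megabytes
    echo "#SBATCH --time=${time_limit}"          # time limit in minutes
    echo "#SBATCH --cpus-per-task=${cores}"      # cores per task
}
```

For example, `suggest_limits 8000 90 4` prints a memory limit of 10000M, a time limit of 113 minutes and 4 CPU cores per task. Those lines can then be pasted near the top of the batch file.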
2: Don't run too many jobs at once if they use a lot of disk I/O. If every job reads in a 100 GB file, and you launch 20 of them at the same time, you could bring down the /private/groups filesystem. In that case, run perhaps only 5 at once, or introduce a random delay at the start of your jobs. You can limit your concurrent jobs by specifying something like this in your job batch file:
    #SBATCH --array=[1-279]%10

    inputList=$1
    input=$(sed -n "$SLURM_ARRAY_TASK_ID"p $inputList)
    some_command $input
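The random delay mentioned in item 2 could look something like the sketch below. The helper name `random_delay`, the 20-task array and the 120-second cap are assumptions chosen for illustration. The `%5` suffix on `--array` asks Slurm to run at most 5 array tasks at the same time.

```shell
#!/bin/bash
#SBATCH --array=1-20%5   # 20 tasks, at most 5 running at the same time

# Hypothetical helper: print a random whole number of seconds in [0, max).
random_delay() {
    local max=$1
    echo $(( RANDOM % max ))
}

# Only sleep when running as a Slurm array task, so that sourcing or
# testing this file outside Slurm does not pause.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    sleep "$(random_delay 120)"   # 120 s is an arbitrary illustrative cap
fi

# ... the I/O-heavy command would follow here ...
```

Staggering the start this way spreads the initial reads of the large input file over a couple of minutes instead of having every task hit the filesystem at the same instant.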
3: Please do not pin cluster resources with interactive jobs and let them sit idle. Sometimes folks will open an interactive cluster job with a week-long runtime and just let it sit in order to "hold" a spot in the queue for when they might eventually want to run something. This is a waste of resources, and it also forces others who have work ready to go to wait in the queue while nodes sit idle. If you use an interactive job via srun or salloc, please start your work as soon as the job launches and close the job as soon as that work is complete.
4: Don't use too much storage. Use http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi to look at how your storage use is divided among your directories, and clean up large chunks of data that you do not need.
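If you prefer to check from the command line as well, a minimal sketch using the standard `du`, `sort` and `head` tools is shown here. The helper name `largest_dirs` and the example path are hypothetical; the web page above remains the reference for how usage is reported.

```shell
#!/bin/bash
# Hypothetical helper: list the N largest immediate subdirectories of a
# directory (default 10), sizes in KiB, biggest first.
largest_dirs() {
    local dir=$1 count=${2:-10}
    # du -sk prints one "<size-in-KiB><TAB><path>" line per argument.
    du -sk "$dir"/*/ 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_dirs /private/groups/mylab/myuser 5` would list the five largest subdirectories under that (made-up) path. Note that walking a large directory tree with `du` is itself disk I/O, so run it sparingly.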