Overview of using Slurm
Latest revision as of 21:01, 17 May 2024
When using Slurm, you will need to log into one of the interactive compute servers in the PRISM area (such as emerald, mustard, crimson or razzmatazz). Once you have ssh'd in, you can run Slurm batch or interactive commands.
You might also want to consult the Quick Reference Guide.
Submit a Slurm Batch Job
To submit a Slurm batch job, you will need to create a directory that you have both read and write access to on all the nodes (this will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my group's area:
% mkdir -p /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1
% cd /private/groups/clusteradmin/weiler/slurm-jobs/experiment-1
Then you will need to create your job submission batch file. It will look something like this. My file is called 'slurm-test.sh':
% vim slurm-test.sh
Then populate the file as necessary:
#!/bin/bash
# Job name:
#SBATCH --job-name=weiler_test
#
# Partition - This is the queue it goes in:
#SBATCH --partition=short
#
# Where to send email (optional)
#SBATCH --mail-user=weiler@ucsc.edu
#
# Number of nodes you need per job:
#SBATCH --nodes=1
#
# Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb
#SBATCH --mem=4gb
#
# Number of tasks (one for each CPU desired for use case) (example):
#SBATCH --ntasks=1
#
# Processors per task:
# At least eight times the number of GPUs needed for nVidia RTX A5500
#SBATCH --cpus-per-task=1
#
# Number of GPUs, this can be in the format of "--gres=gpu:[1-8]", or "--gres=gpu:A5500:[1-8]" with the type included (optional)
#SBATCH --gres=gpu:1
#
# Standard output and error log
#SBATCH --output=serial_test_%j.log
#
# Wall clock limit in hrs:min:sec:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
pwd; hostname; date
echo "Running test script on a single CPU core"
sleep 5
echo "Test done!"
date
Keep the leading "#" on the "#SBATCH" lines. The shell treats them as comments, but the scheduler will read them anyway. If you don't need a particular option, just don't include it in the file.
To submit the batch job:
% sbatch slurm-test.sh
Submitted batch job 7
The job(s) will then be scheduled. You can check the state of the queue like this:
% squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    7     batch weiler_t   weiler  R       0:07      1 phoenix-01
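The job will write any STDOUT or STDERR to a log file in the directory you launched the job from. The "--output" option names that file, so job 7 in this example writes to serial_test_7.log. Other than that, the job simply runs its commands, even if they produce no STDOUT.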
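If you want to script around a submission, the job ID in the "Submitted batch job" message is the piece you usually need. Here is a minimal sketch, assuming the message has the single-cluster form shown above. The helper name job_id_from_message is made up for illustration and is not a Slurm command.

```shell
# Sketch: pull the numeric job ID out of sbatch's confirmation message.
# job_id_from_message is a hypothetical helper, not part of Slurm.
job_id_from_message() {
    # The ID is the last space-separated word of "Submitted batch job 7".
    echo "${1##* }"
}

# In real use the message would come from: message=$(sbatch slurm-test.sh)
message="Submitted batch job 7"
job_id=$(job_id_from_message "$message")

# The example batch file uses --output=serial_test_%j.log; %j is the job ID.
log_file="serial_test_${job_id}.log"
echo "Job ${job_id} will log to ${log_file}"
```

With the ID in hand, "squeue -j $job_id" narrows the queue listing to just that job.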
Launching Several Jobs at Once
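You can launch many jobs at once as a job array with the "--array" option. Each task in the array can tell itself apart from the others using the $SLURM_ARRAY_TASK_ID variable. Add something like the following to your batch submission file: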
#SBATCH --array=0-31
#SBATCH --output=array_job_%A_task_%a.out
#SBATCH --error=array_job_%A_task_%a.err
## Command(s) to run:
echo "I am task $SLURM_ARRAY_TASK_ID"
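A common use of the task ID is to give each task its own input. The sketch below is one way to do that, not the only one: it assumes a list file with one input per line, and the file names and the input_for_task helper are hypothetical. The ":-0" fallback is only there so you can try the script outside of Slurm.

```shell
# Sketch: map an array task ID to one line of an input list.
# In real use the list would be a file you maintain; a temp file with
# made-up names keeps this sketch self-contained.
list_file=$(mktemp)
printf '%s\n' sample-a.txt sample-b.txt sample-c.txt sample-d.txt > "$list_file"

# input_for_task is a hypothetical helper.
# Array task IDs start at 0 here, while sed counts lines from 1.
input_for_task() {
    sed -n "$(( $1 + 1 ))p" "$2"
}

# Slurm sets SLURM_ARRAY_TASK_ID inside each array task.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
input=$(input_for_task "$task_id" "$list_file")

if [ -z "$input" ]; then
    echo "no input for task ${task_id}; check the --array range" >&2
else
    echo "Task ${task_id} would process ${input}"
fi
```

With four inputs as in this sketch, the matching directive would be "#SBATCH --array=0-3".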
CGROUPS and Resource Management
Our installation of Slurm uses Linux CGROUPS, which put a hard resource cap on jobs. If you define that your job will need 4GB of RAM, and it uses 5GB, it will fail with an OOM exception. Likewise with CPU or GPU resources; if your job ends up using more than you specify, it will fail. The same goes for the "--time" batch file option: your job will fail if it runs longer than the limit you specify there. This is to keep the nodes from crashing from runaway jobs that use more resources than you think they will.
So... TEST YOUR JOBS! Find out how many resources a single job needs before you launch 100 of them.
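Once you know the peak memory of a single test job, you can turn it into a "--mem" value with some breathing room. This is a rough sketch, not a site policy: the 20% headroom is an assumed margin, the measurement is made up, and mem_request_mb is a hypothetical helper.

```shell
# Sketch: turn a measured memory peak into a padded --mem request.
# mem_request_mb is a hypothetical helper: peak in KB in, megabytes out.
mem_request_mb() {
    # Add 20% headroom (an assumed margin), then round up to whole MB.
    echo $(( ($1 * 120 / 100 + 1023) / 1024 ))
}

# Made-up measurement from one test job, in kilobytes.
peak_kb=3500000

echo "#SBATCH --mem=$(mem_request_mb "$peak_kb")M"
```

If job accounting is enabled on the cluster (an assumption, check with your admins), "sacct -j <jobid> --format=JobID,MaxRSS,Elapsed,State" is one place to look up the peak memory of a finished job.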
TEST YOUR JOBS!
Let me say that one more time. Test your jobs before launching a bunch of them! If it fails, you don't want it to fail 100 or more times. You can also get a good idea of how much RAM and CPU it will need so you can better define your batch files. It's critical to get a good idea of how many resources each of your jobs will use and define your job file appropriately.
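The same idea applies to the "--time" limit: time one trial run, then pad the result. The sketch below assumes you can run a single trial by hand. The "sleep 1" is a stand-in for your real command, the margins are arbitrary, and to_walltime is a hypothetical helper.

```shell
# Sketch: time a single trial run and suggest a padded --time value.
# to_walltime is a hypothetical helper: seconds in, hrs:min:sec out.
to_walltime() {
    printf '%02d:%02d:%02d\n' "$(( $1 / 3600 ))" "$(( $1 % 3600 / 60 ))" "$(( $1 % 60 ))"
}

# Time one trial run; "sleep 1" stands in for your real command.
start=$(date +%s)
sleep 1
end=$(date +%s)
elapsed=$(( end - start ))

# Double the measurement and add a minute of slack (assumed margins).
limit=$(( elapsed * 2 + 60 ))
echo "#SBATCH --time=$(to_walltime "$limit")"
```

Pick margins that fit how much your run times vary; a job that is killed at 99% done costs more than a slightly generous limit.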