Overview of using Slurm
To use Slurm, log in to the Slurm head node (currently phoenix-01.gi.ucsc.edu, which is a one-node cluster at the moment). Once you have ssh'd in, you can run Slurm batch or interactive commands.
Submit a Slurm Batch Job
To submit a Slurm batch job, you will need a directory that you have read and write access to on all the nodes (this will often be a shared space). Let's say I have a batch named "experiment-1". I would create that directory in my group's area:
% mkdir /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1
% cd /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1
Then create your job submission batch file. Mine is called 'slurm-test.sh':
% vim slurm-test.sh
Then populate the file as necessary:
#!/bin/bash
# Job name:
#SBATCH --job-name=weiler_test
#
# Partition - This is the queue it goes in:
#SBATCH --partition=batch
#
# Where to send email (optional)
#SBATCH --mail-user=weiler@ucsc.edu
#
# Number of nodes you need per job:
#SBATCH --nodes=1
#
# Memory needed for the jobs. Try very hard to make this accurate. DEFAULT = 4gb
#SBATCH --mem=4gb
#
# Number of tasks (one for each GPU desired for use case) (example):
#SBATCH --ntasks=1
#
# Processors per task:
# At least eight times the number of GPUs needed for nVidia RTX A5500
#SBATCH --cpus-per-task=1
#
# Number of GPUs, this can be in the format of "gpu:[1-4]", or "gpu:K80:[1-4]" with the type included (optional)
#SBATCH --gres=gpu:1
#
# Standard output and error log
#SBATCH --output=serial_test_%j.log
#
# Wall clock limit in hrs:min:sec:
#SBATCH --time=00:00:30
#
## Command(s) to run (example):
pwd; hostname; date
module load python
echo "Running test script on a single CPU core"
python /public/groups/clusteradmin/weiler/slurm-jobs/experiment-1/mytest.py
date
Keep the "#SBATCH" lines commented. Bash treats them as ordinary comments, but the scheduler reads them anyway. If you don't need a particular option, just leave it out of the file.
To submit the batch job:
% sbatch slurm-test.sh
Submitted batch job 7
The job(s) will then be scheduled. You can check the state of the queue like this:
% squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    7     batch weiler_t   weiler  R       0:07      1 phoenix-01
The job writes any STDOUT or STDERR to the directory you launched it from. With the --output setting in the batch file above, both streams go to serial_test_<jobid>.log. Beyond that, the job does whatever its commands do, even if it prints nothing.
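As a small sketch of how to find that log once the job is submitted: the two helper functions below are hypothetical (they are not part of Slurm). They pull the job ID out of the "Submitted batch job <id>" line that sbatch prints and build the matching log name, assuming the --output=serial_test_%j.log pattern from the batch file above.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): extract the numeric job ID
# from sbatch's "Submitted batch job <id>" confirmation line.
job_id_from_sbatch() {
    # The ID is the fourth whitespace-separated field of that line.
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Hypothetical helper: build the log file name that results from
# "#SBATCH --output=serial_test_%j.log", where %j expands to the job ID.
log_file_for_job() {
    printf 'serial_test_%s.log\n' "$1"
}

# Example usage on the head node (commented out so this sketch has no side effects):
#   jobid=$(job_id_from_sbatch "$(sbatch slurm-test.sh)")
#   squeue -j "$jobid"                      # show only this job in the queue
#   tail -f "$(log_file_for_job "$jobid")"  # follow the log as the job runs
```

If you change the --output pattern in your batch file, adjust log_file_for_job to match.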
CGROUPS and Resource Management
Our installation of Slurm uses Linux CGROUPS, which put a hard resource cap on jobs. If you declare that your job needs 4GB of RAM and it uses 5GB, it will fail with an out-of-memory (OOM) error. The same goes for CPU and GPU resources: if your job ends up using more than you specify, it will fail. This keeps runaway jobs that use more resources than you expected from crashing the nodes.
So... TEST YOUR JOBS! Find out how many resources a single job needs before you launch 100 of them.
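One possible way to size the memory request after a single test run is sketched below. It assumes job accounting is enabled on this cluster so that sacct can report MaxRSS (I have not confirmed that here), and that MaxRSS is printed in kilobytes with a trailing "K". The helper name suggest_mem_mb and the 20% headroom are illustrative choices, not a site policy.

```shell
#!/bin/bash
# Hypothetical helper: convert a MaxRSS value in kilobytes (for example
# "1024000K") into a --mem request in megabytes with about 20% headroom.
# Assumption: the value carries a "K" suffix; other units are not handled.
suggest_mem_mb() {
    kb=${1%K}                                  # strip the trailing "K"
    # Add 20% headroom, then convert KB to MB, rounding up.
    echo $(( (kb * 12 / 10 + 1023) / 1024 ))
}

# Example usage after a test job (ID 7 here) has finished, commented out
# so this sketch has no side effects:
#   sacct -j 7 --format=JobID,MaxRSS,Elapsed,State
#   suggest_mem_mb 1024000K     # then use the result as: #SBATCH --mem=<result>M
```

Rerun the test job with the new --mem value before scaling up, since a single run may not show the peak usage of every input.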