Quick Reference Guide

From UCSC Genomics Institute Computing Infrastructure Information

Revision as of 03:51, 9 March 2023

General Commands

Get documentation on a command:

man <command>

Try the following commands:

man sbatch
man squeue
man scancel

Submitting jobs

The following example script specifies a partition, time limit, memory allocation and number of cores. All your scripts should specify values for these four parameters. You can also set additional parameters as shown, such as the job name and output file. This script performs a simple task: it generates a file of random numbers and then sorts it. A detailed explanation of the script is available here.

#!/bin/bash
#
#SBATCH -p shared # partition (queue)
#SBATCH -c 1 # number of cores
#SBATCH --mem 100 # memory pool for all cores (in MB)
#SBATCH -t 0-2:00 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR
# Append 100000 random numbers to a file, one per line
for i in {1..100000}; do
    echo $RANDOM >> SomeRandomNumbers.txt
done
# Sort the file and print the result to STDOUT
sort SomeRandomNumbers.txt

Now you can submit your job with the command:

sbatch myscript.sh
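When a submission is part of a larger workflow, it is often handy to capture the job ID that sbatch prints so it can be passed to squeue, scancel or sacct later. The sketch below is one possible way to do that. The helper name parse_job_id is hypothetical (it is not a Slurm command), and it assumes sbatch prints either "Submitted batch job <jobid>" or, with --parsable, the bare job ID optionally followed by ";cluster".

```shell
#!/bin/bash
# parse_job_id (hypothetical helper): read sbatch output on stdin and print
# the first run of digits found on the first matching line.
parse_job_id() {
    sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Only attempt a real submission when sbatch and the script are available.
if command -v sbatch >/dev/null 2>&1 && [ -f myscript.sh ]; then
    jobid=$(sbatch myscript.sh | parse_job_id)
    echo "Submitted job $jobid"
fi
```

The captured value can then be used wherever the commands on this page expect <jobid>.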

To test your job and find out when it is estimated to run, use the following (note that this does not actually submit the job):

sbatch --test-only myscript.sh

Information on Jobs

List all current jobs for a user:

squeue -u <username>

List all running jobs for a user:

squeue -u <username> -t RUNNING

List all pending jobs for a user:

squeue -u <username> -t PENDING

List all current jobs in the shared partition for a user:

squeue -u <username> -p shared

List detailed information for a job (useful for troubleshooting):

scontrol show jobid -dd <jobid>

List status info for a currently running job:

sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID -j <jobid> --allsteps

Once your job has completed, you can get additional information that was not available during the run. This includes run time, memory used, etc. To get statistics on completed jobs by jobID:

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed

To view the same information for all jobs of a user:

sacct -u <username> --format=JobID,JobName,MaxRSS,Elapsed
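The MaxRSS column is useful for tuning the --mem request of later jobs. As a rough sketch, the hypothetical helper max_rss_kb below reads pipe-delimited sacct output and prints the largest MaxRSS in KB. It assumes the four-column format used above, produced with the sacct options --parsable2 and --noheader, and it assumes MaxRSS values carry an optional K, M or G suffix; check the output on your cluster before relying on it.

```shell
#!/bin/bash
# max_rss_kb (hypothetical helper): read "JobID|JobName|MaxRSS|Elapsed" lines
# on stdin and print the largest MaxRSS value, converted to KB.
max_rss_kb() {
    awk -F'|' '
        $3 != "" {
            value = $3 + 0                  # numeric prefix, e.g. 2048 from "2048K"
            unit = substr($3, length($3))   # last character, e.g. "K"
            if (unit == "M") value *= 1024
            else if (unit == "G") value *= 1024 * 1024
            if (value > max) max = value
        }
        END { printf "%d\n", max }
    '
}

# Intended use on a cluster (placeholder job ID, not run here):
#   sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed --parsable2 --noheader | max_rss_kb
```

Lines with an empty MaxRSS field (typically the top-level job line) are skipped, so only job steps contribute to the result.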

Controlling jobs

To cancel one job:

scancel <jobid>

To cancel all the jobs for a user:

scancel -u <username>

To cancel all the pending jobs for a user:

scancel -t PENDING -u <username>

To cancel one or more jobs by name:

scancel --name myJobName

To hold a particular job from being scheduled:

scontrol hold <jobid>

To release a particular job to be scheduled:

scontrol release <jobid>

To requeue (cancel and rerun) a particular job:

scontrol requeue <jobid>

Job arrays and useful commands

As shown in the commands above, it's easy to refer to one job by its job ID, or to all your jobs via your username. What if you want to refer to a subset of your jobs? The answer is to submit your job set as a job array. Then you can use the job array ID to refer to the set when running SLURM commands.
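A minimal sketch of a job array script follows, reusing the resource settings from the example script above. The --array range, the shortened time limit, the input file naming and the helper input_for_task are all illustrative assumptions, not settings taken from this cluster. Slurm sets SLURM_ARRAY_TASK_ID for each array element; the fallback value of 1 is only there so the script can also be tried outside Slurm.

```shell
#!/bin/bash
#
#SBATCH -p shared              # partition (queue)
#SBATCH -c 1                   # number of cores
#SBATCH --mem 100              # memory pool for all cores (in MB)
#SBATCH -t 0-0:10              # time (D-HH:MM)
#SBATCH --array=1-4            # run four array tasks, indexed 1 to 4
#SBATCH -o slurm.%A_%a.out     # %A = array job ID, %a = array index

# Index of this array element; defaults to 1 when run outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# input_for_task (hypothetical helper): map an array index to an input file name.
input_for_task() {
    printf 'input_%s.txt\n' "$1"
}

echo "Task $task_id would process $(input_for_task "$task_id")"
```

Submitted with sbatch, this produces one job array whose elements can be addressed individually as <jobid>_<index>.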

SLURM job arrays

To cancel an indexed job in a job array:

scancel <jobid>_<index>

e.g.

scancel 1234_4

To find the original submit time for your job array:

sacct -j 32532756 -o submit -X --noheader | uniq

Advanced (but useful!) Commands

The following commands work for individual jobs and for job arrays, and allow easy manipulation of large numbers of jobs. You can combine these commands with the parameters shown above to provide great flexibility and precision in job control. (Note that each of these commands is entered on one line.)

Suspend all running jobs for a user (takes into account job arrays):

squeue -ho %A -u <username> -t R | xargs -n 1 scontrol suspend

Resume all suspended jobs for a user:

squeue -o "%.18A %.18t" -u <username> | awk '{if ($2 =="S"){print $1}}' | xargs -n 1 scontrol resume

After resuming, check if any are still suspended:

squeue -ho %A -u $USER -t S | wc -l
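The awk filter in the resume pipeline can be generalized to any job state. The sketch below wraps it in a hypothetical helper, jobs_in_state, which reads two-column "jobid state" lines (as printed by squeue -o "%.18A %.18t") and prints the IDs whose short state code matches, for example S for suspended, R for running or PD for pending.

```shell
#!/bin/bash
# jobs_in_state (hypothetical helper): read "jobid state" lines on stdin and
# print the job IDs whose state column equals the code given as $1.
jobs_in_state() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Intended use on a cluster (not run here):
#   squeue -o "%.18A %.18t" -u "$USER" | jobs_in_state S | xargs -n 1 scontrol resume
```

Because the header line printed by squeue has "ST" in the state column, it never matches a real state code and is dropped without needing the -h option.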

View Cluster State

shost