Scheduling Jobs

Anatomy of a job

Jobs are made of one or multiple sequential steps, each consisting in one or multiple parallel tasks that will be dispatched to possibly distinct nodes.

Each task will be allocated CPUs, memory, and possible other generic resources in an exclusive manner by Slurm. Two jobs cannot share the same resource unless explicitly forced by the admins, but that is generally not the case. Therefore, jobs can only start when all needed resources are free and not needed by another higher priority job. Jobs are indeed assigned a priority when they are submitted, which can depend upon multiple factors.

Each job has two parts:

Resource requests: describe the amount of computing resource (CPUs, memory, expected run time, etc.) that the job will need to successfully run.
Job steps: describe individual tasks that must be executed into a job. You can execute a job step with the SLURM command srun. A job can has one or more steps, each consisting in one or more tasks each using one or more CPU, GPU, etc.

Job Types

There are a two popular types of jobs you could submit to the HPC cluster:

batch which are unattended shell scripts
interactive where you run and test out your workflows live interactively on the command line,

Starting a SLURM Job

There are two ways of starting jobs with SLURM; either interactively with srun or as a script with sbatch commands.

Defining and submitting A Batch job

For the scheduling process to work properly, you will need to describe your job before you submit it:

what are the steps (i.e. which program must be run and how) ;
how many tasks there will be ;
what resource each task needs (CPU, memory, etc.), and
for how long the job is supposed to run.

All of these, along with potentially aditionnal job parameter submission options, can be described in a submission script.

A submission script is a shell script, e.g. a Bash script, whose comments, if they are prefixed with #SBATCH, are understood by Slurm as directives describing resource requests and other submissions options. You can get the complete list of parameters from the sbatch manpage man sbatch.

Important

The #SBATCH directives must appear at the top of the submission file, before any other line except for the very first line which should be the shebang (e.g. #!/bin/bash ).

The script itself is a job step. Other job steps are created with the srun command, that takes as argument the name and options of the program to run.

Note

If there is only one step, the srun command can be omitted, but it has consequences in how signals are handled and how accurate reporting is. It is often best to use it in all cases.

Below is a slurm script template to Submit a batch job from the login node by calling sbatch <script_name>.slurm.

$cat script.slurm

#!/bin/bash

#SBATCH --partition=debug      # partition name. Eg. Debug, bio, bigmem
#SBATCH --job-name=demosample        # job name
#SBATCH --nodes=2               # number of nodes allocated for this job
#SBATCH --ntasks=2              # total number of tasks / mpi processes
#SBATCH --cpus-per-task=8       # number OpenMP Threads per process
#SBATCH --time=08:00:00         # total run time limit ([[D]D-]HH:MM:SS)
##SBATCH --gres=gpu:tesla:2      # number of GPUs
##The above line and this  will be ignored by Slurm
# Get email notification when job begins, finishes or fails
#SBATCH --mail-type=ALL         # type of notification: BEGIN, END, FAIL, ALL
#SBATCH --mail-user=your@mail   # e-mail address
SBATCH --chdir=<working directory>
#SBATCH --export=all
#SBATCH --output=<file> # where STDOUT goes
#SBATCH --error=<file> # where STDERR goes


# Modules to use (optional).
#<e.g., module load singularity>

# Run the application.
#<my_programs>
echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH
srun  hostname
sleep  60

For instance, the following batch script, hypothetically named submit.slurm,

#!/bin/bash
#
#SBATCH --job-name=testrun
#SBATCH --output=result.txt
#SBATCH --partition=debug
#
#SBATCH --time=5:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=200

srun hostname
srun sleep 30

describes a job made of 2+1 steps (the submission script itself plus two explicit steps created by calling srun twice), each step consisting of only one task that needs one CPU and 200MB of RAM. The first step will run the hostname command, and the second one the useless sleep command. The job is supposed to run for 5 minutes on the debug partition, be named testrun, and create an output file named result.txt.

Important

It is important to note that as the job will run unattended, it will not be attached to a terminal (screen) so everything that the program that is run would like to write to the terminal will be redirected by Slurm to a file, whose name is specified by the –output parameter. (Note that for the same reasons, the program cannot expect input from the user through the keyboard to run properly.)

Once the submission script is written properly, you need to submit it to slurm through the sbatch command, which, upon success, responds with the jobid attributed to the job. (The dollar sign below is the shell prompt)

$ sbatch submit.slurm
sbatch: Submitted batch job 12321

Warning

Make sure to submit the job with sbatch and not bash; also do not execute it directly. This would ignore all resource request and your job would run with minimal resources, or could possible run on the login node rather than on a compute node.

The job then enters the Job queue in the PENDING state which be verified with;

$ squeue --me

Once resources become available and the job has highest priority, an allocation is created for it and switches to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state, otherwise, it is set to the FAILED state.

Slurm Arguments

These are the common and recommended arguments suggested at a minimum to get a job in any form.

A rgume nts	Co mmand Fl ags	Notes
Ac count	-A or –ac count	What lab are you part of? If you run the groups command you can see what groups (usually labs) you’re a member of, these are associated with resource limits on the cluster.
Part ition	-p or - -part ition	What resource partition are you interested in using? This could be anything you see when you run sinfo -s as each partition corresponds to a class of nodes (e.g., high memory, GPU).
Nodes	-N or – nodes	How many nodes are these resources spread across? In the overwhelming number of cases this is 1 (for a single node) but more sophisticated multi-node jobs could be run if your code supports it.
Cores	c or –cpu s-per -task	How many compute cores do you need? Not all codes can make use of multiple cores and if they do, the performance of the code is not always linear with the resources requested. If in doubt consider contacting the research computing team to assist in this optimization.
M emory	–mem	How much memory do you need for this job? This is in the format size[units] were size is a number and units are either M, G, or T for megabyte, gigabyte, and terabyte respectively. Megabyte is the default unit if none is provided.
Time	-t or - -time	What’s the maximum runtime for this job? Common acceptable time formats include hours:minutes:seconds, days-hours, and minutes.

Slurm Environment Variables

When a job scheduled by Slurm begins, it needs to about how it was scheduled, what its working directory is, who submitted the job, the number of nodes and cores allocated to it, etc. This information is passed to Slurm via environment variables. Additionally, these environment variables are also used as default values by programs like mpirun. To view a node’s Slurm environment variables, use export | grep SLURM. A comprehensive list of the environment variables Slurm sets for each job can be found at the end of the sbatch man page.

Interactive jobs

Slurm jobs are normally batch jobs in the sense that they are run unattended. If you want to have a direct view on your job, for tests or debugging, you have two options.

If you need simply to have an interactive Bash session on a compute node, with the same environment set as the batch jobs, run the following command:

srun --pty bash -l

Doing that, you are submitting a 1-CPU, default memory, default duration job that will return a Bash prompt when it starts.

If you need more flexibility, you will need to use the salloc command. The salloc command accepts the same parameters as sbatch as far as resource requirement are concerned. Once the allocation is granted, you can use srun the same way you would in a submission script.

Starting an interactive job

You can run an interactive job like this:

$ srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i

Here we ask for a single core on one interactive node for one hour with the default amount of memory. The command prompt will appear as soon as the job starts.

This is how it looks once the interactive job starts:

srun: job 12345 queued **and** waiting **for** resources
srun: job 12345 has been allocated resources

Exit the bash shell to end the job. If you exceed the time or memory limits the job will also abort.

Interactive jobs have the same policies as normal batch jobs, there are no extra restrictions. You should be aware that you might be sharing the node with other users, so play nice.

Some users have experienced problems with the command, then it has helped to specify the cpu account:

$ srun --account=<NAME_OF_MY_ACCOUNT> --nodes=1 --ntasks-per-node=1
--time=01:00:00 --pty bash -i

Interactive Jobs (Single Node)

Resources for interactive jobs are attained either using salloc. To request a compute node from the bio partition (biol) interactively consider the example below.

# Below replace the word account with an account name you belong to

# Use sinfo to see your accounts and partitions

You are asking slurm for 4 compute cores with 12GB of memory for 2 hours and 45 minutes spread across 1 node (single machine). The salloc command will automatically create an interactive shell session on an allocated node.

Interactive Jobs (Multi Node)

Building upon the previous section, if -N or –nodes is >1 when running salloc you are automatically placed into a shell of one of the allocated nodes. This shell is NOT part of a Slurm task. To view the names of the remainder of your allocated nodes use scontrol show hostnames. The srun command can be used to execute a command on all of the allocated nodes as shown in the example session below.

[user@allot ~]$ salloc -N 2 -p full -A stf --time=5 --mem=5G
salloc: Pending job allocation 2620960
salloc: job 2620960 queued and waiting for resources
salloc: job 2620960 has been allocated resources
salloc: Granted job allocation 2620960
salloc: Waiting for resource configuration
salloc: Nodes giga[002-003] are ready for job

[user@allot ~]$ srun hostname
giga002
giga003

If you are not allocated a session with the specified –mem value, try smaller memory values

For more details, read the salloc man page.

Keeping interactive jobs alive

Interactive jobs die when you disconnect from the login node either by choice or by internet connection problems. To keep a job alive you can use a terminal multiplexer like tmux or screen.

tmux allows you to run processes as usual in your standard bash shell

You start tmux on the login node before you get a interactive slurm session with srun and then do all the work in it. In case of a disconnect you simply reconnect to the login node and attach to the tmux session again by typing:

tmux attach

or in case you have multiple sessions running:

tmux list-session
tmux attach -t SESSION_NUMBER

As long as the tmux session is not closed or terminated (e.g. by a server restart) your session should continue.

To log out a tmux session without closing it you have to press CTRL-B (that the Ctrl key and simultaneously “b”, which is the standard tmux prefix) and then “d” (without the quotation marks). To close a session just close the bash session with either CTRL-D or type exit. You can get a list of all tmux commands by CTRL-B and the ? (question mark). See also this page for a short tutorial of tmux. Otherwise working inside of a tmux session is almost the same as a normal bash session.

Job related environment variables

Here we list some environment variables that are defined when you run a job script. These is not a complete list. Please consult the SLURM documentation for a complete list.

Variable	Description
$SLURM_JOB_ID	The Job ID.
$SLURM_JOBID	Deprecated. Same as $SLURM_JOB_ID
$SLURM_SUBMIT_DIR	The path of the job submission directory.
$SLURM_SUBMIT_HOST	The hostname of the node used for job submission.
$SLURM_JOB_NODELIST	Contains the definition (list) of the nodes that is assigned to the job.
$SLURM_NODELIST	Deprecated. Same as SLURM_JOB_NODELIST.
$SLURM_CPUS_PER_TASK	Number of CPUs per task.
$SLURM_CPUS_ON_NODE	Number of CPUs on the allocated node.
$SLURM_JOB_CPUS_PER_NODE	Count of processors available to the job on this node.
$SLURM_CPUS_PER_GPU	Number of CPUs requested per allocated GPU.
$SLURM_MEM_PER_CPU	Memory per CPU. Same as –mem-per-cpu .
$SLURM_MEM_PER_GPU	Memory per GPU.
$SLURM_MEM_PER_NODE	Memory per node. Same as –mem .
$SLURM_GPUS	Number of GPUs requested.
$SLURM_NTASKS	Same as -n, –ntasks. The number of tasks.
$SLURM_NTASKS_PER_NODE	Number of tasks requested per node.
$SLURM_NTASKS_PER_SOCKET	Number of tasks requested per socket.
$SLURM_NTASKS_PER_CORE	Number of tasks requested per core.
$SLURM_NTASKS_PER_GPU	Number of tasks requested per GPU.
$SLURM_NPROCS	Same as -n, –ntasks. See $SLURM_NTASKS.
$SLURM_NNODES	Total number of nodes in the job’s resource allocation.
$SLURM_TASKS_PER_NODE	Number of tasks to be initiated on each node.
$SLURM_ARRAY_JOB_ID	Job array’s master job ID number.
$SLURM_ARRAY_TASK_ID	Job array ID (index) number.
$SLURM_ARRAY_TASK_COUNT	Total number of tasks in a job array.
$SLURM_ARRAY_TASK_MAX	Job array’s maximum ID (index) number.
$SLURM_ARRAY_TASK_MIN	Job array’s minimum ID (index) number.