Running Jobs
Accessing the Compute Nodes
DeltaAI implements the Slurm batch environment to manage access to the compute nodes. Use Slurm commands to run batch jobs or for interactive access to compute nodes (an “interactive job”). See the Slurm quick start guide for an introduction to Slurm. There are multiple ways to access compute nodes on DeltaAI:
Batch scripts (sbatch) or interactive jobs (srun, salloc).
sbatch: Use batch scripts for jobs that are debugged, ready to run, and don’t require interaction. Go to Sample Scripts for sample Slurm batch job scripts. For mixed-resource heterogeneous jobs, see the Slurm job support documentation. Slurm also supports job arrays for easy management of a set of similar jobs; see the Slurm job array documentation for more information.
srun: srun runs a single command through Slurm on a compute node. srun blocks: it waits until Slurm has scheduled compute resources, and when it returns, the job is complete. srun can be used to launch a shell for interactive access to one or more compute nodes; this is an “interactive job”. The one thing you can’t do in an interactive job created by srun is run srun commands; if you want to do that, use salloc.
salloc: Also interactive. Use salloc when you want to reserve compute resources for a period of time and interact with them using multiple commands. Each command you type after your salloc session begins runs on the login node if it is a normal command, or on your reserved compute resources if prefixed with srun. Type exit when finished with a salloc allocation if you want to end it before the time expires.
Open OnDemand provides compute node access via JupyterLab (VSCode Code Server and the noVNC Desktop virtual desktop coming soon!).
Direct ssh access to a compute node in a running job is enabled once the job has started. See also Monitoring a Node During a Job. In the following example, JobID 12345 is running on node gh001.
$ squeue --job jobid
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 cpu bash gbauer R 0:17 1 gh001
Then in a terminal session:
$ ssh gh001
gh001.delta.internal.ncsa.edu (172.28.22.64)
Site: mgmt Role: compute
$
Partitions (Queues)
You can use sinfo -s to see which partitions are currently available.
Partition/Queue | Node Type | Max Nodes per job | Max Duration | Max Running in Queue/user | Charge Factor |
---|---|---|---|---|---|
ghx4* | GPU | TBD | 48 hr | TBD | 1.0 |
ghx4-interactive | GPU | TBD | 2 hr | TBD | 2.0 |
Default Partition Values
Property | Value |
---|---|
Memory per core | 1000 MB |
Wall-clock time | 30 minutes |
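As a rough illustration of how the memory default plays out, the sketch below estimates the memory a job would receive if it requests cores but passes no --mem value. It assumes the default scales linearly with the number of allocated cores at 1000 MB per core; the function name default_mem_mb is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: estimate the default memory (in MB) for a job that
# requests CPU cores but does not pass --mem.
# Assumption: the 1000 MB per-core default scales linearly with cores.
default_mem_mb() {
    local cores=$1
    echo $(( cores * 1000 ))
}

default_mem_mb 8    # prints 8000
```

If the estimate is lower than what your application needs, request memory explicitly with --mem.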
sview
Use sview for a graphical view of the partitions. See the Slurm - sview documentation for more information.
Job and Node Policies
The default job requeue or restart policy is set to not allow jobs to be automatically requeued or restarted. To enable automatic requeue and restart of a job by Slurm, add the following Slurm option:
--requeue
When a job is requeued due to an event like a node failure, the batch script is initiated from its beginning. Job scripts need to be written to handle automatically restarting from checkpoints.
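One way to make a job script restart-safe is to look for an existing checkpoint at the top of the script. The following is a minimal sketch, not a prescribed method: the checkpoint directory, the ckpt_*.pt naming pattern, and the latest_checkpoint helper are all assumptions for illustration. SLURM_RESTART_COUNT is set by Slurm for restarted or requeued jobs.

```shell
#!/bin/bash
# Hypothetical helper: print the newest checkpoint file in a directory,
# or nothing if none exist. Assumes zero-padded names such as
# ckpt_0001.pt so that glob order matches step order.
latest_checkpoint() {
    local dir=$1 f newest=""
    for f in "$dir"/ckpt_*.pt; do
        [ -e "$f" ] && newest=$f   # globs expand in sorted order
    done
    echo "$newest"
}

# After a requeue the same script runs again from the top, so decide
# here whether to resume. CKPT_DIR is an assumed location.
CKPT_DIR=${CKPT_DIR:-./checkpoints}
resume_from=$(latest_checkpoint "$CKPT_DIR")
echo "restart count: ${SLURM_RESTART_COUNT:-0}"
if [ -n "$resume_from" ]; then
    echo "resuming from $resume_from"
else
    echo "starting from scratch"
fi
```

The application itself still has to write checkpoints and accept a path to resume from; the script only chooses which path to pass.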
Node-sharing is the default for jobs. Node-exclusive mode can be set by specifying all the consumable resources for that node type or adding the following Slurm options:
--exclusive --mem=0
Batch Jobs
Batch jobs are submitted through a job script using the sbatch command.
Job scripts generally start with a series of Slurm directives that describe requirements of the job, such as number of nodes and wall-clock time required, to the batch system/scheduler.
The rest of the batch script consists of user commands. See Sample Scripts for example batch job scripts.
sbatch
Slurm directives can also be specified as options on the sbatch command line. Command line options take precedence over options in the job script.
The syntax for sbatch is sbatch [list of sbatch options] script_name. Refer to the sbatch man page for detailed information on the options.
$ sbatch tensorflow_cpu.slurm
Submitted batch job 2337924
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2337924 ghx4 tfcpu mylogin R 0:46 1 gh006
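Because command line options override the directives in the script, it can help to check what the script already sets before submitting. The sketch below uses a made-up helper, list_directives, that prints the #SBATCH lines of a job script.

```shell
#!/bin/bash
# Hypothetical helper: list the #SBATCH directives in a job script so you
# can see which settings a command-line option would override.
list_directives() {
    grep '^#SBATCH' "$1"
}

# Typical use on a login node (the sbatch line requires Slurm):
#   list_directives tensorflow_cpu.slurm
#   sbatch --time=00:10:00 tensorflow_cpu.slurm   # overrides the script's --time
```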
Useful Batch Job Environment Variables
Description | Slurm Environment Variable | Detailed Description |
---|---|---|
Array JobID | $SLURM_ARRAY_JOB_ID $SLURM_ARRAY_TASK_ID | Each member of a job array is assigned a unique identifier. |
Job Submission Directory | $SLURM_SUBMIT_DIR | By default, jobs start in the directory that the job was submitted from. |
JobID | $SLURM_JOB_ID | Job identifier assigned to the job. |
Machine (node) list | $SLURM_NODELIST | Variable name that contains the list of nodes assigned to the batch job. |
See the sbatch man page for additional environment variables available.
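A job script can log these variables at startup, which makes output files easier to match to jobs later. This is a small sketch; the job_summary function name is an assumption, and the "n/a" fallback only exists so the snippet also runs outside a Slurm job.

```shell
#!/bin/bash
# Print a one-line summary of the Slurm job environment.
# Falls back to "n/a" when a variable is unset (for example, when the
# script is run outside of a Slurm job).
job_summary() {
    echo "job=${SLURM_JOB_ID:-n/a} nodes=${SLURM_NODELIST:-n/a} submit_dir=${SLURM_SUBMIT_DIR:-n/a}"
}

job_summary
```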
Interactive Jobs
Interactive jobs can be implemented in several ways, depending on what is needed. The following examples start a bash shell on a compute node. (Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)
One task with 8 CPU cores, 16 GB of memory, and one GPU on a single node.
srun --account=account_name --partition=ghx4 \
  --nodes=1 --tasks=1 --tasks-per-node=1 \
  --cpus-per-task=8 --mem=16g \
  --gpus-per-node=1 \
  --pty bash
One task with 8 CPU cores, 20 GB of memory, and one GPU on a single node.
srun --account=account_name --partition=ghx4 \
  --nodes=1 --gpus-per-node=1 --tasks=1 \
  --tasks-per-node=1 --cpus-per-task=8 --mem=20g \
  --pty bash
srun
The srun command initiates an interactive job or process on compute nodes.
For example, the following command will run an interactive job in the ghx4 partition with a wall-clock time limit of 30 minutes, using one node, 16 cores per node, and 1 GPU. (Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)
srun -A account_name --time=00:30:00 --nodes=1 --ntasks-per-node=16 \
--partition=ghx4 --gpus=1 --mem=16g --pty /bin/bash
After entering the command, wait for Slurm to start the job. As with any job, an interactive job is queued until the specified number of nodes is available. Specifying a small number of nodes for smaller amounts of time should shorten the wait time because such jobs will backfill among larger jobs. You will see something like this:
$ srun --mem=16g --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 \
--partition=ghx4 --account=account_name \
--gpus-per-node=1 --time=00:30:00 --x11 --pty /bin/bash
[login_name@gh022 bin]$ #<-- note the compute node name in the shell prompt
[login_name@gh022 bin]$ echo $SLURM_JOB_ID
2337913
[login_name@gh022 ~]$ c/a.out 500
count=500
sum= 0.516221
[login_name@gh022 ~]$ exit
exit
$
When you’re finished, use the exit command to end the bash shell on the compute resource and therefore the Slurm srun job.
salloc
Like srun, salloc is interactive, but it allocates compute resources for you while leaving your shell on the login node.
Run commands on the login node as usual, use exit to end a salloc session early, and use srun with no extra flags to launch processes on the compute resources. (Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)
$ salloc --mem=16g --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 \
--partition=ghx4 \
--account=account_name --time=00:30:00 --gpus-per-node=1
salloc: Pending job allocation 2323230
salloc: job 2323230 queued and waiting for resources
salloc: job 2323230 has been allocated resources
salloc: Granted job allocation 2323230
salloc: Waiting for resource configuration
salloc: Nodes gh073 are ready for job
$ hostname #<-- on the login node
gh-login03.delta.ncsa.illinois.edu
$ srun bandwidthTest --htod #<-- on the compute resource, honoring your salloc settings
CUDA Bandwidth Test - Starting...
Running on...
Device 0: NVIDIA H100
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 24.5
Result = PASS
$ exit
salloc: Relinquishing job allocation 2323230
MPI Interactive Jobs: Use salloc Followed by srun
Interactive jobs are already a child process of srun; therefore, you cannot srun (or mpirun) applications from within them.
Within standard batch jobs submitted via sbatch, use srun to launch MPI codes.
For true interactive MPI, use salloc in place of srun, then srun my_mpi.exe after you get a prompt from salloc (exit to end the salloc interactive allocation).
Interactive MPI, salloc and srun
(Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)
[arnoldg@gh-login01 collective]$ cat osu_reduce.salloc
salloc --account=account_name --partition=ghx4 \
--nodes=2 --tasks-per-node=4 \
--cpus-per-task=2 --mem=20g
[arnoldg@gh-login01 collective]$ ./osu_reduce.salloc
salloc: Pending job allocation 1180009
salloc: job 1180009 queued and waiting for resources
salloc: job 1180009 has been allocated resources
salloc: Granted job allocation 1180009
salloc: Waiting for resource configuration
salloc: Nodes cn[009-010] are ready for job
[arnoldg@gh-login01 collective]$ srun osu_reduce
# OSU MPI Reduce Latency Test v5.9
# Size Avg Latency(us)
4 1.76
8 1.70
16 1.72
32 1.80
64 2.06
128 2.00
256 2.29
512 2.39
1024 2.66
2048 3.29
4096 4.24
8192 2.36
16384 3.91
32768 6.37
65536 10.49
131072 26.84
262144 198.38
524288 342.45
1048576 687.78
[arnoldg@gh-login01 collective]$ exit
exit
salloc: Relinquishing job allocation 1180009
[arnoldg@gh-login01 collective]$
Interactive X11 Support
To run an X11-based application on a compute node in an interactive session, use the --x11 switch with srun.
For example, to run a single-task job with 8 CPU cores and 16 GB of memory with X11 (in this case an xterm), do the following. (Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)
srun -A account_name --partition=ghx4 \
--nodes=1 --tasks=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=16g \
--x11 xterm
File System Dependency Specification for Jobs
NCSA requests that jobs specify the file system or systems being used to enable response to resource availability issues.
All jobs are assumed to depend on the HOME file system. Jobs that do not specify a dependency on PROJECTS (/projects) and WORK (/work) will be assumed to depend only on the HOME (/u) file system.
File System | Feature/Constraint Label |
---|---|
PROJECTS (/projects) | projects |
WORK - HDD | workhdd [1] |
WORK - NVME | worknvme [1] |
TAIGA | taiga |
The Slurm constraint specifier and Slurm Feature attribute for jobs are used to add file system dependencies to a job.
Slurm Feature Specification
For already submitted and pending (PD) jobs, use the Slurm Feature attribute as follows:
$ scontrol update job=JOBID Features="feature1&feature2"
For example, to add the projects Feature to an already submitted job:
$ scontrol update job=713210 Features="projects"
To verify the setting:
$ scontrol show job 713210 | grep Feature
Features=projects DelayBoot=00:00:00
Slurm Constraint Specification
To add Slurm job constraint attributes when submitting a job with sbatch (or with srun as a command line argument) use:
#SBATCH --constraint="constraint1&constraint2.."
For example, to add a projects constraint when submitting a job:
#SBATCH --constraint="projects"
To verify the setting:
$ scontrol show job 713267 | grep Feature
Features=projects DelayBoot=00:00:00
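When a job touches more than one file system, the labels are joined with "&". The sketch below shows one way to assemble that string in a submission wrapper; build_constraint is a made-up helper name, and the labels are the ones from the table above.

```shell
#!/bin/bash
# Hypothetical helper: join file system feature labels with "&" to form
# the value for --constraint (all listed features are required).
build_constraint() {
    local IFS='&'
    echo "$*"
}

build_constraint projects workhdd    # prints projects&workhdd

# Typical use on a login node (requires Slurm; job.slurm is a placeholder):
#   sbatch --constraint="$(build_constraint projects workhdd)" job.slurm
```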
Job Management
squeue/scontrol/sinfo
The squeue, scontrol, and sinfo commands display batch job and partition information. The following table lists common commands; see the man pages for other available options.
In squeue results, if the NODELIST(REASON) for a job is MaxGRESPerAccount, the user has exceeded the number of cores or GPUs allotted per user or project for a given partition.
Slurm Command | Description |
---|---|
squeue -a | Lists the status of all jobs on the system. |
squeue -u $USER | Lists the status of all your jobs in the batch system. |
squeue -j JobID | Lists nodes allocated to a running job in addition to basic information. Replace JobID with the ID of the job of interest. |
scontrol show job JobID | Lists detailed information on a particular job. Replace JobID with the ID of the job of interest. |
sinfo -a | Lists summary information on all the partitions. |
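To see at a glance why jobs are waiting, the squeue output can be filtered for pending jobs. This is a sketch that assumes the default squeue column layout shown elsewhere on this page, with the state in column 5 and NODELIST(REASON) last; pending_reasons is a made-up name, and reasons or job names that contain spaces would not be handled correctly.

```shell
#!/bin/bash
# Hypothetical helper: read squeue output in its default column layout
# on stdin and print "JOBID REASON" for each pending (PD) job.
# Assumes column 5 is the state and the last column is NODELIST(REASON).
pending_reasons() {
    awk 'NR > 1 && $5 == "PD" { print $1, $NF }'
}

# Typical use on a login node (requires Slurm):
#   squeue -u "$USER" | pending_reasons
```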
scancel
The scancel command deletes a queued job or terminates a running job. The following example deletes/terminates the job with the associated JobID.
scancel JobID
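When cancelling several jobs from a script, a dry-run switch lets you review the commands first. The wrapper below is only a sketch; cancel_jobs and the DRY_RUN variable are names invented for this example, and scancel is only invoked when DRY_RUN is set to 0.

```shell
#!/bin/bash
# Hypothetical wrapper: cancel every job ID given as an argument.
# With DRY_RUN=1 (the default here) it only prints the commands, so you
# can review them before anything is cancelled.
cancel_jobs() {
    local id
    for id in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "scancel $id"
        else
            scancel "$id"
        fi
    done
}

cancel_jobs 2267585 2267586
```

Run it as DRY_RUN=0 cancel_jobs 2267585 2267586 once the printed commands look right.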
Using Job Dependency to Stagger Job Starts
When submitting multiple jobs, consider using --dependency to prevent all of the jobs from starting at the same time. Staggering the job startup resource load prevents system slowdowns. This is especially recommended for Python users because multiple jobs that load Python on startup can slow down the system if they are all started at the same time.
From the sbatch man page entry for --dependency:
-d, --dependency=<dependency_list>
after:job_id[[+time][:jobid[+time]...]]
After the specified jobs start or are cancelled and 'time' in minutes from job start or cancellation happens,
this job can begin execution. If no 'time' is given then there is no delay after start or cancellation.
The following sample script staggers the start of five jobs by 5 minutes each. You can use this script as a template and modify it to the number of jobs you have. The minimum recommended delay time is 3 minutes; 5 minutes is a more conservative choice.
Sample script that automates the delay dependency
[gbauer@gh-login01 depend]$ cat start
#!/bin/bash
# this is the time in minutes to have Slurm wait before starting the next job after the previous one started.
export DELAY=5 # in minutes
# submit first job and grab jobid
JOBID=`sbatch testjob.slurm | cut -d" " -f4`
echo "submitted $JOBID"
# loop 4 times submitting a job depending on the previous job to start
for count in `seq 1 4`; do
    OJOBID=$JOBID
    JOBID=`sbatch --dependency=after:${OJOBID}+${DELAY} testjob.slurm | cut -d" " -f4`
    echo "submitted $JOBID with $DELAY minute delayed start from $OJOBID "
done
Here is what the jobs look like when submitting using the above example script:
[gbauer@gh-login01 depend]$ ./start
submitted 2267583
submitted 2267584 with 5 minute delayed start from 2267583
submitted 2267585 with 5 minute delayed start from 2267584
submitted 2267586 with 5 minute delayed start from 2267585
submitted 2267587 with 5 minute delayed start from 2267586
After 5 minutes from the start of the first job, the next job starts, and so on.
[gbauer@gh-login01 depend]$ squeue -u gbauer
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2267587 cpu-inter testjob. gbauer PD 0:00 1 (Dependency)
2267586 cpu-inter testjob. gbauer PD 0:00 1 (Dependency)
2267585 cpu-inter testjob. gbauer PD 0:00 1 (Dependency)
2267584 cpu-inter testjob. gbauer R 2:14 1 cn093
2267583 cpu-inter testjob. gbauer R 7:21 1 cn093
You can use the sacct command with a specific job number to see how the job was submitted and show the dependency.
[gbauer@gh-login01 depend]$ sacct --job=2267584 --format=submitline -P
SubmitLine
sbatch --dependency=after:2267583+5 testjob.slurm
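The two pieces the sample script relies on, extracting the job ID from sbatch output and building the dependency option, can be factored into small functions. This is a sketch with made-up helper names (jobid_from_sbatch, after_flag); it assumes sbatch prints its usual "Submitted batch job N" line.

```shell
#!/bin/bash
# Hypothetical helpers for chaining staggered submissions.

# Extract the numeric job ID from sbatch's "Submitted batch job N" line.
jobid_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Build the --dependency option that delays a job until DELAY minutes
# after the job PREV has started.
after_flag() {
    local prev=$1 delay=$2
    echo "--dependency=after:${prev}+${delay}"
}

echo "Submitted batch job 2267583" | jobid_from_sbatch   # prints 2267583
after_flag 2267583 5    # prints --dependency=after:2267583+5
```

As an alternative to parsing the message, sbatch --parsable prints only the job ID (followed by a semicolon and the cluster name when one is present).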
Monitoring a Node During a Job
You have SSH access to nodes in your running job(s). Most common Linux utilities are available on the compute nodes (free, strace, ps, and so on). Some basic monitoring tools are demonstrated in the example transcript below.
[arnoldg@gh-login03 python]$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1214412 ghx4 interact arnoldg R 8:14 1 gh045
[arnoldg@gh-login03 python]$ ssh gh045
Last login: Wed Dec 14 09:45:26 2022 from 141.142.144.42
[arnoldg@gh045 ~]$ nvidia-smi
[arnoldg@gh045 ~]$ module load nvitop
[arnoldg@gh045 ~]$ nvitop
[arnoldg@gh045 ~]$ top -u $USER
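For a single-node job, the node name needed for ssh can be pulled from squeue directly instead of being read off the table by eye. The sketch below wraps that in a made-up helper, job_nodes. For multi-node jobs squeue prints a compressed list such as gh[045-046], which would need expanding (for example with scontrol show hostnames) before use with ssh.

```shell
#!/bin/bash
# Hypothetical helper: print the node list of a job.
# -h drops the header, -o %N prints only the NODELIST column.
job_nodes() {
    squeue -h -j "$1" -o %N
}

# Typical use on a login node (requires Slurm and a running job):
#   ssh "$(job_nodes 1214412)" nvidia-smi
```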
Preempt Queue
Coming soon! A Preempt queue is planned for DeltaAI, but not currently implemented.
Sample Scripts
Serial Jobs (PyTorch GPU)
serial with gpu example script
#!/bin/bash
#SBATCH --mem=64g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --mem-bind=verbose,local
#SBATCH --cpus-per-task=16
#SBATCH --partition=ghx4
#SBATCH --time=00:20:00
#SBATCH --job-name=pytorch
#SBATCH --account=account_name
### GPU options ###
#SBATCH --gpus-per-node=1
#SBATCH --gpu-bind=verbose,closest
module load python/miniforge3_pytorch
module list
echo "job is starting on `hostname`"
time srun \
numactl --cpunodebind=0 --membind=0 \
python3 tensor_gpu.py
exit
TensorFlow NGC Container
tensorflow ngc container example script
#!/bin/bash
#SBATCH --mem=40g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --mem-bind=verbose,local
#SBATCH --cpus-per-task=64
#SBATCH --partition=ghx4
#SBATCH --time=00:20:00
#SBATCH --job-name=tfngc
#SBATCH --account=account_name
### GPU options ###
#SBATCH --gpus-per-node=1
#SBATCH --gpu-bind=verbose,closest
module list # job documentation and metadata
echo "job is starting on `hostname`"
# change the --bind to your preferences if you need access to data outside of $HOME
time srun \
numactl --cpunodebind=0 --membind=0 \
apptainer run --nv \
--bind /projects/bbka/slurm_test_scripts \
/sw/user/NGC_containers/tensorflow_23.09-tf2-py3.sif python3 \
cifar10gpu.py
PyTorch Multi-Node
pytorch multi-node example script
#!/bin/bash
#SBATCH --account=account_name
#SBATCH --job-name=multinode-example
#SBATCH --partition=ghx4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
#SBATCH --output=ddp_training_%j.log
#SBATCH --error=ddp_training_%j.err
# Setup variables for torchrun rdzv_endpoint
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=("${nodes[@]}")   # copy every hostname, not just the first element
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname -I | awk '{print $1}')
echo "Head node: $head_node"
echo "Head node IP: $head_node_ip"
export LOGLEVEL=INFO
module load python/miniforge3_pytorch/2.5.0
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=hsn
module load nccl # loads the nccl built with the AWS nccl plugin for Slingshot11
module list
echo "Job is starting on `hostname`"
time srun torchrun --nnodes ${SLURM_NNODES} \
--nproc_per_node ${SLURM_GPUS_PER_NODE} \
--rdzv_id $RANDOM --rdzv_backend c10d \
--rdzv_endpoint="$head_node_ip:29500" \
${SLURM_SUBMIT_DIR}/multinode.py 50 10
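In the script above, srun starts one torchrun per node and each torchrun starts one worker per GPU, so the total number of training processes is the product of the two. The sketch below computes that number; world_size is a made-up helper, and the fallback values (2 nodes, 4 GPUs per node) are taken from the sample script so the snippet also runs outside a job. It assumes --gpus-per-node was given as a plain count.

```shell
#!/bin/bash
# Hypothetical helper: total number of training processes torchrun
# starts across the job (nodes x processes per node).
world_size() {
    local nnodes=$1 per_node=$2
    echo $(( nnodes * per_node ))
}

# Falls back to the values in the sample script (2 nodes, 4 GPUs per node)
# when run outside a Slurm job.
world_size "${SLURM_NNODES:-2}" "${SLURM_GPUS_PER_NODE:-4}"
```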
MPI
mpi example script
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --partition=ghx4
#SBATCH --time=00:50:00
#SBATCH --job-name=osu_bw
#SBATCH --account=account_name
#SBATCH --gpus-per-node=1
module list
export MSGSIZE=16777216
export ITERS=400
srun /sw/admin/osu-micro-benchmarks-7.4/c/mpi/pt2pt/standard/osu_bw -m $MSGSIZE:$MSGSIZE -i$ITERS
TensorFlow on CPUs
serial with cpu and threads example script
#!/bin/bash
#SBATCH --mem=100g
#SBATCH --nodes=1
#SBATCH --mem-bind=verbose,local
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=70
#SBATCH --partition=ghx4
#SBATCH --time=00:20:00
#SBATCH --job-name=tfcpu
#SBATCH --account=account_name
### GPU options ###
#SBATCH --gpus-per-node=1
module load python/miniforge3_tensorflow_cpu
module list
echo "job is starting on `hostname`"
time srun \
numactl --cpunodebind=0 --membind=0 \
python3 cifar10cpu.py