Running Jobs

Accessing the Compute Nodes

DeltaAI implements the Slurm batch environment to manage access to the compute nodes. Use Slurm commands to run batch jobs or for interactive access to compute nodes (an “interactive job”). See the Slurm quick start guide for an introduction to Slurm. There are multiple ways to access compute nodes on DeltaAI:

Batch scripts (sbatch) or interactive jobs (srun, salloc).
- sbatch: Use batch scripts for jobs that are debugged, ready to run, and don’t require interaction. Go to Sample Scripts for sample Slurm batch job scripts. For mixed resource heterogeneous jobs, see the Slurm job support documentation. Slurm also supports job arrays for easy management of a set of similar jobs, see the Slurm job array documentation for more information.
- srun: srun will run a single command through Slurm on a compute node. srun blocks; it will wait until Slurm has scheduled compute resources, and when it returns, the job is complete. srun can be used to launch a shell to get interactive access to a compute node(s), this is an “interactive job”. The one thing you can’t do in an interactive job created by srun is run srun commands; if you want to do that, use salloc.
- salloc: Also interactive, use salloc when you want to reserve compute resources for a period of time and interact with them using multiple commands. Each command you type after your salloc session begins will run on the login node if it is just a normal command, or on your reserved compute resources if prefixed with srun. Type exit when finished with a salloc allocation if you want to end it before the time expires.
Open OnDemand provides compute node access via JupyterLab (VSCode Code Server and the noVNC Desktop virtual desktop coming soon!).

Direct ssh access to a compute node in a running job is enabled once the job has started. See also, Monitoring a Node During a Job. In the following example, JobID 12345 is running on node gh001.

$ squeue --job jobid
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             12345       cpu     bash   gbauer  R       0:17      1 gh001

Then in a terminal session:

$ ssh gh001
gh001.delta.internal.ncsa.edu (172.28.22.64)
  Site: mgmt  Role: compute
$

Partitions (Queues)

You can use sinfo -s to see which partitions are currently available.

DeltaAI Partitions/Queues
Partition/Queue	Node Type	Max Nodes per job	Max Duration	Max Running in Queue/user	Charge Factor
ghx4*	GPU	TBD	48 hr	TBD	1.0
ghx4-interactive	GPU	4	2 hr	TBD	2.0

Default Partition Values

DeltaAI Default Partition Values
Property	Value
Memory per core	1000 MB
Wall-clock time	30 minutes

sview

Use sview for a GUI of the partitions. See the Slurm - sview documentation for more information.

Job and Node Policies

The default job requeue or restart policy is set to not allow jobs to be automatically requeued or restarted. To enable automatic requeue and restart of a job by Slurm, add the following Slurm option:
```
--requeue
```
When a job is requeued due to an event like a node failure, the batch script is initiated from its beginning. Job scripts need to be written to handle automatically restarting from checkpoints.
Node-sharing is the default for jobs. Node-exclusive mode can be set by specifying all the consumable resources for that node type or adding the following Slurm options:
```
--exclusive --mem=0
```

Batch Jobs

Batch jobs are submitted through a job script using the sbatch command. Job scripts generally start with a series of Slurm directives that describe requirements of the job, such as number of nodes and wall-clock time required, to the batch system/scheduler. The rest of the batch script consists of user commands. See Sample Scripts for example batch job scripts.

sbatch

Slurm directives can also be specified as options on the sbatch command line. Command line options take precedence over options in the job script.

The syntax for sbatch is sbatch [list of sbatch options] script_name. Refer to the sbatch man page for detailed information on the options.

$ sbatch tensorflow_cpu.slurm
Submitted batch job 2337924
$ squeue -u $USER
          JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        2337924 ghx4         tfcpu  mylogin  R       0:46      1 gh006

Useful Batch Job Environment Variables

Useful Batch Job Environment Variables
Description	Slurm Environment Variable	Detailed Description
Array JobID	$SLURM_ARRAY_JOB_ID $SLURM_ARRAY_TASK_ID	Each member of a job array is assigned a unique identifier.
Job Submission Directory	$SLURM_SUBMIT_DIR	By default, jobs start in the directory that the job was submitted from. So the `cd $SLURM_SUBMIT_DIR` command is not needed.
JobID	$SLURM_JOB_ID	Job identifier assigned to the job.
Machine (node) list	$SLURM_NODELIST	Variable name that contains the list of nodes assigned to the batch job.

See the sbatch man page for additional environment variables available.

Interactive Jobs

Interactive jobs can be implemented in several ways, depending on what is needed. The following examples start up a bash shell terminal on a CPU or GPU node. (Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)

Single core with 16GB of memory, with one task on a CPU node.

srun --account=account_name --partition=ghx4 \
  --nodes=1 --tasks=1 --tasks-per-node=1 \
  --cpus-per-task=8 --mem=16g \
  --gpus-per-node=1 \
  --pty bash

Single core with 20GB of memory, with one task.

srun --account=account_name --partition=ghx4 \
  --nodes=1 --gpus-per-node=1 --tasks=1 \
  --tasks-per-node=1 --cpus-per-task=8 --mem=20g \
  --pty bash

srun

The srun command initiates an interactive job or process on compute nodes. For example, the following command will run an interactive job in the ghx4 partition with a wall-clock time limit of 30 minutes, using one node and 16 cores per node and 1 GPU. (Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)

srun -A account_name --time=00:30:00 --nodes=1 --ntasks-per-node=16 \
--partition=ghx4 --gpus=1 --mem=16g --pty /bin/bash

After entering the command, wait for Slurm to start the job. As with any job, an interactive job is queued until the specified number of nodes is available. Specifying a small number of nodes for smaller amounts of time should shorten the wait time because such jobs will backfill among larger jobs. You will see something like this:

$ srun --mem=16g --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 \
--partition=ghx4 --account=account_name \
--gpus-per-node=1 --time=00:30:00 --x11 --pty /bin/bash
[login_name@gh022 bin]$                                               #<-- note the compute node name in the shell prompt
[login_name@gh022 bin]$ echo $SLURM_JOB_ID
2337913
[login_name@gh022 ~]$ c/a.out 500
count=500
sum= 0.516221
[login_name@gh022 ~]$ exit
exit
$

When you’re finished, use the exit command to end the bash shell on the compute resource and therefore the Slurm srun job.

salloc

While being interactive like srun, salloc allocates compute resources for you, while leaving your shell on the login node. Run commands on the login node as usual, use exit to end a salloc session early, and use srun with no extra flags to launch processes on the compute resources. (Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)

$ salloc --mem=16g --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 \
  --partition=ghx4 \
  --account=account_name --time=00:30:00 --gpus-per-node=1
salloc: Pending job allocation 2323230
salloc: job 2323230 queued and waiting for resources
salloc: job 2323230 has been allocated resources
salloc: Granted job allocation 2323230
salloc: Waiting for resource configuration
salloc: Nodes gh073 are ready for job
$ hostname #<-- on the login node
gh-login03.delta.ncsa.illinois.edu
$ srun bandwidthTest --htod #<-- on the compute resource, honoring your salloc settings
CUDA Bandwidth Test - Starting...
Running on...

Device 0: NVIDIA H100
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes)        Bandwidth(GB/s)
32000000                     24.5

Result = PASS
$ exit
salloc: Relinquishing job allocation 2323230

MPI Interactive Jobs: Use salloc Followed by srun

Interactive jobs are already a child process of srun, therefore, you cannot srun (or mpirun) applications from within them. Within standard batch jobs submitted via sbatch, use srun to launch MPI codes. For true interactive MPI, use salloc in place of srun, then srun my_mpi.exe after you get a prompt from salloc (exit to end the salloc interactive allocation).

Interactive MPI, salloc and srun (click to expand/collapse)

(Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)

[arnoldg@gh-login01 collective]$ cat osu_reduce.salloc
salloc --account=account_name --partition=ghx4 \
  --nodes=2 --tasks-per-node=4 \
  --cpus-per-task=2 --mem=20g

[arnoldg@gh-login01 collective]$ ./osu_reduce.salloc
salloc: Pending job allocation 1180009
salloc: job 1180009 queued and waiting for resources
salloc: job 1180009 has been allocated resources
salloc: Granted job allocation 1180009
salloc: Waiting for resource configuration
salloc: Nodes cn[009-010] are ready for job
[arnoldg@gh-login01 collective]$ srun osu_reduce

# OSU MPI Reduce Latency Test v5.9
# Size       Avg Latency(us)
4                       1.76
8                       1.70
16                      1.72
32                      1.80
64                      2.06
128                     2.00
256                     2.29
512                     2.39
1024                    2.66
2048                    3.29
4096                    4.24
8192                    2.36
16384                   3.91
32768                   6.37
65536                  10.49
131072                 26.84
262144                198.38
524288                342.45
1048576               687.78
[arnoldg@gh-login01 collective]$ exit
exit
salloc: Relinquishing job allocation 1180009
[arnoldg@gh-login01 collective]$

Interactive X11 Support

To run an X11-based application on a compute node in an interactive session, use of the --x11 switch with srun is needed. For example, to run a single core job that uses 1G of memory with X11 (in this case an xterm) do the following. (Replace account_name with one of your available accounts; these are listed under “Project” when you run the accounts command.)

srun -A account_name  --partition=ghx4 \
  --nodes=1 --tasks=1 --tasks-per-node=1 \
  --cpus-per-task=8 --mem=16g \
  --x11  xterm

File System Dependency Specification for Jobs

NCSA requests that jobs specify the file system or systems being used to enable response to resource availability issues. All jobs are assumed to depend on the HOME file system. Jobs that do not specify a dependency on PROJECTS (/projects) and WORK (/work) will be assumed to depend only on the HOME (/u) file system.

Slurm Feature/Constraint Labels
File System	Feature/Constraint Label
PROJECTS (`/projects`)	projects
WORK - HDD (`/work/hdd`)	workhdd [1]
WORK - NVME (`/work/nvme`)	worknvme [1]
TAIGA (`/taiga`)	taiga

The Slurm constraint specifier and Slurm Feature attribute for jobs are used to add file system dependencies to a job.

Slurm Feature Specification

For already submitted and pending (PD) jobs, use the Slurm Feature attribute as follows:

$ scontrol update job=JOBID Features="feature1&feature2"

For example, to add the projects Feature to an already submitted job:

$ scontrol update job=713210 Features="projects"

To verify the setting:

$ scontrol show job 713210 | grep Feature
   Features=projects DelayBoot=00:00:00

Slurm Constraint Specification

To add Slurm job constraint attributes when submitting a job with sbatch (or with srun as a command line argument) use:

#SBATCH --constraint="constraint1&constraint2.."

For example, to add a projects constraint when submitting a job:

#SBATCH --constraint="projects"

To verify the setting:

$ scontrol show job 713267 | grep Feature
   Features=projects DelayBoot=00:00:00

Job Management

squeue/scontrol/sinfo

The squeue, scontrol, and sinfo commands display batch job and partition information. The following table has a list of common commands, see the man pages for other available options.

In squeue results, if the NODELIST(REASON) for a job is MaxGRESPerAccount, the user has exceeded the number of cores or GPUs allotted per user or project for a given partition.

Common squeue, scontrol, and sinfo Commands
Slurm Command	Description
squeue -a	Lists the status of all jobs on the system.
squeue -u $USER	Lists the status of all your jobs in the batch system. Replace `$USER` with your username.
squeue -j JobID	Lists nodes allocated to a running job in addition to basic information. Replace `JobID` with the JobID of interest.
scontrol show job JobID	Lists detailed information on a particular job. Replace `JobID` with the JobID of interest.
sinfo -a	Lists summary information on all the partition.

scancel

The scancel command deletes a queued job or terminates a running job. The following example deletes/terminates the job with the associated JobID.

scancel JobID

Using Job Dependency to Stagger Job Starts

When submitting multiple jobs, consider using --dependency to prevent all of the jobs from starting at the same time. Staggering the job startup resource load prevents system slowdowns. This is especially recommended for Python users because multiple jobs that load Python on startup can slow down the system if they are all started at the same time.

From the --dependency man page:

-d, --dependency=<dependency_list>

                 after:job_id[[+time][:jobid[+time]...]]

After the specified jobs start or are cancelled and 'time' in minutes from job start or cancellation happens,
this job can begin  execution. If  no 'time' is given then there is no delay after start or cancellation.

The following sample script staggers the start of five jobs by 5 minutes each. You can use this script as a template and modify it to the number of jobs you have. The minimum recommended delay time is 3 minutes; 5 minutes is a more conservative choice.

Sample script that automates the delay dependency (click to expand/collapse)

[gbauer@gh-login01 depend]$ cat start
#!/bin/bash

# this is the time in minutes to have Slurm wait before starting the next job after the previous one started.

export DELAY=5   # in minutes

# submit first job and grab jobid
JOBID=`sbatch testjob.slurm | cut -d" " -f4`
echo "submitted $JOBID"

# loop 4 times submitting a job depending on the previous job to start
for count in `seq 1 4`; do

OJOBID=$JOBID

JOBID=`sbatch --dependency=after:${OJOBID}+${DELAY} testjob.slurm | cut -d" " -f4`

echo "submitted $JOBID with $DELAY minute delayed start from $OJOBID "

done

Here is what the jobs look like when submitting using the above example script:

[gbauer@gh-login01 depend]$ ./start
submitted 2267583
submitted 2267584 with 5 minute delayed start from 2267583
submitted 2267585 with 5 minute delayed start from 2267584
submitted 2267586 with 5 minute delayed start from 2267585
submitted 2267587 with 5 minute delayed start from 2267586

After 5 minutes from the start of the first job, the next job starts, and so on.

[gbauer@gh-login01 depend]$ squeue -u gbauer
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       2267587 cpu-inter testjob.   gbauer PD       0:00      1 (Dependency)
       2267586 cpu-inter testjob.   gbauer PD       0:00      1 (Dependency)
       2267585 cpu-inter testjob.   gbauer PD       0:00      1 (Dependency)
       2267584 cpu-inter testjob.   gbauer  R       2:14      1 cn093
       2267583 cpu-inter testjob.   gbauer  R       7:21      1 cn093

You can use the sacct command with a specific job number to see how the job was submitted and show the dependency.

[gbauer@gh-login01 depend]$ sacct --job=2267584 --format=submitline -P
SubmitLine
sbatch --dependency=after:2267583+5 testjob.slurm

Monitoring a Node During a Job

You have SSH access to nodes in your running job(s). Some of the basic monitoring tools are demonstrated in the example transcript below. Most common Linux utilities are available from the compute nodes (free, strace, ps, and so on). Example commands are shown below.

[arnoldg@gh-login03 python]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1214412 ghx4         interact  arnoldg  R       8:14      1 gh045
[arnoldg@gh-login03 python]$ ssh gh045
Last login: Wed Dec 14 09:45:26 2028 from 141.142.144.42
[arnoldg@gh045 ~]$ nvidia-smi

[arnoldg@gh045 ~]$ module load nvitop
[arnoldg@gh045 ~]$ nvitop

[arnoldg@gh045 ~]$ top -u $USER

Preempt Queue

Coming soon! A Preempt queue is planned for DeltaAI, but not currently implemented.

Sample Scripts

Serial Jobs (PyTorch GPU)

serial with gpu example script (click to expand/collapse)

#!/bin/bash
#SBATCH --mem=64g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=16
#SBATCH --partition=ghx4
#SBATCH --time=00:20:00
#SBATCH --job-name=pytorch
#SBATCH --account=account_name
### GPU options ###
#SBATCH --gpus-per-node=1
#SBATCH --gpu-bind=verbose,closest

module load python/miniforge3_pytorch
module list
echo "job is starting on `hostname`"

time srun \
  python3 tensor_gpu.py

exit

TensorFlow NGC Container

tensorflow ngc container example script (click to expand/collapse)

#!/bin/bash
#SBATCH --mem=40g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --mem-bind=verbose,local
#SBATCH --cpus-per-task=64
#SBATCH --partition=ghx4
#SBATCH --time=00:20:00
#SBATCH --job-name=tfngc
#SBATCH --account=account_name
### GPU options ###
#SBATCH --gpus-per-node=1
#SBATCH --gpu-bind=verbose,closest

module list  # job documentation and metadata

echo "job is starting on `hostname`"
# change the --bind to your preferences if you need access to data outside of $HOME
time srun \
 apptainer run --nv \
 --bind /projects/bbka/slurm_test_scripts \
 /sw/user/NGC_containers/tensorflow_23.09-tf2-py3.sif python3 \
 cifar10gpu.py

PyTorch Multi-Node

pytorch multi-node example script (click to expand/collapse)

#!/bin/bash

#SBATCH --account=account_name
#SBATCH --job-name=multinode-example
#SBATCH --partition=ghx4
#SBATCH --mem=64g
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
#SBATCH --output=ddp_training_%j.log
#SBATCH --error=ddp_training_%j.err

# Setup variables for torchrun rdzv_endpoint
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname -I | awk '{print $1}')
echo "Head node: $head_node"
echo "Head node IP: $head_node_ip"

export LOGLEVEL=INFO

module load python/miniforge3_pytorch/2.5.0
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=hsn
module load nccl # loads the nccl built with the AWS nccl plugin for Slingshot11
module list
echo "Job is starting on `hostname`"


time srun torchrun --nnodes ${SLURM_NNODES} \
      --nproc_per_node ${SLURM_GPUS_PER_NODE} \
      --rdzv_id $RANDOM --rdzv_backend c10d \
      --rdzv_endpoint="$head_node_ip:29500" \
      ${SLURM_SUBMIT_DIR}/multinode.py 50 10

vmtouch module to manage filesystem memory caching

vmtouch example script (click to expand/collapse)

#!/bin/bash
#SBATCH --mem=30g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 \
#SBATCH --cpus-per-task=16 \
#SBATCH --time=00:25:00 \
#SBATCH --job-name=testing \
#SBATCH --account=bbka-dtai-gh \
#SBATCH --gpus-per-node=1 \
#SBATCH --gpu-bind=verbose,closest

mkdir -p /tmp/include
cd /tmp/include
cp /work/hdd/bbka/arnoldg/include.tar .
tar xf include.tar
# load the vmtouch module, then evict the contents of the current directory and the cp from cache
module load vmtouch
vmtouch -ve /work/hdd/bbka/arnoldg/include.tar
vmtouch -ve ./

# proceed with some processing here...
date
sleep 5s

MPI

mpi example script (click to expand/collapse)

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --mem=32g
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --partition=ghx4
#SBATCH --time=00:50:00
#SBATCH --job-name=osu_bw
#SBATCH --account=account_name
#SBATCH --gpus-per-node=1

module list

export MSGSIZE=16777216
export ITERS=400

srun /sw/admin/osu-micro-benchmarks-7.4/c/mpi/pt2pt/standard/osu_bw -m $MSGSIZE:$MSGSIZE -i$ITERS

TensorFlow on GPUs

serial with gpu and threads example script (click to expand/collapse)

#!/bin/bash
#SBATCH --mem=80g
#SBATCH --nodes=1
#SBATCH --mem-bind=verbose,local
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=64
#SBATCH --partition=ghx4
#SBATCH --time=00:20:00
#SBATCH --job-name=tfcpu
#SBATCH --account=bbkg-dtai-gh
### GPU options ###
#SBATCH --gpus-per-node=1

module load python/miniforge3_tensorflow_cuda
module list
echo "job is starting on `hostname`"

time srun \
  python3 cifar10gpu.py

OpenMP

openmp example script (click to expand/collapse)

coming soon