Running Jobs
Accounting
The charge unit for TGI RAILS is the Service Unit (SU). One SU corresponds to the use of one compute core, with its proportional share of the node's memory, for one hour. Keep in mind that charges are based on the resources reserved for your job and do not necessarily reflect how those resources are actually used. Charges are based on either the number of cores or the fraction of the node memory requested, whichever is larger.
| Node Type | Service Unit Equivalence | GPUs | Host Memory |
|---|---|---|---|
| CPU core | 1 | N/A | 5.3 GB |
| CPU Node | 96 | N/A | 512 GB |
| GPU Node: One H100 | 120 | 1 H100 | 256 GB |
| GPU Node: 8-way H100 | 960 | 8 H100 | 2048 GB |
Local Account Charging
Use the accounts command to list the accounts available for charging. Users start out with a single chargeable project but may be added to other projects over time.
[cmendes@railsl2 ~]$ accounts
Project Summary for User cmendes:
Project Description Balance Deposited Usage (SUs)
------------- ------------- --------- ----------- -------------
bcfz-tgirails tgi 988887 1010000 21113
Job Accounting Considerations
A node-exclusive job that runs on a CPU compute node for one hour is charged 96 SUs (96 cores x 1 SU per core-hour x 1 hour).
A node-exclusive job that runs on an 8-way GPU node for one hour is charged 960 SUs (8 GPUs x 120 SUs per GPU-hour x 1 hour).
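As an illustration of the "whichever is larger" rule (hypothetical request sizes): a job that requests 4 cores but 200 GB of memory on a CPU node is charged according to the memory fraction, roughly (200 / 512) x 96 ≈ 37.5 SUs per hour, rather than 4 SUs per hour.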
QOSGrpBillingMinutes
If you see QOSGrpBillingMinutes under the Reason column for the squeue command, as in
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1204221 cpu myjob .... PD 0:00 5 (QOSGrpBillingMinutes)
then the resource allocation specified for the job (e.g., xyzt-tgirails) does not have a sufficient balance to run the job, given the resources requested and the wallclock time. It may also be that other jobs from the same project, already counted against the same QOSGrpBillingMinutes limit, are consuming the allocation and preventing a job that would otherwise "fit" from running. The PI of the project needs to submit a supplement request through the XRAS proposal system.
Reviewing job charges for a project (jobcharge)
jobcharge in /sw/user/scripts/ will show job charges by user for a project. Example usage:
[cmendes@railsl1 ~]$ jobcharge -d 180 bcfz-tgirails
Charges for account bcfz-tgirails from 2024-09-11-10:20:37 through 2025-03-10-10:20:37.
User Charge (SU)
-------- -------------
abbode2 257.39
arnoldg 6.56
cmendes 45.9
jenos 724.81
mshow 343.19
rbrunner 245.53
Total 1623.38
[cmendes@railsl1 ~]$ jobcharge -h
usage: jobcharge [-h] [-s STARTTIME] [-e ENDTIME] [--detail] [-m MONTH] [-y YEAR] [-d DAYSBACK] [-a] account
positional arguments:
account Name of the account to get jobcharges for. "accounts" command can be used to list your valid accounts.
options:
-h, --help show this help message and exit
-s STARTTIME, --starttime STARTTIME
Get jobcharges after this time. Default is one month before the current time.
-e ENDTIME, --endtime ENDTIME
Get jobcharges before this time. Default is the current time.
--detail, --details detail output, per-job
-m MONTH, --month MONTH
Get jobcharges for a specific month (1-12). Will override start end time arguments
-y YEAR, --year YEAR Get jobcharges for a specific year. Will override start end time arguments
-d DAYSBACK, --daysback DAYSBACK
Get jobcharges from the previous N days. Will take precedence over all other time search arguments
-a, --all Display any job that occured during the queried time regardless of endtime.
Performance tools
NVIDIA Nsight Systems is available for profiling; see the NVIDIA Nsight Systems documentation for details.
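A minimal sketch of profiling a code with Nsight Systems inside a batch job; the module name, report name, and executable are assumptions, so verify them with module avail and the Nsight Systems documentation:

module load nsight-systems                    # assumed module name; verify with `module avail`
srun -n 1 nsys profile -o myprofile ./a.out   # writes a report (e.g., myprofile.nsys-rep) for the Nsight Systems GUI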
Sample Scripts
Serial jobs on CPU nodes
$ cat job.slurm
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4      # <- match to OMP_NUM_THREADS
#SBATCH --partition=cpu        # <- or one of: gpuA100x4 gpuH100x8
#SBATCH --account=account_name
#SBATCH --job-name=myjobtest
#SBATCH --time=00:10:00        # hh:mm:ss for the job
### GPU options ###
##SBATCH --gpus-per-node=2
##SBATCH --gpu-bind=none       # <- or closest
##SBATCH [email protected]
##SBATCH --mail-type="BEGIN,END"
# See sbatch or srun man pages for more email options

module reset  # drop modules and explicitly load the ones needed
              # (good job metadata and reproducibility)
              # $WORK and $SCRATCH are now set
module load python  # ... or any appropriate modules
module list  # job documentation and metadata
echo "job is starting on `hostname`"

srun python3 myprog.py
MPI on CPU nodes
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=2      # <- match to OMP_NUM_THREADS
#SBATCH --partition=cpu        # <- or one of: gpuA100x4 gpuH100x8
#SBATCH --account=account_name
#SBATCH --job-name=mympi
#SBATCH --time=00:10:00        # hh:mm:ss for the job
### GPU options ###
##SBATCH --gpus-per-node=2
##SBATCH --gpu-bind=none       # <- or closest
##SBATCH [email protected]
##SBATCH --mail-type="BEGIN,END"
# See sbatch or srun man pages for more email options

module reset  # drop modules and explicitly load the ones needed
              # (good job metadata and reproducibility)
              # $WORK and $SCRATCH are now set
module load gcc/11.2.0 openmpi  # ... or any appropriate modules
module list  # job documentation and metadata
echo "job is starting on `hostname`"

srun osu_reduce
OpenMP on CPU nodes
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32     # <- match to OMP_NUM_THREADS
#SBATCH --partition=cpu        # <- or one of: gpuA100x4 gpuH100x8
#SBATCH --account=account_name
#SBATCH --job-name=myopenmp
#SBATCH --time=00:10:00        # hh:mm:ss for the job
### GPU options ###
##SBATCH --gpus-per-node=2
##SBATCH --gpu-bind=none       # <- or closest
##SBATCH [email protected]
##SBATCH --mail-type="BEGIN,END"
# See sbatch or srun man pages for more email options

module reset  # drop modules and explicitly load the ones needed
              # (good job metadata and reproducibility)
              # $WORK and $SCRATCH are now set
module load gcc/11.2.0  # ... or any appropriate modules
module list  # job documentation and metadata
echo "job is starting on `hostname`"

export OMP_NUM_THREADS=32
srun stream_gcc
Hybrid (MPI + OpenMP or MPI+X) on CPU nodes
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4      # <- match to OMP_NUM_THREADS
#SBATCH --partition=cpu        # <- or one of: gpuA100x4 gpuH100x8
#SBATCH --account=account_name
#SBATCH --job-name=mympi+x
#SBATCH --time=00:10:00        # hh:mm:ss for the job
### GPU options ###
##SBATCH --gpus-per-node=2
##SBATCH --gpu-bind=none       # <- or closest
##SBATCH [email protected]
##SBATCH --mail-type="BEGIN,END"
# See sbatch or srun man pages for more email options

module reset  # drop modules and explicitly load the ones needed
              # (good job metadata and reproducibility)
              # $WORK and $SCRATCH are now set
module load gcc/11.2.0 openmpi  # ... or any appropriate modules
module list  # job documentation and metadata
echo "job is starting on `hostname`"

export OMP_NUM_THREADS=4
srun xthi
Four GPUs together on a compute node
#!/bin/bash
#SBATCH --job-name="a.out_symmetric"
#SBATCH --output="a.out.%j.%N.out"
#SBATCH --partition=gpuH100x8
#SBATCH --mem=208G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4    # could be 1 for py-torch
#SBATCH --cpus-per-task=16     # spread out to use 1 core per numa, set to 64 if tasks is 1
#SBATCH --constraint="scratch"
#SBATCH --gpus-per-node=4
#SBATCH --gpu-bind=closest     # select a cpu close to gpu on pci bus topology
#SBATCH --account=bbjw-tgirails
#SBATCH --exclusive            # dedicated node for this job
#SBATCH --no-requeue
#SBATCH -t 04:00:00

export OMP_NUM_THREADS=1  # if code is not multithreaded, otherwise set to 8 or 16

srun -N 1 -n 4 ./a.out > myjob.out

# py-torch example, --ntasks-per-node=1 --cpus-per-task=64
# srun python3 multiple_gpu.py
Parametric / Array / HTC jobs
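Slurm job arrays cover the common parametric / high-throughput case of many similar, independent jobs. Below is a minimal sketch of an array job script; the account, program, and input file names are placeholders:

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --account=account_name   # <- use the `accounts` command to find yours
#SBATCH --job-name=myarray
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2g
#SBATCH --time=00:10:00          # hh:mm:ss per array task
#SBATCH --array=1-10             # submit 10 array tasks with indices 1..10

# each array task gets its own SLURM_ARRAY_TASK_ID and can select its own input
echo "array task $SLURM_ARRAY_TASK_ID of job $SLURM_ARRAY_JOB_ID on `hostname`"
srun python3 myprog.py input.$SLURM_ARRAY_TASK_ID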
Interactive Sessions
Interactive sessions can be implemented in several ways depending on what is needed.
To start a bash shell terminal on a CPU or GPU node:
One task with four cores and 16 GB of memory on a CPU node:
srun --account=<account_name> --partition=cpu \
--nodes=1 --tasks=1 --tasks-per-node=1 \
--cpus-per-task=4 --mem=16g --pty bash
One GPU with 20 GB of memory on a GPU node:
srun --account=<account_name> --partition=gpu \
--nodes=1 --gpus-per-node=1 --tasks=1 --tasks-per-node=16 \
--cpus-per-task=1 --mem=20g --pty bash
MPI interactive jobs: use salloc followed by srun
Since an interactive job started with srun is itself a job step, you cannot launch applications with srun (or mpirun) from within it. Within standard batch jobs submitted via sbatch, use srun to launch MPI codes. For true interactive MPI, use salloc in place of the srun commands shown above, then run "srun my_mpi.exe" once salloc returns a prompt (type exit to end the salloc interactive allocation).
[arnoldg@dt-login01 collective]$ cat osu_reduce.salloc
salloc --account=bbka-delta-cpu --partition=cpu-interactive \
--nodes=2 --tasks-per-node=4 \
--cpus-per-task=2 --mem=0
[arnoldg@dt-login01 collective]$ ./osu_reduce.salloc
salloc: Pending job allocation 1180009
salloc: job 1180009 queued and waiting for resources
salloc: job 1180009 has been allocated resources
salloc: Granted job allocation 1180009
salloc: Waiting for resource configuration
salloc: Nodes cn[009-010] are ready for job
[arnoldg@dt-login01 collective]$ srun osu_reduce
# OSU MPI Reduce Latency Test v5.9
# Size Avg Latency(us)
4 1.76
8 1.70
16 1.72
32 1.80
64 2.06
128 2.00
256 2.29
512 2.39
1024 2.66
2048 3.29
4096 4.24
8192 2.36
16384 3.91
32768 6.37
65536 10.49
131072 26.84
262144 198.38
524288 342.45
1048576 687.78
[arnoldg@dt-login01 collective]$ exit
exit
salloc: Relinquishing job allocation 1180009
[arnoldg@dt-login01 collective]$
Interactive X11 Support
To run an X11-based application on a compute node in an interactive session, use the --x11 switch with srun. For example, to run a single-task job that uses 16 GB of memory with X11 (in this case an xterm), do the following:
srun -A <account_name> --partition=cpu \
--nodes=1 --tasks=1 --tasks-per-node=1 \
--cpus-per-task=2 --mem=16g --x11 xterm
Job Management
Batch jobs are submitted through a job script (as in the examples
above) using the sbatch
command. Job scripts generally start with a
series of SLURM directives that describe requirements of the job (such
as number of nodes, wall time required, etc…) to the batch
system/scheduler (SLURM directives can also be specified as options on
the sbatch command line; command line options take precedence over those
in the script). The rest of the batch script consists of user commands.
The syntax for sbatch
is:
sbatch [list of sbatch options] script_name
Refer to the sbatch man page for detailed information on the options.
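For example (script name and option values are placeholders), command-line options override the corresponding directives inside the script:

sbatch --time=02:00:00 --account=account_name job.slurm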
Commands that display batch job and partition information:
| SLURM EXAMPLE COMMAND | DESCRIPTION |
|---|---|
| squeue -a | List the status of all jobs on the system. |
| squeue -u $USER | List the status of all your jobs in the batch system. |
| squeue -j JobID | List nodes allocated to a running job in addition to basic information. |
| scontrol show job JobID | List detailed information on a particular job. |
| sinfo -a | List summary information on all the partitions. |
See the manual (man) pages for other available options for the commands above.
srun
The srun command can be used to initiate an interactive job on compute nodes.
For example, the following command:
srun -A <account_name> --time=00:30:00 --nodes=1 --ntasks-per-node=16 \
--partition=gpu --gpus=1 --mem=16g --pty bash
will run an interactive job in the gpu partition with a wall-clock limit of 30 minutes, using one node with 16 tasks (cores) per node and one GPU. You can also use other sbatch options such as those documented above.
After you enter the command, you will have to wait for SLURM to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you specify a small number of nodes for smaller amounts of time, the wait should be shorter because your job will backfill among larger jobs.
Once the job starts, you will be presented with an interactive shell prompt on an allocated compute node. At this point, you can use the appropriate command to start your program. When you are done with your work, use the exit command to end the job.
scancel
The scancel command deletes a queued job or terminates a running job.
Usage: scancel JobID
Job Status
NODELIST(REASON)
MaxGRESPerAccount - the user or project has exceeded the number of cores or GPUs allotted per user or project for a given partition.
| DESCRIPTION | SLURM ENVIRONMENT VARIABLE | DETAIL DESCRIPTION |
|---|---|---|
| JobID | $SLURM_JOB_ID | Job identifier assigned to the job. |
| Job Submission Directory | $SLURM_SUBMIT_DIR | By default, jobs start in the directory from which the job was submitted, so the "cd $SLURM_SUBMIT_DIR" command is not needed. |
| Machine (node) list | $SLURM_NODELIST | Contains the list of nodes assigned to the batch job. |
| Array JobID | $SLURM_ARRAY_JOB_ID, $SLURM_ARRAY_TASK_ID | Each member of a job array is assigned a unique identifier. |
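A short sketch of using these variables inside a batch script; the echo lines are illustrative only:

echo "job $SLURM_JOB_ID was submitted from $SLURM_SUBMIT_DIR"
echo "running on nodes: $SLURM_NODELIST"
# the next line applies to job arrays only
echo "array job $SLURM_ARRAY_JOB_ID, task $SLURM_ARRAY_TASK_ID"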
See the sbatch man page for additional environment variables available.
Accessing the Compute Nodes
TGI RAILS implements the Slurm batch environment to manage access to the compute nodes. Use the Slurm commands to run batch jobs or for interactive access to compute nodes. See: Slurm Quick Start User Guide for an introduction to Slurm. There are two ways to access compute nodes on TGI RAILS.
Batch jobs can be used to access compute nodes. Slurm provides a convenient direct way to submit batch jobs. See Slurm Submitting Jobs for details. Slurm supports job arrays for easy management of a set of similar jobs, see: Slurm Job Array Support.
Sample Slurm batch job scripts are provided in the Sample Scripts section above.
Direct ssh access from a login node to a compute node in a running batch job is enabled once the job has started. First, identify the node(s) assigned to your job:
$ squeue -u <username>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
21130 cpu bash cmendes R 1:02 1 rails01
Then, in a terminal session on the login node, invoke ssh:
[cmendes@railsl1 ~]$ ssh rails01
[cmendes@rails01 ~]$ hostname
rails01
[cmendes@rails01 ~]$ <other_commands>
Scheduler
Partitions (Queues)
| Partition/Queue | Node Type | Max Nodes per Job | Max Duration | Max Running in Queue/user | Charge Factor |
|---|---|---|---|---|---|
| cpu | CPU | TBD | 7 days | no limit | 1.0 |
| gpu | GPU | TBD | 7 days | no limit | 1.5 |
Node Policies
Node-sharing is the default for jobs. Node-exclusive mode can be obtained by specifying all the consumable resources for that node type or adding the following Slurm options:
--exclusive --mem=0
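For example, a minimal sketch of the directives for a node-exclusive CPU job (the account name is a placeholder):

#SBATCH --partition=cpu
#SBATCH --account=account_name
#SBATCH --nodes=1
#SBATCH --exclusive   # dedicated node for this job
#SBATCH --mem=0       # request all of the memory on the node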
NVIDIA MIG (GPU slicing) on the H100 GPUs may be supported at a future date.
Preemptible jobs will be supported at a future date.
Job Policies
By default, jobs are not automatically requeued or restarted.
To enable automatic requeue and restart of a job by Slurm, add the following Slurm directive:
--requeue
When a job is requeued due to an event such as a node failure, the batch script is restarted from the beginning. Job scripts need to be written to handle restarting automatically from checkpoints, etc., as sketched below.
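A minimal sketch of a requeue-aware batch script body, assuming the application (here a placeholder ./a.out) can write a checkpoint file and resume from it with a --resume flag:

# resume from a checkpoint if an earlier run of this (requeued) job left one
if [ -f checkpoint.dat ]; then
    srun ./a.out --resume checkpoint.dat
else
    srun ./a.out
fi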
Refunds
Refunds are considered, when appropriate, for jobs that failed due to circumstances beyond user control.
Projects wishing to request a refund should email help@ncsa.illinois.edu. Please include the batch job ids and the standard error and output files produced by the job(s).