Running Jobs

Accessing the Compute Nodes

Note

User processes running on the interactive (login/head) nodes are killed automatically if they accrue more than 30 minutes of CPU time or if more than 4 identical processes owned by the same user are running concurrently.

Access to the compute nodes for running jobs is available via a batch job. The Campus Cluster uses the Slurm Workload Manager for running batch jobs. See the Batch Commands section for details on batch job submission.

Please be aware that the interactive (login/head) nodes are a shared resource for all users of the system, and their use should be limited to editing, compiling, and building your programs, and to short, non-intensive runs.

An interactive batch job provides a way to get interactive access to a compute node via a batch job. See the srun section for information on how to run an interactive job on the compute nodes.

To ensure the health of the batch system and scheduler, you should refrain from having more than 1,000 batch jobs in the queues at any one time.

Running Programs

On successful building (compilation and linking) of your program, an executable is created that is used to run the program. The table below describes how to run different types of programs.

How to run different types of programs

Serial
    How to run: Specify the name of the executable.
    Example command:
    ./a.out

MPI
    How to run: MPI programs are run with the srun command followed by the name of the executable. The total number of MPI processes is the {number of nodes} x {cores per node} set in the batch job resource specification.
    Example command:
    srun ./a.out

OpenMP
    How to run: The OMP_NUM_THREADS environment variable can be set to specify the number of threads used by OpenMP programs. If this variable is not set, the number of threads defaults to one under the Intel compiler; under GCC, the default is one thread for each core available on the node. To run OpenMP programs, specify the name of the executable.
    Example command:
      • In bash: export OMP_NUM_THREADS=16
      • In tcsh: setenv OMP_NUM_THREADS 16
    ./a.out

MPI/OpenMP
    How to run: As with OpenMP programs, the OMP_NUM_THREADS environment variable can be set to specify the number of threads used by the OpenMP portion of a mixed MPI/OpenMP program; the same default behavior applies with respect to the number of threads used. Use the srun command followed by the name of the executable to run mixed MPI/OpenMP programs. The number of MPI processes per node is set in the batch job resource specification for the number of cores per node.
    Example command:
      • In bash: export OMP_NUM_THREADS=4
      • In tcsh: setenv OMP_NUM_THREADS 4
    srun ./a.out
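
Putting these pieces together, the sketch below shows how a hybrid MPI/OpenMP run might look inside a batch script. The account, partition, executable name, and core counts are placeholders for illustration; the use of --cpus-per-task to reserve cores for the OpenMP threads is an assumption, so adjust the layout to your site's recommendations.

#!/bin/bash
#SBATCH --account=account_name      # <- replace "account_name" with an account available to you
#SBATCH --partition=partition_name  # <- replace "partition_name" with a partition available to you
#SBATCH --time=00:30:00
#SBATCH --nodes=2                   # 2 nodes
#SBATCH --ntasks-per-node=4         # 4 MPI processes per node (illustrative)
#SBATCH --cpus-per-task=4           # 4 cores reserved per MPI process for OpenMP threads (assumed)

export OMP_NUM_THREADS=4            # threads used by the OpenMP portion of each MPI process
srun ./a.out                        # srun launches the MPI processes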

Queues

Primary Queues

Each investor group has unrestricted access to a dedicated primary queue with concurrent access to the number and type of nodes in which they invested.

You can view the partitions (queues) that you can submit batch jobs to with the following command:

sinfo -s -o "%.25R %.12l %.12L %.5D"

You can also view specific configuration information about the compute nodes associated with your primary partition(s), with the following command (replace <partition_name> with the name of the partition):

sinfo -p <partition_name> -N -o "%.8N %.4c %.25P %.9m %.12l %.12L %G"

Secondary Queues

One of the advantages of the Campus Cluster Program is the ability to share resources. A shared secondary queue allows you to access idle nodes in the cluster.

Investors have full access to the number and type of nodes they have invested in. When resources are not being fully utilized by an investor, these resources are eligible to run secondary queue jobs. Secondary queue usage follows these guidelines:

  • You must have access to a primary queue to be eligible to use the secondary queue.

  • The secondary queue is the default queue for the Campus Cluster; if a queue is not specified, jobs are routed to the secondary queue.

  • The secondary queue uses fairshare scheduling.

  • If there are resources eligible to run secondary queue jobs but there are no pending jobs in the secondary queue, pending jobs in primary queues that fit within the constraints of the secondary queue may be run on any otherwise appropriate idle nodes.

Limits of the secondary queue

Queue          Max Walltime    Max # Nodes
secondary      4 hours         TBD

The secondary queue has a maximum wall time of 4 hours. Specify your primary queue using the --partition option to sbatch for access to longer batch job wall times. You can view the maximum wall time for all queues on the cluster with the following command.

sinfo -a --format="%.16R %.4D %.14l"

Move Queued Batch Job to Another Queue

To move a queued batch job from one queue to another, use scontrol update with the following syntax. Replace queue_name with the name of the queue that you want to move the job to.

scontrol update jobid=[JobID] partition=[queue_name]

Note that the operation will not be permitted if the requested resources do not fit within the limits of the destination queue.
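
For example, to move a hypothetical queued job 1234567 to the secondary queue:

scontrol update jobid=1234567 partition=secondary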

Batch Commands

Below are brief descriptions of the primary batch commands. For more detailed information, refer to the individual man pages.

sbatch

Batch jobs are submitted through a job script using the sbatch command. Job scripts generally start with a series of SLURM directives that describe the requirements of the job, such as the number of nodes and wall time required, to the batch system/scheduler. SLURM directives can also be specified as options on the sbatch command line; command-line options take precedence over those in the script. The rest of the batch script consists of user commands.

Sample batch scripts are available in the directory /sw/cc.users/slurm.

The syntax for sbatch is:

sbatch [list of sbatch options] script_name

The main sbatch options are listed below. See the sbatch man page for more options.

  • --account=account_name

    account_name is the name of an account available to you. If you don’t know the account(s) available to you, run the following command (shell script) to view a list of your batch account names:

    /sw/cc.users/tools/my.accounts

    You should match the appropriate batch account name with an appropriate partition (queue) to successfully submit a batch job. See Queues for information about partitions.

  • --partition=partition_name

    partition_name is the name of a partition (queue) available to you. You should match the appropriate batch account name with an appropriate partition (queue) to successfully submit a batch job. See Queues for information about partitions.

  • --time=time

    time is the maximum wall clock time (d-hh:mm:ss) [default: the maximum limit of the queue (partition) submitted to]

  • --nodes=n

    n is the number of 16/20/24/28/40/128-core nodes [default: 1 node]

  • --ntasks=p

    p is the total number of cores (tasks) for the batch job [default: 1 core]

  • --ntasks-per-node=p

    p is the number of cores (tasks) per node, equivalent to ppn under PBS (1 through 40) [default: 1 core]

Example:

--account=account_name      # <- replace "account_name" with an account available to you
--partition=partition_name  # <- replace "partition_name" with a partition available to you
--time=00:30:00
--nodes=2
--ntasks=32

or

--account=account_name      # <- replace "account_name" with an account available to you
--partition=partition_name  # <- replace "partition_name" with a partition available to you
--time=00:30:00
--nodes=2
--ntasks-per-node=16
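
Putting these options into a job script, a minimal sketch might look like the following (the account name, partition name, and executable are placeholders):

#!/bin/bash
#SBATCH --account=account_name      # <- replace "account_name" with an account available to you
#SBATCH --partition=partition_name  # <- replace "partition_name" with a partition available to you
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --job-name=myjob
#SBATCH --output=myjob.o%j          # %j expands to the job ID

cd ${SLURM_SUBMIT_DIR}
srun ./a.out

Submit the script with sbatch myjob.sbatch (the file name is arbitrary).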

Memory needs

If your investor group has nodes with varying amounts of memory, or if you are running in the secondary queue, you can target nodes with a specific amount of memory. The compute nodes have memory configurations of 64GB, 128GB, 192GB, 256GB, or 384GB. Not all memory configurations are available in all investor queues.

For a list of all the nodes you have access to, with information about CPUs and memory, execute:

sinfo -N -l

You can also check with the technical representative of your investor group to determine what memory configurations are available for the nodes in your primary queue.

Warning

Do not use the memory specification unless absolutely required, since it could delay scheduling of the job. Also, if nodes with the specified memory are unavailable in the specified queue, the job will never run.

Example:

--account=account_name    # <- replace "account_name" with an account available to you
--time=00:30:00
--nodes=2
--ntasks=32
--mem=118000

or

--account=account_name    # <- replace "account_name" with an account available to you
--time=00:30:00
--nodes=2
--ntasks-per-node=16
--mem-per-cpu=7375

Specifying nodes with GPUs

To run jobs on nodes with GPUs, add a GPU resource specification: TeslaM2090 (for Tesla M2090), TeslaK40M (for Tesla K40M), K80 (for Tesla K80), P100 (for Tesla P100), V100 (for Tesla V100), TeslaT4 (for Tesla T4), or A40 (for NVIDIA A40). This is needed if your primary queue has nodes with multiple types of GPUs or a mix of nodes with and without GPUs, or if you are submitting jobs to the secondary queue. Through the secondary queue, any user can access the nodes configured with any of these GPUs.

Example:

--gres=gpu:V100

or

--gres=gpu:V100:2

to specify two V100 GPUs (default is 1 if no number is specified after the gpu type).

Note

Requesting more GPUs than what is available on a single compute node will result in a failed batch job submission.

To determine if GPUs are available on any of the compute nodes in your group’s partition(queue), run the below command (replace <partition_name> with the name of the partition) or check with the technical representative of your investor group.

sinfo -p <partition_name> -N -o "%.8N %.4c %.16G %.25P %50f"
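
As a sketch, a job script requesting a single GPU might combine these options as follows; the partition, GPU type, executable, and the cuda module name are assumptions for illustration, so check which GPU types and modules are actually available to you first.

#!/bin/bash
#SBATCH --account=account_name    # <- replace "account_name" with an account available to you
#SBATCH --partition=secondary
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:V100           # one V100 GPU; use --gres=gpu:V100:2 for two

module load cuda                  # assumed module name; run "module avail" to see what is installed
nvidia-smi                        # show the GPU(s) allocated to the job
./gpu_program                     # placeholder executable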

Useful Batch Job Environment Variables

Useful batch job environment variables

JobID
    SLURM environment variable: $SLURM_JOB_ID
    Description: Job identifier assigned to the job.
    PBS environment variable (no longer valid): $PBS_JOBID

Job Submission Directory
    SLURM environment variable: $SLURM_SUBMIT_DIR
    Description: By default, jobs start in the directory the job was submitted from, so the cd $SLURM_SUBMIT_DIR command is not needed.
    PBS environment variable (no longer valid): $PBS_O_WORKDIR

Machine (node) list
    SLURM environment variable: $SLURM_NODELIST
    Description: Contains the list of nodes assigned to the batch job.
    PBS environment variable (no longer valid): $PBS_NODEFILE

Array JobID
    SLURM environment variables: $SLURM_ARRAY_JOB_ID and $SLURM_ARRAY_TASK_ID
    Description: Each member of a job array is assigned a unique identifier (see the Job Arrays section).
    PBS environment variable (no longer valid): $PBS_ARRAYID

See the sbatch man page for additional environment variables available.
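
As a small illustration of how these variables might be used inside a job script (the executable and output file names are placeholders):

#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --account=account_name           # <- replace "account_name" with an account available to you

echo "Job ${SLURM_JOB_ID} submitted from ${SLURM_SUBMIT_DIR}"
echo "Nodes assigned to this job: ${SLURM_NODELIST}"
srun ./a.out > run.${SLURM_JOB_ID}.out   # tag the output file with the job ID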

srun

The srun command initiates an interactive job on the compute nodes.

For example, the following command will run an interactive job in the “ncsa” queue with a wall clock limit of 30 minutes, using one node and 16 cores per node. The compute time will be charged to the “account_name” account.

[cc-login1 ~]$ srun -A account_name --partition=ncsa --time=00:30:00 --nodes=1 --ntasks-per-node=16 --pty /bin/bash

You can also use other sbatch options such as those documented above.

After you enter the command, you will have to wait for SLURM to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you specify a small number of nodes for smaller amounts of time, the wait should be shorter because your job will backfill among larger jobs. You will see something like this:

srun: job 123456 queued and waiting for resources

Once the job starts, you will see the following message and be presented with an interactive shell prompt on the launch node:

srun: job 123456 has been allocated resources

At this point, you can use the appropriate command to start your program.

When you are done with your runs, you can use the exit command to end the job.
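
A typical interactive session might then look like the sketch below; the directory, module, and program names are placeholders for whatever you actually need to run.

cd ~/my_project          # placeholder working directory
module load anaconda3    # assumed module name; run "module avail" to see what is installed
python my_script.py      # placeholder program
exit                     # end the interactive job and release the node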

squeue

Commands that display the status of batch jobs

squeue -a
    List the status of all jobs on the system.

squeue -u $USER
    List the status of all your jobs in the batch system.

squeue -j JobID
    List nodes allocated to a running job in addition to basic information.

scontrol show job JobID
    List detailed information on a particular job.

sinfo -a
    List summary information on all the queues.

See the man page for other options available.

scancel

The scancel command deletes a queued job or kills a running job.

scancel JobID deletes/kills a job.
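
A few common variations (the job IDs below are hypothetical):

scancel 2468013          # delete/kill a single job
scancel 2468013_5        # cancel only task 5 of job array 2468013
scancel -u $USER         # cancel all of your queued and running jobs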

Job Dependencies

SLURM job dependencies allow you to set the order in which your queued jobs run. Job dependencies are set by using the --dependency option, with the syntax --dependency=<dependency type>:<JobID>. SLURM places the dependent job in a Hold state until it is eligible to run.

The following are examples on how to specify job dependencies using the afterany dependency type, which indicates to SLURM that the dependent job should become eligible to start only after the specified job has completed.

On the command line:

[cc-login1 ~]$ sbatch --dependency=afterany:<JobID> jobscript.sbatch

In a job script:

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --account=account_name    # <- replace "account_name" with an account available to you
#SBATCH --job-name="myjob"
#SBATCH --partition=secondary
#SBATCH --output=myjob.o%j
#SBATCH --dependency=afterany:<JobID>

In a shell script that submits batch jobs:

#!/bin/bash
JOB_01=`sbatch jobscript1.sbatch |cut -f 4 -d " "`
JOB_02=`sbatch --dependency=afterany:$JOB_01 jobscript2.sbatch |cut -f 4 -d " "`
JOB_03=`sbatch --dependency=afterany:$JOB_02 jobscript3.sbatch |cut -f 4 -d " "`
...
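
If your Slurm version supports it, the --parsable option to sbatch prints just the job ID (optionally followed by the cluster name), which avoids parsing the output with cut; a sketch of the same script:

#!/bin/bash
JOB_01=$(sbatch --parsable jobscript1.sbatch)
JOB_02=$(sbatch --parsable --dependency=afterany:$JOB_01 jobscript2.sbatch)
JOB_03=$(sbatch --parsable --dependency=afterany:$JOB_02 jobscript3.sbatch)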

Generally, the recommended dependency types to use are after, afterany, afternotok, and afterok. While there are additional dependency types, those types that work based on batch job error codes may not behave as expected because of the difference between a batch job error and application errors. See the dependency section of the sbatch manual page for additional information (man sbatch).

Job Constraints

Use the --constraint option to specify required features for a job. Refer to the Slurm srun --constraint documentation for more details. (You can also find the same information in the Slurm sbatch documentation and Slurm salloc documentation.)

Features available on Campus Cluster include:

  • CPU type (AE7713, E2680V4, G2348, …)

  • GPU type (NoGPU, P100, K80, …)

  • Memory (64G, 128G, 256G, 512G, …)

  • Interconnect (E1G, E10G, FDR, HDR, …)

Run the sinfo command below to see a full list of features for nodes that are in queues that you can submit to:

sinfo -N --format="%R (%N): %f" -S %R | more
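
For example, a sketch requesting nodes that have both 128G of memory and an HDR interconnect, combining two of the feature names listed above (adjust to the features actually present in the queues you can use):

#SBATCH --constraint="128G&HDR"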

If a constraint cannot be satisfied, your job will not run and squeue will return BadConstraints; refer to the Slurm squeue documentation.

Job Arrays

If a need arises to submit the same job to the batch system multiple times, instead of issuing one sbatch command for each individual job, you can submit a job array. Job arrays allow you to submit multiple jobs with a single job script using the --array option to sbatch. An optional slot limit can be specified to limit the number of jobs that can run concurrently in the job array. See the sbatch manual page for details (man sbatch). The file names for the input, output, and so on, can be varied for each job using the job array index value defined by the SLURM environment variable SLURM_ARRAY_TASK_ID.

A sample batch script that makes use of job arrays is available in /sw/cc.users/slurm/jobarray.sbatch.
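
A minimal sketch of such a script is shown below; the input/output file naming and the array range are assumptions for illustration, so see the cluster-provided sample above for the supported version.

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --account=account_name        # <- replace "account_name" with an account available to you
#SBATCH --partition=secondary
#SBATCH --job-name=arrayjob
#SBATCH --output=arrayjob.o%A_%a      # %A = array job ID, %a = array task index
#SBATCH --array=1-10

cd ${SLURM_SUBMIT_DIR}

# Each array task processes its own input file, selected by the task index.
./a.out < input_${SLURM_ARRAY_TASK_ID}.txt > output_${SLURM_ARRAY_TASK_ID}.txt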

A few things to keep in mind:

  • Valid specifications for job arrays are:

    --array 1-10

    --array 1,2,6-10

    --array 8

    --array 1-100%5 (a limit of 5 jobs can run concurrently)

  • You should limit the number of batch jobs in the queues at any one time to 1,000 or fewer (each job within a job array is counted as one batch job).

  • Interactive batch jobs are not supported with job array submissions.

  • To delete job arrays, see the scancel command section.

Running Serial Jobs

Users often have several single-core (serial) jobs that need to be run. Since the Campus Cluster nodes have multiple cores (16/20/24/28/40/56 cores on a node), using these resources efficiently means running several of these serial jobs on a single node. This can be done with Multiple Batch Jobs (one per serial process) or combined within a Single Batch Job.

Keep memory needs in mind when deciding how many serial processes to run concurrently on a node. If you are running your jobs in the secondary queue, also be aware that the compute nodes on the Campus Cluster have different amounts of memory. To avoid overloading a node, make sure that the memory required for the combined jobs or processes can be accommodated on the node. Assume that approximately 90% of the memory on a node is available for your jobs (the remainder is needed for system processes).

Multiple Batch Jobs

The --mem-per-cpu option in Slurm can be used to submit serial jobs that run concurrently on a node. Requesting a memory amount of 3375 megabytes, a node specification of 1 (--nodes), and 1 task per node (--ntasks-per-node) allows jobs owned by the same user to share a node. Multiple serial jobs submitted this way will then be scheduled to run concurrently on a node.

The following Slurm specification will schedule multiple serial jobs on one node:

#!/bin/bash
#SBATCH --time=00:05:00                  # Job run time (hh:mm:ss)
#SBATCH --account=account_name           # Replace "account_name" with an account available to you
#SBATCH --nodes=1                        # Number of nodes
#SBATCH --ntasks-per-node=1              # Number of tasks (cores/ppn) per node
#SBATCH --mem-per-cpu=3375               # Memory per core (value in MBs)
<other sbatch options>
#
cd ${SLURM_SUBMIT_DIR}

# Run the serial executable
./a.out < input > output

Increasing the --mem-per-cpu value for each job causes fewer jobs to be scheduled concurrently on a single compute node, which leaves more memory available to each job.

The above sbatch specifications are based on the smallest compute node configuration: a compute node with 16 cores and 64GB of memory (54,000 MB usable).

Single Batch Job

Specify the maximum value for --ntasks-per-node as an sbatch option for a batch job and execute multiple serial processes within a one-node batch job. This works best if all the processes are expected to take approximately the same amount of time to complete, because the batch job waits to exit until all the processes finish. The basic template for the job script would be:

#!/bin/bash
#SBATCH --time=00:05:00                  # Job run time (hh:mm:ss)
#SBATCH --account=account_name           # Replace "account_name" with an account available to you
#SBATCH --nodes=1                        # Number of nodes
#SBATCH --ntasks-per-node=16             # Number of tasks (cores/ppn) per node
#SBATCH --job-name=multi-serial_job      # Name of batch job
#SBATCH --partition=secondary            # Partition (queue)
#SBATCH --output=multi-serial.o%j        # Name of batch job output file

executable1 &
executable2 &
.
.
.
executable16 &
wait

The ampersand (&) at the end of each command indicates the process will be backgrounded and allows 16 processes to start concurrently. The wait command at the end is important so the shell waits until the background processes are complete (otherwise the batch job will exit right away). The commands can also be handled in a do loop depending on the specific syntax of the processes.

When running multiple processes in a job, the total number of processes should generally not exceed the number of cores. Also be aware of memory needs so as not to run a node out of memory.

The following example batch script runs 16 instances of Matlab concurrently:

#!/bin/bash
#SBATCH --time=00:30:00                  # Job run time (hh:mm:ss)
#SBATCH --account=account_name           # Replace "account_name" with an account available to you
#SBATCH --nodes=1                        # Number of nodes
#SBATCH --ntasks-per-node=16             # Number of tasks (cores/ppn) per node
#SBATCH --job-name=matlab_job            # Name of batch job
#SBATCH --partition=secondary            # Partition (queue)
#SBATCH --output=multi-serial.o%j        # Name of batch job output file


cd ${SLURM_SUBMIT_DIR}

module load matlab
for (( i=1; i<=16; i++))
do
         matlab -nodisplay -r num.9x2.$i > output.9x2.$i &
done
wait

Job Wait Times

There can be various reasons that contribute to the amount of time a job waits in a queue before it runs.

  • Your job is in your primary queue and all nodes in your investor group are in use by other primary queue jobs. In addition, because the Campus Cluster allows users access to any idle nodes via the secondary queue, jobs submitted to a primary queue could have a wait time of up to the secondary queue maximum wall time of 4 hours.

  • Your job is in the secondary queue. This queue is almost entirely opportunity scheduled because it makes use of idle nodes from investor primary queues. This means that secondary jobs will only run if there is a big enough scheduling hole on the number and type of nodes requested.

  • Preventative Maintenance (PM) on the Campus Cluster is generally scheduled quarterly on the third Wednesday of the month. If the wall time requested by a job will not allow it to complete before an upcoming PM, the job will not start until after the PM.

  • Your job has requested a specific type of resource. For example, nodes with 96GB memory.

  • Your job has requested a combination of resources that are incompatible. In this case, the job will never run. For example, 96GB memory and the cse queue.

Common Batch Job Errors

“Memory Oversubscription” or “Exceeded a Memory Resource”

Errors:

  • “=>> PBS: job killed: swap rate due to memory oversubscription is too high Ctrl-C caught… cleaning up processes”

  • “Job exceeded a memory resource limit (vmem, pvmem, etc.). Job was aborted”

Possible causes: Errors like these indicate that your job used more memory than was available on the node(s) allocated to the batch job. If possible, submit the job to nodes with a larger amount of memory. For MPI jobs, you can also resolve the issue by running fewer processes on each node (so each MPI process has more memory) or by using more MPI processes in total (so each MPI process needs less memory).
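
For example, as an illustration of the "fewer processes per node" option (the numbers are assumptions): a 32-rank MPI job that ran out of memory when packed onto two nodes can be spread across four nodes, roughly doubling the memory available to each rank.

# Original layout (illustrative): 32 MPI ranks packed onto 2 nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# Revised layout: the same 32 ranks spread over 4 nodes, so each rank has roughly twice the memory
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8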

Multi-Node Batch Jobs - “Permission Denied”

Error: “Permission denied (publickey,gssapi-keyex,gssapi-with-mic).”

Possible causes:

  • When the file $HOME/.ssh/authorized_keys has been removed, incorrectly modified, or zeroed out.

    To resolve, remove or rename the .ssh directory, and then log off and log back on. This will regenerate a default .ssh directory along with its contents. If you need to add an entry to $HOME/.ssh/authorized_keys, make sure to leave the original entry in place.

  • When group writable permissions are set for your home directory.

    [golubh1 ~]$ ls -ld ~jdoe
    drwxrwx--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
    

    To resolve, remove the group writable permissions:

    [golubh1 ~]$ chmod g-w ~jdoe
    [golubh1 ~]$ ls -ld ~jdoe
    drwxr-x--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
    

Using MATLAB

See Software - MATLAB for information on using MATLAB on the Campus Cluster.

Running Mathematica Batch Jobs

See Software - Mathematica for information on using Mathematica on the Campus Cluster.