Running Jobs

HAL uses Slurm for job management. For a complete guide to Slurm, see the Slurm documentation. The following are simple examples with system-specific instructions.

Original Slurm

Available Queues

Original Slurm job queues

Name   Priority  Max Walltime  Max Nodes  Min/Max CPUs  Min/Max RAM     Min/Max GPUs  Description
cpu    normal    24 hrs        16         1-96          1.2 GB per CPU  0             CPU-only jobs
gpu    normal    24 hrs        16         1-160         1.2 GB per CPU  0-64          Jobs utilizing GPUs
debug  high      4 hrs         1          1-160         1.2 GB per CPU  0-4           Single-node short jobs

Commands

  • srun: submit an interactive job.

    srun --partition=debug --pty --nodes=1 \
         --ntasks-per-node=16 --cores-per-socket=4 \
         --threads-per-core=4 --sockets-per-node=1 \
         --mem-per-cpu=1200 --gres=gpu:v100:1 \
         --time 01:30:00 --wait=0 \
         --export=ALL /bin/bash
    
  • sbatch: submit a batch job (see the example batch script after this list).

    sbatch [job_script]
    
  • squeue: check job status.

    squeue                  # check all jobs from all users
    squeue -u [username]    # check all jobs belonging to [username]
    
  • scancel: cancel a running job.

    scancel [job_id]   # cancel job with [job_id]
    

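The following is a minimal sketch of a batch script for the gpu queue; the job name, resource counts, and output file pattern are placeholders to adapt to your own job.

    #!/bin/bash
    #SBATCH --job-name=example            # placeholder job name
    #SBATCH --partition=gpu               # queue from the table above
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=16
    #SBATCH --gres=gpu:v100:1             # request one V100 GPU
    #SBATCH --mem-per-cpu=1200
    #SBATCH --time=01:30:00
    #SBATCH --output=%x_%j.out            # stdout goes to <job name>_<job id>.out

    hostname

Saved as, say, example.sb, the script is submitted with:

    sbatch example.sb
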
PBS Commands

Some PBS commands are supported by Slurm.

  • pbsnodes: check node status.

    pbsnodes
    
  • qstat: check job or queue status.

    qstat -f [job_number]  # check job status
    
    qstat # check queue status
    
  • qdel: delete a job.

    qdel [job_number]
    
  • Submit a batch job:

    $ cat test.pbs
    #!/usr/bin/sh
    #PBS -N test
    #PBS -l nodes=1
    #PBS -l walltime=10:00
    
    hostname
    $ qsub test.pbs
    107
    $ cat test.pbs.o107
    hal01.hal.ncsa.illinois.edu
    

Reasons a Pending Job Isn’t Running

Use the following command to get a list of your jobs (replace <username> with your username):

squeue -u <username>

The rightmost column (NODELIST(REASON)) contains the reason each pending job is waiting.
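
If you only want your pending jobs with an explicit reason column, squeue's output can be narrowed down (the field names below follow the standard --Format option):

    squeue -u <username> -t PENDING -O jobid,partition,name,state,reason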

Priority

There is at least one pending job with a higher priority than this job. The priority of a job depends on several factors, the biggest of which is recent usage. You are most likely seeing this reason after running some combination of a large number of jobs, jobs that use a lot of resources, or jobs that run for a long time. The recent usage factor decays over a two-week period, so any usage from more than two weeks before the job was submitted does not affect its priority. You can check your recent HAL usage.

Jobs that are pending for this reason may remain pending for a long time if the recent usage factor has reduced your priority below that of most active users. If someone's recent usage is low enough relative to yours that the difference in the recent usage factor exceeds the waiting time factor, their job may receive a higher priority and run before yours, even if it was submitted later.
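
If the standard Slurm fair-share tools are available on HAL (an assumption; they may be restricted), you can inspect the factors behind your job priority yourself:

    sshare -u <username>    # your recent usage and fair-share factor
    sprio -u <username>     # priority components (age, fair-share, job size) of your pending jobs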

ReqNodeNotAvail

Some of the nodes specifically requested by the job aren’t available. The nodes could be unavailable for one of the following reasons:

  • Running jobs with a higher priority.

  • Reserved in a reservation.

  • Manually drained by an administrator for maintenance.

  • Unavailable due to a hardware or software issue.

This job will run when all the requested nodes become available.
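
Assuming the standard Slurm node-state tools are available, you can check which nodes are unavailable and why:

    sinfo -R       # down/drained nodes with the reason recorded by the administrator
    sinfo -N -l    # per-node state (allocated, reserved, drained, down, ...)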

Resources

This job is at the front of the queue, but there are not enough resources for it to start running. It will start as soon as enough resources become available. The priority calculation favors large jobs, so as resources gradually free up, smaller jobs with a similar recent usage factor will not jump ahead of this job and take away the available resources. Note that if someone has much lower recent usage than you do, their jobs can still run before yours, because the bonus from their recent usage factor can exceed the bonus from your job size factor.
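
Slurm can also report an estimated start time for a pending job; the estimate changes as other jobs finish early or new jobs are submitted:

    squeue --start -j <job_id>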

AssocGrpGRES

This means you have reached the limit of resources that can be allocated to one user at any given time. There are three limits in place:

  • Maximum of 5 running jobs per user.

  • Maximum of 5 nodes in use by your running jobs.

  • Maximum of 16 GPUs in use by your running jobs.

This job will run as soon as some of your running jobs finish and free up the resources.
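
To see how close you are to these limits, list only your running jobs (the NODES column shows the nodes each job occupies):

    squeue -u <username> -t RUNNING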

Reservation

This job is submitted to an inactive reservation. If the reservation is in the future, the job will run when the reservation starts. If the reservation has already ended, the job will remain stuck in the queue until it is deleted.
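
To check whether a reservation is still in the future or has already ended, you can list the reservations Slurm knows about:

    scontrol show reservation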

Error: “sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified”.

Your account has not been properly initialized. Try logging in to and out of hal-login2.ncsa.illinois.edu via SSH a few times.

ssh <username>@hal-login2.ncsa.illinois.edu

If it still isn’t working, contact an admin on Slack or submit a support request.
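
Before contacting an admin, you can also check whether your account association exists in the Slurm database (this assumes regular users are allowed to query sacctmgr):

    sacctmgr show associations user=<username>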