Running Jobs

Please be aware that the interactive (login/head) nodes are a shared resource for all users of the system. Their use should be limited to editing, compiling, and building your programs, and to short, non-intensive runs.

Note

User processes running on the interactive (login/head) nodes are killed automatically if they accrue more than 30 minutes of CPU time or if more than 4 identical processes owned by the same user are running concurrently.

The Illinois HTC system uses HTCondor for workload and job management. The basics of submitting and monitoring jobs, and system-specific configurations are outlined below.

Note

If you are new to HTCondor, the HTCondor quick start guide and submitting HTCondor jobs tutorial video are great resources to use to get started.

Queues

Unlike Slurm, HTCondor does not use separate queues (partitions); jobs are instead matched to machines that satisfy the resources they request.

Batch Commands

Below are brief descriptions of the primary batch commands. For more detailed information, refer to HTCondor documentation.

Submit Command

To submit an HTCondor job, first write a description file and then pass it to the condor_submit command.

condor_submit <description file>

Submit Description File

Refer to the HTCondor documentation - condor_submit page for a complete list of submit description file options.

Your submit description file must include requests for CPUs and memory. The table below outlines the default and maximum allowed values for these variables (current maximum values can be seen in the output of condor_status -compact).

You should not just request the maximum values; you should set these variables based on the needs of your job. Underestimating could result in your job being placed on hold. Overestimating needlessly ties up resources that will not be available to other users.

Submit Description File Resource Variable Defaults and Maximums (largest on a single machine)

Variable          Default Value    Maximum Value
request_cpus      1                28
request_memory    1G               250G

Remember that queue should be the last line in your submit description file; command lines after the queue line are ignored.
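As an illustrative sketch, a minimal submit description file might look like the following. The executable, argument, and file names here are hypothetical; substitute your own, and size the resource requests to your job.

```
# Hypothetical submit description file -- adjust names and values for your job.
executable = my_program            # placeholder executable name
arguments  = input.dat             # placeholder argument

request_cpus   = 2                 # within the 28-core per-machine maximum
request_memory = 4G                # within the 250G per-machine maximum

log    = my_program.log
output = my_program.out
error  = my_program.err

queue                              # queue must be the last line
```

Saved as, say, my_program.sub, this would be submitted with condor_submit my_program.sub.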

Useful Batch Job Environment Variables

HTCondor sets some environment variables automatically inside a job:

Example environment variables set by HTCondor:

Job Scratch Directory: $_CONDOR_SCRATCH_DIR
    This directory is unique for every job that is run, and its contents are deleted by HTCondor when the job stops running on a machine.

Cache Directories: $APPTAINER_CACHEDIR, $SINGULARITY_CACHEDIR
    Directories used by apptainer when building images. Set to the job scratch directory.

Thread Counts: $MKL_NUM_THREADS, $OMP_NUM_THREADS, $OPENBLAS_NUM_THREADS, $PYTHON_CPU_COUNT
    Set to the number of CPU cores provisioned to this job. See the HTCondor documentation for the full list of thread variables.
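As a small sketch, job code can read these variables to size its own thread pools. The helper function below is hypothetical (not part of HTCondor); it falls back to 1 when run outside a job, where the thread-count variables are unset.

```python
import os

def provisioned_cores(env=os.environ):
    """Return the CPU-core count HTCondor provisioned to this job.

    Hypothetical helper: checks the thread-count variables HTCondor sets
    and defaults to 1 outside a job.
    """
    return int(env.get("OMP_NUM_THREADS", env.get("PYTHON_CPU_COUNT", "1")))

print(provisioned_cores({"OMP_NUM_THREADS": "4"}))  # inside a 4-core job: prints 4
print(provisioned_cores({}))                        # outside a job: prints 1
```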

HTCondor can copy the submission environment into the job using getenv in the submit description file. Setting getenv to True copies the entire environment (including PATH); you can also specify a subset of variables to include or exclude.
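For example, these submit-file fragments sketch both forms (the variable name MY_SETTING is illustrative):

```
# Copy the entire submission environment into the job:
getenv = True

# Or copy only selected variables:
getenv = PATH, MY_SETTING
```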

Interactive Job

An interactive batch job provides a way to get interactive access to a compute node via a batch job. To submit an interactive batch job, use the condor_submit command with the -i option, adding -a options to specify memory, CPUs, and so on. The following command requests 4G of memory; because of the defaults, the job also gets 1 CPU.

condor_submit -i -a "request_memory = 4G"

Note

The interactive job will exit after 7200 seconds (2 hours) of inactivity. Typing exit when you are done frees the resources for another job sooner.
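More resources can be requested with additional -a options. For example, to ask for 4 CPUs along with 8G of memory:

```
condor_submit -i -a "request_cpus = 4" -a "request_memory = 8G"
```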

How to Monitor Jobs

To see the status of submitted jobs which have not completed, use the condor_q command. Without any options it will list the status of all your jobs.

Refer to the HTCondor documentation - condor_q page for a complete list of condor_q options. Here are some commonly used options/arguments:

Commonly used options/arguments for condor_q.

Option/Argument    Description
-all               List the status of all jobs on the system.
<Username>         List the status of all of a specific user's jobs.
<JobID>            List nodes allocated to a running job in addition to basic information.
-nobatch           List the status of each individual job instead of grouping them in batches.
-hold              List each held job (status H) with its hold reason.
-l                 Show full job details for each job listed (e.g., request_memory).
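For example (the username jdoe and job ID are placeholders):

```
condor_q                 # your jobs, grouped in batches
condor_q -nobatch        # one line per job
condor_q -hold           # held jobs with hold reasons
condor_q jdoe            # jobs owned by user jdoe
```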

If a job has already completed (successfully or not), it will not show up in the output of condor_q; use condor_history instead. Here are some commonly used options/arguments:

Commonly used options/arguments for condor_history.

Option/Argument    Description
<Username>         List the status of all of a specific user's jobs.
<JobID>            List nodes allocated to a running job in addition to basic information.
-l                 Show full job details for each job listed (e.g., request_memory).
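For example (the username and job ID are placeholders):

```
condor_history jdoe          # completed jobs for user jdoe
condor_history -l 1234.0     # full details for a finished job
```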

Held Jobs

Sometimes HTCondor puts a job on hold due to problems running it; the job is evicted from the compute node and appears in the queue with H status. Some example situations where a job will be put on hold:

  • Job uses more memory than requested.

  • Executable for job doesn’t exist (typically an incorrect path or typo in the submit file).

  • Filesystem issues preventing transfer of job stdout/stderr.

Release an "on hold" job back into the idle state (ready to run) with the condor_release command, giving either a specific job ID or your username to release all of your jobs.

To put a job on hold, use the condor_hold command, again with either a specific job ID or your username for all of your jobs.
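For example (the job ID and username are placeholders):

```
condor_hold 1234.0        # put one job on hold
condor_release 1234.0     # release it back to the idle state
condor_release jdoe       # release all of user jdoe's held jobs
```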

Remove Jobs

To cancel one or more jobs in the queue (idle or running), use the condor_rm command with a specific job id or your username for all your jobs.

Occasionally, HTCondor has trouble contacting the job on the compute machine (for example, the compute machine is in a bad state) and cannot nicely kill the job. In these cases, HTCondor will put the job in an X state. To hard-remove the job from the queue, add -forcex to the condor_rm command.
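For example (the job ID and username are placeholders):

```
condor_rm 1234.0          # remove one job from the queue
condor_rm jdoe            # remove all of user jdoe's jobs
condor_rm -forcex 1234.0  # hard-remove a job stuck in the X state
```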

Job Dependencies

HTCondor has a tool called DAGMan that allows users to define job dependencies as a Directed Acyclic Graph.
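As a brief sketch, a two-step dependency (B runs only after A completes successfully) could be described in a DAG input file like the one below; the file names are hypothetical, and each JOB line points at an ordinary submit description file.

```
# diamond.dag -- hypothetical DAG input file
JOB  A  stepA.sub
JOB  B  stepB.sub
PARENT A CHILD B
```

The DAG would then be submitted with condor_submit_dag diamond.dag, which runs the jobs in dependency order.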