Running Jobs
Please be aware that the interactive (login/head) nodes are a shared resource for all users of the system; their use should be limited to editing, compiling, and building your programs, and to short, non-intensive runs.
Note
User processes running on the interactive (login/head) nodes are killed automatically if they accrue more than 30 minutes of CPU time or if more than 4 identical processes owned by the same user are running concurrently.
The Illinois HTC system uses HTCondor for workload and job management. The basics of submitting and monitoring jobs, and system-specific configurations are outlined below.
Note
If you are new to HTCondor, the HTCondor quick start guide and the submitting HTCondor jobs tutorial video are great resources for getting started.
Queues
Unlike Slurm, HTCondor does not have separate queues or partitions; jobs are matched to machines based on the resources they request.
Batch Commands
Below are brief descriptions of the primary batch commands. For more detailed information, refer to the HTCondor documentation.
Submit Command
To submit an HTCondor job, first write a submit description file and then pass it to the condor_submit command:

```
condor_submit <description file>
```
Submit Description File
Refer to the HTCondor documentation - condor_submit page for a complete list of submit description file options.
Your submit description file must include requests for CPUs and memory. The table below outlines the default and maximum allowed values for these variables (current maximum values can be seen in the output of condor_status -compact).
Do not simply request the maximum values; set these variables based on the needs of your job. Underestimating could result in your job being placed on hold, while overestimating needlessly ties up resources that will not be available to other users.
| Variable | Default Value | Maximum Value |
| --- | --- | --- |
| request_cpus | 1 | 28 |
| request_memory | 1G | 250G |
Reminder: queue should be the last line in your submit description file; command lines after the queue statement are ignored.
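For reference, below is a minimal submit description file that follows these rules; the executable, file names, and resource values are placeholders to adapt to your job.

```
# example.sub - minimal submit description file (names are placeholders)
executable     = my_job.sh     # program or script to run
arguments      = input.dat     # arguments passed to the executable
log            = job.log       # HTCondor event log
output         = job.out       # captured stdout
error          = job.err       # captured stderr

request_cpus   = 4             # stay within the maximum of 28
request_memory = 8G            # stay within the maximum of 250G

queue                          # must be the last line
```

Submit it with condor_submit example.sub.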
Useful Batch Job Environment Variables
HTCondor sets some environment variables automatically inside a job:
| Purpose | Environment Variable | Description |
| --- | --- | --- |
| Job Scratch Directory | $_CONDOR_SCRATCH_DIR | Unique for every job that is run; its contents are deleted by HTCondor when the job stops running on a machine. |
| Cache Directories | $APPTAINER_CACHEDIR, $SINGULARITY_CACHEDIR | Directory used by Apptainer/Singularity when building images. Set to the job scratch directory. |
| Thread Counts | $MKL_NUM_THREADS, $OMP_NUM_THREADS, $OPENBLAS_NUM_THREADS, $PYTHON_CPU_COUNT | Set to the number of CPU cores provisioned to the job. For the full list of thread-count variables, see the HTCondor documentation. |
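As an illustration, a job script can do its work in the scratch directory and size its thread pool from these variables; my_solver below is a hypothetical program.

```
#!/bin/bash
# Hypothetical job script: work inside the per-job scratch directory,
# which HTCondor cleans up automatically when the job ends.
cd "$_CONDOR_SCRATCH_DIR"

# Use the provisioned core count instead of hard-coding a thread number.
./my_solver --threads "$OMP_NUM_THREADS" input.dat
```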
HTCondor can copy the submission environment into the job using getenv in the submit description file. Setting getenv to true copies the entire environment (including PATH); you can also specify a subset of variables to include or exclude.
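For example (a sketch; the matchlist form with ! exclusions is only available in recent HTCondor releases, so check the condor_submit documentation for your version):

```
# Copy the entire submission environment (including PATH) into the job
getenv = true

# Or copy only matching variables; a leading ! excludes names
getenv = HOME, PATH, SSH_*, !SSH_AGENT_PID
```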
Interactive Job
An interactive batch job provides a way to get interactive access to a compute node via a batch job. To submit one, use the condor_submit command with the -i option, and include -a options to specify memory, CPUs, and so on. The following command requests 4G of memory and, because of the defaults, gets 1 CPU:

```
condor_submit -i -a "request_memory = 4G"
```
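Each -a (append) option adds one line to the generated submit description, so multiple requests can be combined; for example, to also request 4 CPUs:

```
condor_submit -i -a "request_memory = 4G" -a "request_cpus = 4"
```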
Note
The interactive job will exit after 7200 seconds (2 hours) of inactivity. Typing exit when you are done frees up the resources for another job right away rather than waiting for the timeout.
How to Monitor Jobs
To see the status of submitted jobs that have not yet completed, use the condor_q command. Without any options, it lists the status of all of your jobs.

Refer to the HTCondor documentation - condor_q page for a complete list of condor_q options. Here are some commonly used options/arguments:
| Option/Argument | Description |
| --- | --- |
| -allusers | List the status of all jobs on the system. |
| <username> | List the status of a specific user's jobs. |
| -run | List nodes allocated to a running job in addition to basic information. |
| -nobatch | List the status of each individual job instead of grouping jobs into batches. |
| -hold | List each held job (status H) with its hold reason. |
| -long | Show the full details for each job listed (e.g., request_memory). |
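For example (the username and job ID are placeholders):

```
condor_q                   # your jobs, grouped into batches
condor_q -nobatch          # one line per job
condor_q -hold             # held jobs with their hold reasons
condor_q jdoe -run         # where jdoe's running jobs are executing
condor_q -long 12345.0     # full details for a single job
```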
If a job has already completed (successfully or not), it will not show up in the output of condor_q. Instead, use condor_history. Here are some commonly used options/arguments:
| Option/Argument | Description |
| --- | --- |
| <username> | List the status of a specific user's jobs. |
|  | List nodes allocated to a running job in addition to basic information. |
| -long | Show the full details for each job listed (e.g., request_memory). |
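For example (the username and job ID are placeholders):

```
condor_history jdoe            # jdoe's completed jobs
condor_history -long 12345.0   # full details for one completed job
```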
Held Jobs
Sometimes HTCondor puts a job on hold due to problems running it. Example situations where a job is put on hold (evicted from the compute node and shown in the queue with H status):

- The job uses more memory than requested.
- The executable for the job doesn't exist (typically an incorrect path or typo in the submit file).
- Filesystem issues prevent transfer of the job's stdout/stderr.
Release an "on hold" job back into the idle state (ready to run) with the condor_release command, giving either a specific job ID or your username (to release all of your held jobs).

To put a job on hold, use the condor_hold command with a specific job ID or your username (for all of your jobs).
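For example (the job ID and username are placeholders):

```
condor_release 12345.0   # release one held job
condor_release jdoe      # release all of jdoe's held jobs
condor_hold 12345.0      # put one job on hold
```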
Remove Jobs
To cancel one or more jobs in the queue (idle or running), use the condor_rm command with a specific job ID or your username (for all of your jobs).

Occasionally, HTCondor has trouble contacting the job on the compute machine (for example, because the machine is in a bad state) and cannot kill the job cleanly. In these cases, HTCondor puts the job in the X state. To hard-remove such a job from the queue, add -forcex to the condor_rm command.
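For example (the job ID and username are placeholders):

```
condor_rm 12345.0           # remove one job
condor_rm jdoe              # remove all of jdoe's jobs
condor_rm -forcex 12345.0   # force removal of a job stuck in the X state
```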
Job Dependencies
HTCondor has a tool called DAGMan that allows users to define job dependencies as a directed acyclic graph (DAG): each node in the graph is a job, and each edge declares that one job must finish successfully before another starts.
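As a minimal sketch (the file names are hypothetical), the DAG below runs jobs B and C only after job A succeeds; each JOB line names a submit description file:

```
# diamond.dag
JOB A a.sub
JOB B b.sub
JOB C c.sub
PARENT A CHILD B C
```

Submit the DAG with condor_submit_dag diamond.dag; DAGMan itself runs as a job that submits and monitors the node jobs.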