Running Jobs

Accessing the Compute Nodes

User access to the compute nodes for running jobs is available via a batch job. The Campus Cluster uses the Slurm Workload Manager for running batch jobs. See the Batch Commands section for details on batch job submission.

Please be aware that the interactive (login/head) nodes are a shared resource for all users of the system and their use should be limited to editing, compiling and building your programs, and for short non-intensive runs.

Note

User processes running on the interactive (login/head) nodes are killed automatically if they accrue more than 30 minutes of CPU time or if more than 4 identical processes owned by the same user are running concurrently.

An interactive batch job provides interactive access to a compute node. See the srun section for information on how to run an interactive job on the compute nodes. In addition, a test queue for very short jobs provides quick turnaround time for debugging.

To ensure the health of the batch system and scheduler, users should refrain from having more than 1,000 batch jobs in the queues at any one time.

See the Running Serial Jobs section for information on expediting job turnaround time for serial jobs.

See the Using MATLAB / Running Mathematica Batch Jobs sections for information on running MATLAB and Mathematica on the Campus Cluster.

Running Programs

On successful building (compilation and linking) of your program, an executable is created that is used to run the program. The table below describes how to run different types of programs.

How to run different types of programs

Serial

  How to run: To run serial code, specify the name of the executable.

  Example command: ./a.out

MPI

  How to run: MPI programs are run with the srun command followed by the name of the executable. The total number of MPI processes is {number of nodes} x {cores/node}, as set in the batch job resource specification.

  Example command: srun ./a.out

OpenMP

  How to run: The OMP_NUM_THREADS environment variable can be set to specify the number of threads used by OpenMP programs. If this variable is not set, the number of threads used defaults to one under the Intel compiler. Under GCC, the default behavior is to use one thread for each core available on the node. To run OpenMP programs, specify the name of the executable.

  Example commands:

  • In bash: export OMP_NUM_THREADS=16

  • In tcsh: setenv OMP_NUM_THREADS 16

  ./a.out

MPI/OpenMP

  How to run: As with OpenMP programs, the OMP_NUM_THREADS environment variable can be set to specify the number of threads used by the OpenMP portion of the mixed MPI/OpenMP program. The same default behavior applies with respect to the number of threads used. Use the srun command followed by the name of the executable to run mixed MPI/OpenMP programs. The number of MPI processes per node is set in the batch job resource specification for number of cores/node.

  Example commands:

  • In bash: export OMP_NUM_THREADS=4

  • In tcsh: setenv OMP_NUM_THREADS 4

  srun ./a.out
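
As a rough sketch of the arithmetic described above, the following bash snippet works out the total number of MPI processes and a thread count per process for a mixed MPI/OpenMP job. The values used (2 nodes, 16-core nodes, 4 MPI processes per node) are illustrative assumptions, not Campus Cluster defaults.

```shell
#!/bin/bash
# Illustrative values only; adjust to match your own batch job request.
nodes=2              # value given to --nodes
tasks_per_node=4     # value given to --ntasks-per-node (MPI processes per node)
cores_per_node=16    # cores on each node (assumed 16-core nodes)

# Total MPI processes = {number of nodes} x {tasks per node}
total_mpi_procs=$(( nodes * tasks_per_node ))

# Give each MPI process an equal share of the node's cores as OpenMP threads
threads_per_proc=$(( cores_per_node / tasks_per_node ))
export OMP_NUM_THREADS=$threads_per_proc

echo "MPI processes: $total_mpi_procs"
echo "OMP_NUM_THREADS: $OMP_NUM_THREADS"
```

For the threads to have cores to run on, the job would likely also need to request them, for example with the --cpus-per-task option; check the sbatch man page for how that option interacts with --ntasks-per-node.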

Queues

Primary Queues

Each investor group has unrestricted access to a dedicated primary queue with concurrent access to the number and type of nodes in which they invested.

Users can view the partitions (queues) to which they can submit batch jobs with the following command:

[cc-login1 ~]$ sinfo -s -o "%.25R %.12l %.12L %.5D"

Users can also view specific configuration information about the compute nodes associated with their primary partition(s), with the following command (replace <partition_name> with the name of the partition):

[cc-login1 ~]$ sinfo -p <partition_name> -N -o "%.8N %.4c %.25P %.9m %.12l %.12L %G"

Secondary Queues

One of the advantages of the Campus Cluster Program is the ability to share resources. A shared secondary queue will allow users access to any idle nodes in the cluster. Users must have access to a primary queue to be eligible to use the secondary queue.

While each investor has full access to the number and type of nodes in which they invested, those resources not fully utilized by each investor will become eligible to run secondary queue jobs. If there are resources eligible to run secondary queue jobs but there are no jobs to be run from the secondary queue, jobs in the primary queues that fit within the constraints of the secondary queue may be run on any otherwise appropriate idle nodes. The secondary queue uses fairshare scheduling.

Limits of the secondary queues

Queue            Max Walltime    Max # Nodes
secondary        4 hours         305
secondary-Eth    4 hours         21

  • Jobs are routed to the secondary queue when a queue is not specified; that is, the secondary queue is the default queue on the Campus Cluster.

  • The difference between the secondary and secondary-Eth queues is the interconnect: compute nodes in the secondary queue are connected via InfiniBand (IB), while compute nodes in the secondary-Eth queue are connected via Ethernet. Ethernet is currently slower than InfiniBand, but this affects performance only for batch jobs that use multiple nodes and communicate between them (such as MPI codes) or for jobs with heavy file system I/O requirements.

Test Queue

A test queue is available for providing very short jobs with quick turnaround time.

Limits of the test queue

Queue    Max Walltime    Max # Nodes
test     4 hours         2

Batch Commands

Below are brief descriptions of the primary batch commands. For more detailed information, refer to the individual man pages.

sbatch

Batch jobs are submitted through a job script using the sbatch command. Job scripts generally start with a series of SLURM directives that describe the requirements of the job, such as the number of nodes and the wall time required, to the batch system/scheduler. SLURM directives can also be specified as options on the sbatch command line; command line options take precedence over those in the script. The rest of the batch script consists of user commands.

Sample batch scripts are available in the directory /projects/consult/slurm.

The syntax for sbatch is:

sbatch [list of sbatch options] script_name

The main sbatch options are listed below. See the sbatch man page for more options.

  • --account=account_name

    account_name is the name of an account available to you. If you don’t know the account(s) available to you, ask your technical representative or submit a support request.

  • --time=time

    time is the maximum wall clock time (d-hh:mm:ss) [default: maximum limit of the queue (partition) submitted to]

  • --nodes=n

    n is the number of 16/20/24/28/40/128-core nodes [default: 1 node]

  • --ntasks=p

    p is the total number of cores (tasks) for the batch job (1 through 40) [default: 1 core]

  • --ntasks-per-node=p

    p is the number of cores (tasks) per node, the same as ppn under PBS (1 through 40) [default: 1 core]

Example:

--account=account_name    # <- replace "account_name" with an account available to you
--time=00:30:00
--nodes=2
--ntasks=32

or

--account=account_name    # <- replace "account_name" with an account available to you
--time=00:30:00
--nodes=2
--ntasks-per-node=16
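
Putting these options together, a minimal sketch of a complete job script might look like the following. The snippet writes the script to a file named example_job.sbatch (a hypothetical name chosen for illustration); account_name and ./a.out are placeholders, and the sample scripts in /projects/consult/slurm remain the authoritative starting point.

```shell
#!/bin/bash
# Write a minimal job script to a file; the file name is a hypothetical example.
cat > example_job.sbatch <<'EOF'
#!/bin/bash
#SBATCH --account=account_name    # <- replace with an account available to you
#SBATCH --time=00:30:00           # maximum wall clock time
#SBATCH --nodes=2                 # number of nodes
#SBATCH --ntasks-per-node=16      # cores (tasks) per node

# Run the MPI executable on all allocated cores
srun ./a.out
EOF

echo "Wrote example_job.sbatch; submit it with: sbatch example_job.sbatch"
```

Writing the script through a quoted here-document ('EOF') keeps the shell from expanding anything inside it, so the file contains the directives exactly as typed.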

Memory needs

For investors that have nodes with varying amounts of memory or to run in the secondary queue, nodes with a specific amount of memory can be targeted. The compute nodes have memory configurations of 64GB, 128GB, 192GB, 256GB or 384GB. Not all memory configurations are available in all investor queues.

For a list of all the nodes you have access to, with information about CPUs and memory, execute:

sinfo -N -l

You can also check with the technical representative of your investor group to determine what memory configurations are available for the nodes in your primary queue.

Warning

Do not use the memory specification unless absolutely required since it could delay scheduling of the job; also, if nodes with the specified memory are unavailable for the specified queue the job will never run.

Example:

--account=account_name    # <- replace "account_name" with an account available to you
--time=00:30:00
--nodes=2
--ntasks=32
--mem=118000

or

--account=account_name    # <- replace "account_name" with an account available to you
--time=00:30:00
--nodes=2
--ntasks-per-node=16
--mem-per-cpu=7375
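
The two examples above request the same amount of memory, assuming the 32 tasks are spread evenly at 16 per node: the --mem value is per node, so dividing it by the tasks per node gives the per-core value. A small sketch of that arithmetic (the variable names are illustrative):

```shell
#!/bin/bash
mem_per_node_mb=118000   # value passed to --mem (MB per node)
tasks_per_node=16        # value passed to --ntasks-per-node

# Equivalent per-core request for --mem-per-cpu
mem_per_cpu_mb=$(( mem_per_node_mb / tasks_per_node ))

echo "--mem=${mem_per_node_mb} corresponds to --mem-per-cpu=${mem_per_cpu_mb}"
```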

Specifying nodes with GPUs

To run jobs on nodes with GPUs, add a GPU resource specification if your primary queue has nodes with multiple types of GPUs, has nodes both with and without GPUs, or if you are submitting jobs to the secondary queue. The available specifications are TeslaM2090 (for Tesla M2090), TeslaK40M (for Tesla K40M), K80 (for Tesla K80), P100 (for Tesla P100), V100 (for Tesla V100), TeslaT4 (for Tesla T4), and A40 (for Tesla A40). Through the secondary queue, any user can access the nodes that are configured with any of these GPUs.

Example:

--gres=gpu:V100

or

--gres=gpu:V100:2

to specify two V100 GPUs (default is 1 if no number is specified after the gpu type).

Note

Requesting more GPUs than what is available on a single compute node will result in a failed batch job submission.

To determine whether GPUs are available on any of the compute nodes in your group’s partition (queue), run the command below (replace <partition_name> with the name of the partition) or check with the technical representative of your investor group.

sinfo -p <partition_name> -N -o "%.8N %.4c %.16G %.25P %50f"

Useful Batch Job Environment Variables

Useful batch job environment variables

JobID

  SLURM environment variable: $SLURM_JOB_ID

  Detail description: Job identifier assigned to the job.

  PBS environment variable (no longer valid): $PBS_JOBID

Job Submission Directory

  SLURM environment variable: $SLURM_SUBMIT_DIR

  Detail description: By default, jobs start in the directory the job was submitted from. So the cd $SLURM_SUBMIT_DIR command is not needed.

  PBS environment variable (no longer valid): $PBS_O_WORKDIR

Machine (node) list

  SLURM environment variable: $SLURM_NODELIST

  Detail description: Variable name that contains the list of nodes assigned to the batch job.

  PBS environment variable (no longer valid): $PBS_NODEFILE

Array JobID

  SLURM environment variables: $SLURM_ARRAY_JOB_ID, $SLURM_ARRAY_TASK_ID

  Detail description: Each member of a job array is assigned a unique identifier (see the Job Arrays section).

  PBS environment variable (no longer valid): $PBS_ARRAYID

See the sbatch man page for additional environment variables available.
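
As an illustration, a job script might read these variables as sketched below. The fallback values after ":-" and the results file naming scheme are assumptions added so the snippet also runs outside of a batch job; inside a job, Slurm sets the variables itself.

```shell
#!/bin/bash
# Inside a batch job Slurm sets these variables. The fallbacks after ":-"
# are assumptions added so the snippet also runs outside of a job.
job_id="${SLURM_JOB_ID:-not_in_a_job}"
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
node_list="${SLURM_NODELIST:-${HOSTNAME:-localhost}}"

echo "Job ID:           $job_id"
echo "Submit directory: $submit_dir"
echo "Node list:        $node_list"

# Example use (hypothetical naming scheme): a per-job results file
results_file="results.${job_id}.txt"
echo "Results would be written to: $results_file"
```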

srun

The srun command initiates an interactive job on the compute nodes.

For example, the following command will run an interactive job in the “ncsa” queue with a wall clock limit of 30 minutes, using one node and 16 cores per node. The compute time will be charged to the “account_name” account.

[cc-login1 ~]$ srun -A account_name --partition=ncsa --time=00:30:00 --nodes=1 --ntasks-per-node=16 --pty /bin/bash

You can also use other sbatch options such as those documented above.

After you enter the command, you will have to wait for SLURM to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you specify a small number of nodes for smaller amounts of time, the wait should be shorter because your job will backfill among larger jobs. You will see something like this:

srun: job 123456 queued and waiting for resources

Once the job starts, you will see the following message and will be presented with an interactive shell prompt on the launch node:

srun: job 123456 has been allocated resources

At this point, you can use the appropriate command to start your program.

When you are done with your runs, you can use the exit command to end the job.

squeue

Commands that display the status of batch jobs

SLURM Example Command      Command Description
squeue -a                  List the status of all jobs on the system.
squeue -u $USER            List the status of all your jobs in the batch system.
squeue -j JobID            List nodes allocated to a running job in addition to basic information.
scontrol show job JobID    List detailed information on a particular job.
sinfo -a                   List summary information on all the queues.

See the man page for other options available.

scancel

The scancel command deletes a queued job or kills a running job.

scancel JobID deletes/kills a job.

Job Dependencies

SLURM job dependencies allow users to set the order in which their queued jobs run. Job dependencies are set with the --dependency option, using the syntax --dependency=<dependency type>:<JobID>. SLURM keeps dependent jobs pending until they are eligible to run.

The following examples show how to specify job dependencies using the afterany dependency type, which tells SLURM that the dependent job should become eligible to start only after the specified job has ended, regardless of how it ended.

On the command line:

[cc-login1 ~]$ sbatch --dependency=afterany:<JobID> jobscript.sbatch

In a job script:

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --account=account_name    # <- replace "account_name" with an account available to you
#SBATCH --job-name="myjob"
#SBATCH --partition=secondary
#SBATCH --output=myjob.o%j
#SBATCH --dependency=afterany:<JobID>

In a shell script that submits batch jobs:

#!/bin/bash
# sbatch prints "Submitted batch job <JobID>"; field 4 is the job ID
JOB_01=$(sbatch jobscript1.sbatch | cut -f 4 -d " ")
JOB_02=$(sbatch --dependency=afterany:$JOB_01 jobscript2.sbatch | cut -f 4 -d " ")
JOB_03=$(sbatch --dependency=afterany:$JOB_02 jobscript3.sbatch | cut -f 4 -d " ")
...
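
The cut command in the script above relies on the message that sbatch prints on a successful submission. The sketch below applies the same parsing to a sample message; the job number is made up for illustration, and no job is actually submitted.

```shell
#!/bin/bash
# Sample sbatch output (the job number is made up for illustration)
sbatch_output="Submitted batch job 123456"

# Field 4, using a single space as the delimiter, is the job ID
job_id=$(echo "$sbatch_output" | cut -f 4 -d " ")

echo "Parsed job ID: $job_id"
echo "Dependency option: --dependency=afterany:$job_id"
```

The sbatch --parsable option is an alternative that prints the job ID without the surrounding text; see the sbatch man page for its exact output format.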

Generally, the recommended dependency types to use are after, afterany, afternotok, and afterok. While there are additional dependency types, those types that work based on batch job error codes may not behave as expected because of the difference between a batch job error and application errors. See the dependency section of the sbatch manual page for additional information (man sbatch).

Job Constraints

Use the --constraint option to specify required features for a job. Refer to the Slurm srun --constraint documentation for more details. (You can also find the same information in the Slurm sbatch documentation and Slurm salloc documentation.)

Features available on Campus Cluster include:

  • CPU type (AE7713, E2680V4, G2348, …)

  • GPU type (NoGPU, P100, K80, …)

  • Memory (64G, 128G, 256G, 512G, …)

  • Interconnect (E1G, E10G, FDR, HDR, …)

Run the sinfo command below to see a full list of features for nodes that are in queues that you can submit to:

sinfo -N --format="%R (%N): %f" -S %R | more

If one or more constraints cannot be satisfied, your job will not run and squeue will report BadConstraints as the reason; refer to the Slurm squeue documentation.

Job Arrays

If a need arises to submit the same job to the batch system multiple times, instead of issuing one sbatch command for each individual job, users can submit a job array. Job arrays allow users to submit multiple jobs with a single job script using the --array option to sbatch. An optional slot limit can be specified to limit the number of jobs that can run concurrently in the job array. See the sbatch manual page for details (man sbatch). The file names for the input, output, and so on, can be varied for each job using the job array index value defined by the SLURM environment variable SLURM_ARRAY_TASK_ID.

A sample batch script that makes use of job arrays is available in /projects/consult/slurm/jobarray.sbatch.

A few things to keep in mind:

  • Valid specifications for job arrays are:

    --array 1-10

    --array 1,2,6-10

    --array 8

    --array 1-100%5 (a limit of 5 jobs can run concurrently)

  • You should limit the number of batch jobs in the queues at any one time to 1,000 or fewer (each job within a job array is counted as one batch job).

  • Interactive batch jobs are not supported with job array submissions.

  • To delete job arrays, see the scancel command section.
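
As a sketch of how the array index can vary file names, the snippet below builds input and output names from SLURM_ARRAY_TASK_ID. The input.N and output.N naming scheme, the ./a.out executable, and the fallback index of 3 (used only so the snippet runs outside a job array) are assumptions for illustration.

```shell
#!/bin/bash
# Slurm sets SLURM_ARRAY_TASK_ID for each member of a job array.
# The fallback value of 3 is an assumption so the snippet runs outside a job.
task_id="${SLURM_ARRAY_TASK_ID:-3}"

# Hypothetical naming scheme: input.1, input.2, ... and matching outputs
input_file="input.${task_id}"
output_file="output.${task_id}"

echo "Array task ${task_id} would run: ./a.out < ${input_file} > ${output_file}"
```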

Running Serial Jobs

Users often have a number of single-core (serial) jobs that need to be run. Since the Campus Cluster nodes have multiple cores (16/20/24/28/40/56 cores on a node), using these resources efficiently means running several of these jobs on a single node. This can be done with Multiple Batch Jobs (one per serial process) or combined within a Single Batch Job.

Keep memory needs in mind when deciding how many serial processes to run concurrently on a node. If you are running your jobs in the secondary queue, also be aware that the compute nodes on the Campus Cluster have different amounts of memory. To avoid overloading the node, make sure that the memory required for multiple jobs or processes can be accommodated on a node. Assume that approximately 90% of the memory on a node is available for your job (the remainder is needed for system processes).

Multiple Batch Jobs

The --mem-per-cpu option in Slurm can be used to submit serial jobs that run concurrently on a node. Starting with a memory amount of 3375 megabytes, together with a node specification of 1 (--nodes) and a specification of 1 for the tasks per node (--ntasks-per-node), allows jobs owned by the same user to share a node. Multiple serial jobs submitted this way will then be scheduled to run concurrently on a node.

The following Slurm specification will schedule multiple serial jobs on one node:

#!/bin/bash
#SBATCH --time=00:05:00                  # Job run time (hh:mm:ss)
#SBATCH --account=account_name           # Replace "account_name" with an account available to you
#SBATCH --nodes=1                        # Number of nodes
#SBATCH --ntasks-per-node=1              # Number of task (cores/ppn) per node
#SBATCH --mem-per-cpu=3375               # Memory per core (value in MBs)
<other sbatch options>
#
cd ${SLURM_SUBMIT_DIR}

# Run the serial executable
./a.out < input > output

Increasing the --mem-per-cpu value for each job causes fewer jobs to be scheduled concurrently on a single compute node, leaving more memory available to each job.

The above sbatch specifications are based on the smallest compute node configuration: a compute node with 16 cores and 64GB of memory (54000 MB usable).
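
To see how the --mem-per-cpu value affects sharing, the sketch below estimates how many single-core jobs fit on the smallest node configuration for a few memory requests. The 54000 MB and 16-core figures come from the text above; the larger memory values are illustrative, and the actual placement is up to the scheduler.

```shell
#!/bin/bash
usable_mb=54000   # usable memory on a 16-core, 64GB node
cores=16          # cores on that node

for mem_per_cpu in 3375 6750 13500; do
    # Jobs that fit by memory, capped at the number of cores
    by_memory=$(( usable_mb / mem_per_cpu ))
    jobs_per_node=$(( by_memory < cores ? by_memory : cores ))
    echo "--mem-per-cpu=${mem_per_cpu}: up to ${jobs_per_node} serial jobs per node"
done
```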

Single Batch Job

Specify the maximum value for --ntasks-per-node as an sbatch option/specification for a batch job and execute multiple serial processes within a one-node batch job. This works best if all the processes are estimated to take approximately the same amount of time to complete because the batch job waits to exit until all the processes finish. The basic template for the job script would be:

#!/bin/bash
#SBATCH --time=00:05:00                  # Job run time (hh:mm:ss)
#SBATCH --account=account_name           # Replace "account_name" with an account available to you
#SBATCH --nodes=1                        # Number of nodes
#SBATCH --ntasks-per-node=16             # Number of task (cores/ppn) per node
#SBATCH --job-name=multi-serial_job      # Name of batch job
#SBATCH --partition=secondary            # Partition (queue)
#SBATCH --output=multi-serial.o%j        # Name of batch job output file

executable1 &
executable2 &
.
.
.
executable16 &
wait

The ampersand (&) at the end of each command indicates the process will be backgrounded and allows 16 processes to start concurrently. The wait command at the end is important so the shell waits until the background processes are complete (otherwise the batch job will exit right away). The commands can also be handled in a loop, depending on the specific syntax of the processes.

When running multiple processes in a job, the total number of processes should generally not exceed the number of cores. Also be aware of memory needs so as not to run a node out of memory.
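
The background-and-wait pattern can be sketched with a loop as follows. The run_task function is a hypothetical stand-in for a real serial executable, and the task_output.N file names are likewise chosen only for illustration.

```shell
#!/bin/bash
# Stand-in for a real serial executable (hypothetical); replace with your program.
run_task() {
    echo "task $1 done" > "task_output.$1"
}

num_tasks=16   # should generally not exceed the number of cores requested

for (( i = 1; i <= num_tasks; i++ )); do
    run_task "$i" &     # "&" runs each task in the background
done

wait                    # block until every background task has finished

# Count the output files produced by the tasks
outputs=(task_output.*)
echo "Completed ${#outputs[@]} tasks"
```

Without the wait, the script would reach its end while tasks are still running, which is why the batch job would exit right away.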

The following example batch script runs 16 instances of Matlab concurrently:

#!/bin/bash
#SBATCH --time=00:30:00                  # Job run time (hh:mm:ss)
#SBATCH --account=account_name           # Replace "account_name" with an account available to you
#SBATCH --nodes=1                        # Number of nodes
#SBATCH --ntasks-per-node=16             # Number of task (cores/ppn) per node
#SBATCH --job-name=matlab_job            # Name of batch job
#SBATCH --partition=secondary            # Partition (queue)
#SBATCH --output=multi-serial.o%j        # Name of batch job output file


cd ${SLURM_SUBMIT_DIR}

module load matlab
for (( i=1; i<=16; i++))
do
         matlab -nodisplay -r num.9x2.$i > output.9x2.$i &
done
wait

Using MATLAB

Introduction

MATLAB (MATrix LABoratory) is a high-level language and interactive environment for numerical computation, visualization, and programming. Developed by MathWorks, MATLAB allows you to analyze data, develop algorithms, and create models and applications.

MATLAB is available on the Campus Cluster along with a collection of toolboxes all of which are covered by a campus concurrent license.

Versions

The table below lists the versions of MATLAB installed on the Campus Cluster.

Matlab versions installed on the Campus Cluster

Version       Release Name
MATLAB 9.7    2019b
MATLAB 9.5    2018b
MATLAB 9.4    2018a

Adding MATLAB to Your Environment

Each MATLAB installation on the Campus Cluster has a module that you can use to load a specific version of MATLAB into your user environment. You can see the available versions of MATLAB by typing module avail matlab on the command line. The latest version of MATLAB can be loaded into your environment by typing module load matlab. To load a specific version, you will need to load the corresponding module. See the Managing Your Environment (Modules) section for more information about modules.

The MATLAB modules make the corresponding MATLAB product as well as all the installed toolboxes available to the user environment. To verify which toolboxes are available (and the MATLAB version), type ver at the prompt of an interactive MATLAB session.

Running MATLAB Batch Jobs

Execution of MATLAB should be restricted to compute nodes that are part of a batch job. For detailed information about running jobs on the Campus Cluster, see Running Jobs.

Standard batch job

A sample batch script that runs a single MATLAB task with a single m-file is available in /projects/consult/slurm/matlab.sbatch that you can copy and modify for your own use. Submit the job with:

[cc-login1 ~]$ sbatch matlab.sbatch

Interactive batch job

For the GUI (which will display on your local machine), use the --x11 option with the srun command. Replace account_name with the name of an account available to you. If you don’t know the account(s) available to you, ask your technical representative or submit a support request.

srun -A account_name --x11 --export=All --time=00:30:00 --nodes=1 --cpus-per-task=16 --partition=secondary --pty /bin/bash

Once the batch job starts, you will have an interactive shell prompt on a compute node. Then type:

module load matlab
matlab
  • An X-Server must be running on your local machine with X11 forwarding enabled within your SSH connection in order to display X-Apps, GUIs, and so on, back on your local machine.

  • Generally, users on Linux-based machines only have to enable X11 forwarding by passing an option (-X or -Y) to the SSH command.

  • Users on Windows machines will need to ensure that their SSH client has X11 forwarding enabled and have an X-Server running.

  • A list of SSH clients (which includes a combo packaged SSH client and X-Server) can be found in the SSH section.

  • Additional information about running X applications can be found on the Using the X Window System page.

For the command line interface:

srun -A account_name --export=All --time=00:30:00 --nodes=1 --cpus-per-task=16 --partition=secondary --pty /bin/bash

(Replace account_name with the name of an account available to you. If you don’t know the account(s) available to you, ask your technical representative or submit a support request.)

Once the batch job starts, you will have an interactive shell prompt on a compute node. Then type:

module load matlab
matlab -nodisplay

Parallel MATLAB

The Parallel Computing Toolbox (PCT) lets you solve computationally and data-intensive problems using multicore processors. High-level constructs, parallel for-loops, special array types, and parallelized numerical algorithms let you parallelize MATLAB applications without MPI programming. Under MATLAB versions 8.4 and earlier, this toolbox provides 12 workers (MATLAB computational engines) to execute applications locally on a single multicore node of the Campus Cluster. Under MATLAB version 8.5 and later, the number of workers available is equal to the number of cores on a single node (up to a maximum of 512). See MATLAB Errors for error messages generated when violating this limit.

When multiple parallel MATLAB jobs are submitted on the Campus Cluster, a race condition can occur if two or more jobs start at the same time and try to write temporary MATLAB job information to the same location. This race condition can cause one or more of the parallel MATLAB jobs to fail to use the parallel functionality of the toolbox. See MATLAB Errors for error messages generated when this occurs. Note that non-parallel MATLAB jobs do not suffer from this race condition.

To avoid this behavior, the start times of the parallel MATLAB jobs can be staggered by submitting each subsequent job to the batch system with the --dependency=after:JobID option (see the Job Dependencies section for more information about this option).

sbatch parallel.ML-job.sbatch
sbatch --dependency=after:JobID.01 parallel.ML-job.sbatch
sbatch --dependency=after:JobID.02 parallel.ML-job.sbatch
...
sbatch --dependency=after:JobID.NN parallel.ML-job.sbatch

Note

The MATLAB Distributed Computing Server (MDCS) is not installed on the Campus Cluster because the latest versions of MDCS are not covered under the campus concurrent license. Therefore, MATLAB jobs are restricted to the parallel computing functionality of MATLAB’s Parallel Computing Toolbox.

The UI WebStore offers MDCS for release 2010b - you can contact them directly at webstore@illinois.edu for information and download instructions (for use with release 2010b only - it is not compatible for use with other versions).

MATLAB matlabpool is no longer available

The matlabpool function is not available in MATLAB version 8.5 (R2015a) and later. The parpool function should be used instead. Additional information can be found in the Parallel Computing Toolbox release notes.

MATLAB Errors

The following are some example errors encountered when running MATLAB on the Campus Cluster using MATLAB versions <= 8.4 and versions >= 8.5.

Trying to start a matlabpool (or parpool) with more than 12 workers

  • MATLAB version <= 8.4: Error message generated when trying to start a matlabpool with more than 12 workers

    matlabpool('open', 24)
    
    >> Starting matlabpool using the 'local' profile ... stopped.
    
    Error using matlabpool (line 144)
    Failed to open matlabpool. (For information in addition to the causing error,
    validate the profile 'local' in the Cluster Profile Manager.)
    
    Caused by:
        Error using distcomp.interactiveclient/start (line 88)
        Failed to start matlabpool.
        This is caused by:
        You requested a minimum of 24 workers, but only 12 workers are allowed
        with the Local cluster.
    
  • MATLAB version >= 8.5: Error message generated when trying to start a parpool with more than 12 workers

    parpool('local', 24)
    
    >> Starting parallel pool (parpool) using the 'local' profile ...
    
    Error using parpool (line 103)
    You requested a minimum of 24 workers, but the cluster "local" has the
    NumWorkers property set to allow a maximum of 12 workers. To run a
    communicating job on more workers than this (up to a maximum of 512 for the
    Local cluster), increase the value of NumWorkers property for the cluster.
    The default value of NumWorkers for a Local cluster is the number of cores on
    the local machine.
    

Trying to start a matlabpool (or parpool) with 12 workers using 2 nodes and 6 ppn/node

  • MATLAB version <= 8.4: Error message generated when trying to start a matlabpool with 12 workers using 2 nodes and 6 ppn/node

    matlabpool('open', 12)
    
    >> Starting matlabpool using the 'local' profile ...
    Error using matlabpool (line 148)
    Failed to start a parallel pool. (For information in addition to the causing
    error, validate the profile 'local' in the Cluster Profile Manager.)
    
    Caused by:
        Error using parallel.internal.pool.InteractiveClient/start (line 326)
        Failed to start pool.
            Error using parallel.Job/submit (line 304)
            You requested a minimum of 12 workers, but the cluster "local" has the
            NumWorkers property set to allow a maximum of 6 workers. To run a
            communicating job on more workers than this (up to a maximum of 12 for
            the Local cluster), increase the value of the NumWorkers property for
            the cluster. The default value of NumWorkers for a Local cluster is the
            number of cores on the local machine.
    
  • MATLAB version >= 8.5: Error message generated when trying to start a parpool with 12 workers using 2 nodes and 6 ppn/node

    parpool('local', 12)
    
    >> Starting parallel pool (parpool) using the 'local' profile ...
    
    Error using parpool (line 103)
    You requested a minimum of 12 workers, but the cluster "local" has the
    NumWorkers property set to allow a maximum of 6 workers. To run a communicating
    job on more workers than this (up to a maximum of 512 for the Local cluster),
    increase the value of the NumWorkers property for the cluster. The default
    value of NumWorkers for a Local cluster is the number of cores on the local
    machine.
    

When 2 or more parallel MATLAB jobs start at the same time (see the Parallel MATLAB section for details)

  • MATLAB version <= 8.4:

    • Example 1

      Error using matlabpool (line 144)
      Failed to open matlabpool. (For information in addition to the causing error,
      validate the profile 'local' in the Cluster Profile Manager.)
      
      Caused by:
          Error using distcomp.interactiveclient/start (line 88)
          Failed to start matlabpool.
          This is caused by:
          A communicating job must have a single task defined before submission.
      
    • Example 2

      Error using matlabpool (line 144)
      Failed to open matlabpool. (For information in addition to the causing error,
      validate the profile 'local' in the Cluster Profile Manager.)
      
      Caused by:
          Error using distcomp.interactiveclient/start (line 88)
          Failed to start matlabpool.
          This is caused by:
          Can't write file
          /home//.matlab/local_cluster_jobs/R2012a/Job2.in.mat.
      
  • MATLAB version >= 8.5:

    • Example 1

      >> Starting parallel pool (parpool) using the 'local' profile ...
      Error using parpool (line 103)
      Failed to start a parallel pool. (For information in addition to the causing
      error, validate the profile 'local' in the Cluster Profile Manager.)
      
      Caused by:
          Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
          667)
          Failed to start pool.
              Error using parallel.Job/createTask (line 277)
              Only one task may be created on a communicating Job.
      
    • Example 2

      >> Starting parallel pool (parpool) using the 'local' profile ...
      Error using parpool (line 103)
      Failed to start a parallel pool. (For information in addition to the causing
      error, validate the profile 'local' in the Cluster Profile Manager.)
      
      Caused by:
          Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
          667)
          Failed to start pool.
              Error using parallel.Cluster/createCommunicatingJob (line 92)
              The storage metadata file is corrupt. Please delete all files in the
              JobStorageLocation and try again.
      

When all of the MATLAB licenses are unavailable (in use)

MATLAB version <= 8.4 and >= 8.5

>> Starting [matlabpool]/[parallel pool (parpool)] using the 'local' profile ... License checkout failed.
License Manager Error -4
Maximum number of users for Distrib_Computing_Toolbox reached.
Try again later.
To see a list of current users use the lmstat utility or contact your License Administrator.

Troubleshoot this issue by visiting:
http://www.mathworks.com/support/lme/R####x/4

Diagnostic Information:
Feature: Distrib_Computing_Toolbox
License path: /home//.matlab/R####x_licenses:/usr/local/MATLAB/R####x/licenses/license.dat:/usr/local/MATLAB/R####x/licenses/network.lic
Licensing error: -4,132.
Error using gcp (line 45)
Unable to checkout a license for the Parallel Computing Toolbox.

R####x corresponds to the MATLAB release name associated with the version (e.g., version 8.5 = R2015a).

Running Mathematica Batch Jobs

Standard batch job

A sample batch script that runs a Mathematica script is available in /projects/consult/slurm/mathematica.sbatch. You can copy and modify this script for your own use. Submit the job with:

[cc-login1 ~]$ sbatch mathematica.sbatch

In an interactive batch job

For the GUI (which will display on your local machine), use the --x11 option with the srun command. (Replace account_name with the name of an account available to you. If you don’t know the account(s) available to you, ask your technical representative or submit a support request.)

srun -A account_name --x11 --export=All --time=00:30:00 --nodes=1 --ntasks-per-node=16 --partition=secondary --pty /bin/bash

Once the batch job starts, you will have an interactive shell prompt on a compute node. Then type:

module load mathematica
mathematica
  • An X-Server must be running on your local machine with X11 forwarding enabled within your SSH connection in order to display X-Apps, GUIs, and so on, back on your local machine.

  • Generally, users on Linux-based machines only have to enable X11 forwarding by passing an option (-X or -Y) to the SSH command.

  • Users on Windows machines will need to ensure that their SSH client has X11 forwarding enabled and that an X-Server is running.

  • A list of SSH clients (which includes a combo packaged SSH client and X-Server) can be found in the SSH section.

  • Additional information about running X applications can be found on the Using the X Window System page.

For the command line interface:

srun -A account_name --export=All --time=00:30:00 --nodes=1 --ntasks-per-node=16 --partition=secondary --pty /bin/bash

(Replace account_name with the name of an account available to you. If you don’t know the account(s) available to you, ask your technical representative or submit a support request.)

Once the batch job starts, you will have an interactive shell prompt on a compute node. Then type:

module load mathematica
math