Frequently Asked Questions
The Campus Cluster FAQ page includes frequently asked questions for investors, users, and Research Computing as a Service (RCaaS).
ICCPv3 vs ICCPv4
What are the biggest changes between ICCPv3 and ICCPv4?
Login authentication now includes Duo.
Changes to the installed modules.
“secondary-Eth” queue is now just “secondary”.
Changes to storage areas, including that the home directory quota has increased to 100GB.
When I log in with my SSH client I get an error message like: ‘Unable to negotiate cipher…’
This error message indicates an SSH client compatibility issue with the updated system. If you are using an older version of an SSH client, please update it. See SSH Clients for SSH client options.
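If you are not sure which client version you are running, an OpenSSH client on your local machine reports it with:
ssh -V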
What happened to the secondary-Eth queue?
The ICCPv3 “secondary-Eth” queue is now the ICCPv4 “secondary” queue.
In ICCPv3, the compute nodes associated with the “secondary” queue were interconnected via InfiniBand (IB), and the compute nodes associated with the “secondary-Eth” queue were interconnected via Ethernet. In ICCPv4, the InfiniBand-interconnected queue has been removed.
Access and Accounts
I am in multiple groups and want to change my default group
Your default group is determined by the group ownership of the home directory. Use the following command to change the group of the home directory. Replace defgroupname with the name of the group you want to be your default.
chgrp defgroupname $HOME
Log off and log back on for the new default group to take effect.
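For example (my_research_group is a placeholder; substitute one of your own groups):
id -Gn                         # list all of the groups you belong to
chgrp my_research_group $HOME  # set the home directory's group
ls -ld $HOME                   # confirm the new group ownership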
Why does my SFTP connection attempt fail with error message: ‘Received message too long 1751714304’?
This error is usually caused by commands in a shell run-control file (.bashrc, .profile, .cshrc, etc.) that produce output to the terminal. To resolve this, place any commands that will produce output in a conditional statement that is executed only if the shell is interactive.
For example, in the .bashrc file:
if tty -s; then
    # commands that produce terminal output go here, for example:
    echo "interactive shell"
fi
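An equivalent guard, shown here only as a sketch, tests the shell's interactive flag instead of the terminal:
case $- in
    *i*)
        # commands that produce terminal output go here
        ;;
esac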
I have problems transferring data to my home directory on the Campus Cluster – I get 0 byte files, partial files, or the files do not transfer at all.
This is often due to being over quota in your home directory. Check your usage with the quota command.
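For example, the standard Linux quota and du tools can show your usage and point to the largest items in your home directory (output formats vary):
quota
du -sh ~/* ~/.[!.]* 2>/dev/null | sort -h | tail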
System Policies
Are there any user disk quotas in place?
See Storage Areas for a breakdown of the quotas for each filesystem directory.
I accidentally deleted files in my home directory. Is there any way to get them back?
Nightly snapshots of your home directory are available for the last 30 days. See the Storage and Data Guide for more information.
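As a hypothetical sketch only, assuming snapshots are exposed under a .snapshots directory in your home (the Storage and Data Guide documents the actual location and naming), a restore might look like:
ls ~/.snapshots
cp ~/.snapshots/<snapshot_name>/myfile.txt ~/myfile.txt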
Is there a disk purge policy in place?
See Storage Areas for the purge policy of each filesystem directory.
Programming, Software, and Libraries
Is there a math library available on the Campus Cluster?
The Intel Math Kernel Library (MKL) is available as part of the Intel Compiler Suite. The OpenBLAS library is also available. See the software section of the User Guide for details.
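As a hedged sketch of linking against these libraries (module and file names are placeholders; check module avail for the names in use on the cluster):
module load intel              # placeholder module name
icc -mkl mysolver.c -o mysolver
module load openblas           # placeholder module name; assumes it adds OpenBLAS to the compiler's search paths
gcc mysolver.c -o mysolver -lopenblas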
Is there information on running MATLAB on the Campus Cluster?
See the Software section of the User Guide for information on running MATLAB.
What is the process for installing investor-specific software on the Campus Cluster?
See Investor-Specific Software Installation for recommended guidelines to follow when installing software.
How do I install R packages specific to my needs that are not available in the Campus Cluster installation?
See R on the Campus Cluster for information about installing R add-on packages.
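A common pattern, shown here only as a sketch (the page linked above describes the supported approach), is to install packages into a personal library under your home directory:
module load R                          # placeholder module name
mkdir -p ~/Rlibs                       # personal library location (placeholder path)
export R_LIBS_USER=~/Rlibs             # add this line to ~/.bashrc to make it persistent
Rscript -e 'install.packages("data.table", repos="https://cloud.r-project.org")'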
How can I access numeric and scientific Python modules on the Campus Cluster?
Python versions 2 and 3 are available.
See the output of one of the following commands for the specific modules.
module avail python
module avail anaconda
Load the needed module into your environment with the following command. Replace modulefile_name with the name of the module you want to load.
module load modulefile_name
Note: Use the command python3 for Python 3.
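A typical session looks like the following; the module name is a placeholder, so pick one from the module avail output above:
module load anaconda3
python3 --version
python3 -c 'import numpy; print(numpy.__version__)'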
How do I enable syntax highlighting in the vi editor on the Campus Cluster?
The default vi/vim installed with the OS on the Campus Cluster does not include syntax highlighting. You can load a newer version that includes syntax highlighting into your environment with the following command.
module load vim
Running Jobs
Why is the wait time of my batch job so long?
There can be various reasons that contribute to job wait times.
If your job is in your primary queue:
All nodes in your investor group are in use by other primary queue jobs. In addition, since the Campus Cluster allows users access to any idle nodes via the secondary queue, jobs submitted to a primary queue could have a wait time of up to the secondary queue maximum wall time of 4 hours.
If your job is in the secondary queue:
Since this queue makes use of idle nodes not being used by investor primary queues, it is almost entirely opportunity scheduled. This means that secondary jobs will only run if there is a big enough scheduling hole for the number and type of nodes requested.
Preventative Maintenance (PM) on the Campus Cluster is generally scheduled quarterly on the third Wednesday of the month. If the wall time requested by a job will not allow it to complete before an upcoming PM, the job will not start until after the PM.
Your job has requested a specific type of resource—for example, nodes with 96GB memory.
Your job has requested a combination of resources that are incompatible—for example, 96GB memory and the cse queue. In this case, the job will never run.
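To see the scheduler's own reason for a pending job and its estimated start time, you can query Slurm directly (the job ID below is a placeholder):
squeue -u $USER --start        # estimated start times for your queued jobs
scontrol show job 123456       # full job record, including the pending Reason field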
How can I get more than four hours of wall clock time in my batch jobs?
The secondary queue is the default queue on the cluster; batch jobs that do not specify a queue name are routed to it. This queue has a maximum wall time of 4 hours. Specify your primary queue using the --partition option to sbatch for access to longer batch job wall times.
You can view the maximum wall time for all queues on the cluster with one of the following commands.
sinfo -a --format="%.16R %.4D %.14l"
qstat -q
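As a minimal sketch, a batch script that targets a primary queue instead of the default looks like the following; the partition name, wall time, and program are placeholders:
#!/bin/bash
#SBATCH --partition=my_primary_queue   # your investor queue's name
#SBATCH --time=24:00:00                # wall time beyond the 4-hour secondary limit
#SBATCH --nodes=1
#SBATCH --ntasks=1

./my_program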
What does the error in my batch job: ‘=>> PBS: job killed: swap rate due to memory oversubscription is too high Ctrl-C caught… cleaning up processes’ mean?
This indicates that your job used more memory than was available on the node(s) allocated to the batch job. If possible, you can submit jobs to nodes with a larger amount of memory; see Running Jobs for details. For MPI jobs, you can also resolve the issue by running fewer processes on each node (so each MPI process has more memory) or using more MPI processes in total (so each MPI process needs less memory).
What does the error in my batch job: ‘Job exceeded a memory resource limit (vmem, pvmem, etc.). Job was aborted’ mean?
This indicates that your job used more memory than was available on the node(s) allocated to the batch job. If possible, you can submit jobs to nodes with a larger amount of memory; see Running Jobs for details. For MPI jobs, you can also resolve the issue by running fewer processes on each node (so each MPI process has more memory) or using more MPI processes in total (so each MPI process needs less memory).
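For example, as a rough sketch, the same total number of MPI ranks can be spread over more nodes so that each rank has more memory available (the counts and program name are placeholders):
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=16   # 64 ranks total over 4 nodes, rather than 32 per node on 2 nodes
#SBATCH --time=04:00:00

srun ./my_mpi_program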
I need to run a large number of single-core (serial) jobs and the jobs are moving very slowly through the batch system.
See Running Serial Jobs for information on combining multiple serial processes within a single batch job to help expedite job turnaround time.
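As one rough illustration of the idea (the guide above describes the recommended patterns), several independent serial runs can share a single batch job by launching them in the background and waiting for all of them; the program and input names are placeholders:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=04:00:00

./myserial input1 > out1 &
./myserial input2 > out2 &
./myserial input3 > out3 &
./myserial input4 > out4 &
wait   # the job ends only after all background runs finish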
I get the following error when I run multi-node batch jobs: ‘Permission denied (publickey,gssapi-keyex,gssapi-with-mic).’
Possible causes:
When the file $HOME/.ssh/authorized_keys has been removed, incorrectly modified, or zeroed out.
To resolve, remove or rename the .ssh directory (see the example at the end of this answer), and then log off and log back on. This will regenerate a default .ssh directory along with its contents. If you need to add an entry to $HOME/.ssh/authorized_keys, make sure to leave the original entry in place.
When group writable permissions are set for the user’s home directory.
[golubh1 ~]$ ls -ld ~jdoe
drwxrwx--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
To resolve, remove the group writable permissions:
[golubh1 ~]$ chmod g-w ~jdoe
[golubh1 ~]$ ls -ld ~jdoe
drwxr-x--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
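For the first cause, a simple way to rename the existing directory out of the way (a fresh default .ssh is recreated the next time you log in) is:
mv ~/.ssh ~/.ssh.bak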
How can I move my queued batch job from one queue to another?
Use the command scontrol update with the following syntax. Replace queue_name with the name of the queue that you want to move the job to.
scontrol update jobid=[JobID] partition=[queue_name]
Note: the operation will not be permitted if the resources requested do not fit the queue limits.
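For example (the job ID and queue name are placeholders):
squeue -u $USER                                           # find the ID of your pending job
scontrol update jobid=123456 partition=my_primary_queue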
What is the maximum wall time on the secondary queue?
See Secondary Queues.