Installed Software

DeltaAI software is provisioned using the HPE Cray Programming Environment (CPE). Select NVIDIA NGC containers are made available (see Containers) and are periodically updated from the NVIDIA NGC site. An automated list of available software can be found on the ACCESS website.

Modules/Lmod

DeltaAI provides HPE/Cray modules and compilers. The functional programming environments are PrgEnv-gnu and PrgEnv-cray. The default environment loads PrgEnv-gnu.

Use module spider package_name to search for software in Lmod and see the steps to load it in your environment.

See also: User Guide for Lmod.

Please submit a support request for help with software not currently installed on DeltaAI. For general installation requests, the DeltaAI project office will review requests for broad use and installation effort.

Python

Note

When submitting support requests for python, please provide the following and understand that DeltaAI support staff time is a finite resource while python developments (new software and modules) are growing at nearly infinite velocity:

  • Python version or environment used (describe fully, with the commands needed to reproduce)

  • Error output or log from what went wrong (screenshots are more difficult to work with than text data)

  • Pertinent URLs describing what you were following/attempting (if applicable), note that URL recipes specific to vendors may be difficult to reproduce when not using their cloud resources (Google Colab, for example)

  • DeltaAI’s architecture is aarch64, and many python packages are not built for it. If you cannot find a python wheel for aarch64, building from source may be the only option. There is no guarantee your desired software can be ported to the new architecture with minimal effort.

  • TensorFlow is only supported from NVIDIA’s NGC container. Python software stacks that require TensorFlow may be difficult (or impossible) to adapt to DeltaAI. See the notes about it at TensorFlow on DeltaAI.
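
Before attempting a source build, it can help to check whether a wheel you found is even built for aarch64. The sketch below is a minimal illustration that inspects only the platform tag at the end of a wheel filename; the filenames are hypothetical, and the check ignores the Python version, ABI, and glibc level, which pip also verifies.

```python
import platform


def wheel_platform_tags(filename: str) -> list[str]:
    """Return the platform tags encoded in a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    # Layout: name-version[-build]-python-abi-platform.whl (platform is last).
    platform_field = filename[: -len(".whl")].split("-")[-1]
    # Several platform tags can be packed into one field, joined by dots.
    return platform_field.split(".")


def runs_on(filename: str, machine: str = "aarch64") -> bool:
    """True if the wheel is pure Python ('any') or targets the given CPU."""
    return any(
        tag == "any" or tag.endswith("_" + machine)
        for tag in wheel_platform_tags(filename)
    )


# Hypothetical filenames, for illustration only.
candidates = [
    "example_pkg-1.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
    "example_pkg-1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "example_pkg-1.0-py3-none-any.whl",
]
print("this machine:", platform.machine())
for name in candidates:
    print(runs_on(name), name)
```

A result of False for every published wheel of a package suggests that a source build is the remaining option.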

On DeltaAI, you may install your own python software stacks, as needed. There are choices when customizing your python setup. If you anticipate maintaining multiple python environments or installing many packages, you may want to target a filesystem with more quota space (not $HOME) for your environments. /scratch or /projects may be more appropriate in that case. You may use any of these methods with any of the python versions or instances described below (or you may install your own python versions):

  • venv (python virtual environment)

    You can name environments and keep multiple environments per python version or instance. pip installs are local to the environment. You specify the path when you create the environment: python -m venv /path/to/env.

  • conda (or miniforge) environments

    Similar to venv but with more flexibility, see this comparison table. See also the miniforge environment option: miniforge. pip and conda installs are local to the environment and the location defaults to $HOME/.conda. You can override the default location in $HOME by using the --prefix syntax: conda create --prefix /path/to/env. You can also relocate your .conda directory to your project space, which has a larger quota than your home directory.

  • pip3: pip3 install --user <python_package>

    CAUTION: Python modules installed this way land in $HOME/.local/ and are shared by every python with the same version number. This can create incompatibilities between containers, python venvs, and conda environments that have a common python version number. You can work around this by pointing the PYTHONUSERBASE environment variable at a different directory. That also allows for shared pip installs if you choose a group-shared directory.

  • conda-env-mod Lmod module generator from Purdue

    The conda-env-mod script will generate a python module you can load or share with your team. This makes it simpler to manage multiple python scenarios that you can activate and deactivate with module commands.

  • pyenv python version management

    Pyenv helps you manage multiple python versions. You can also use more than one python version at once in a project using pyenv.

Note

The NVIDIA NGC Containers on DeltaAI provide optimized python frameworks built for DeltaAI’s H100 GPUs. DeltaAI staff recommend using an NGC container when possible with the GPU nodes (or use one of the conda or miniforge modules).

Python (a recent or latest version)

If you don’t need all the extra modules provided by Anaconda, use the basic python installation provided by Cray or install your own for aarch64. You can add modules via pip3 install --user <modulename>, set up virtual environments, and customize, as needed, for your workflow starting from a smaller installed base of python than Anaconda.

$ module load cray-python
$ which python
/opt/cray/pe/python/3.11.7/bin/python

cray-python includes numpy, mpi4py, and pandas.

miniforge3

python/miniforge3_pytorch

Use python from the python/miniforge3_pytorch module if you need some of the modules provided by conda-forge in your python workflow. See the Managing Environments section of the conda getting started guide to learn how to customize conda for your workflow and add extra python modules to your environment.

Note

If you use conda with NGC containers, take care to use python from the container and not python from conda or one of its environments. The container’s python should be first in $PATH. You may --bind the conda directory or other paths into the container so that you can start your conda environments with the container’s python (/usr/bin/python).

The Anaconda archive contains previous Anaconda versions. The bundles are not small, but using one from Anaconda will ensure that you get software that was built to work together. If you require an older version of a python lib/module, NCSA staff suggest looking back in time at the Anaconda site (though this will be a limited timeline because the Grace Hopper aarch64 architecture in DeltaAI is new).

Python Environments with conda

See the Conda configuration documentation if you want to disable automatic conda environment activation.

Note

When using your own custom conda environment with a batch job, submit the batch job from within the environment and do not add conda activate commands to the job script; the job inherits your environment.

Batch Jobs

Batch jobs will honor the commands you execute within them. Purge/unload/load modules as needed for that job.

A clean slate might resemble (user has a conda init clause in bashrc for a custom environment):

conda deactivate
conda deactivate  # just making sure
module reset      # load the default DeltaAI modules

conda activate base
# commands to load modules and activate environs such that your environment is active before
# you use slurm ( do not include conda activate commands in the slurm script )

sbatch myjob.slurm  # or srun or salloc

Non-python/conda HPC users would see per-job stderr from the conda deactivate above (user has never run conda init bash):

[arnoldg@gh-login03 ~]$ conda deactivate
bash: conda: command not found
[arnoldg@gh-login03 ~]$

# or

[arnoldg@gh-login03 ~]$ conda deactivate

CommandNotFoundError: Your shell has not been properly configured to use 'conda deactivate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - tcsh
  - zsh

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

Extending a System Python Module with Your Own Packages

The python/miniforge3_* modules under /sw/user/python/ are a shared, read-only stack: a tested combination of Python, CUDA, MPI, PyTorch, and related libraries that every user gets identically. You cannot install into them, but you do not need to — the recommended workflow is to layer your own packages on top of the module without modifying it.

Each module sets PYTHONNOUSERSITE=1 at load time, which disables Python’s user-site-packages mechanism (the ~/.local/lib/pythonX.Y/site-packages/ directory) for the duration of the load. Without that, a stale ~/.local package from an unrelated module or Python version could silently shadow the system stack. The recommended layering recipe is the venv overlay below — it adds packages to sys.path directly, so PYTHONNOUSERSITE does not affect it.
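
As a rough illustration of what a venv overlay can look like, the sketch below creates a virtual environment with system site packages enabled, so the interpreter that creates it (for example, the one from a loaded module) stays visible underneath. It assumes the overlay is an ordinary venv; the documented DeltaAI recipe may differ, and the temporary directory is only a stand-in for a real location such as a project directory.

```python
import tempfile
import venv
from pathlib import Path


def create_overlay(env_dir: Path) -> Path:
    """Create a venv that also sees the packages of the Python creating it.

    Returns the path of the pyvenv.cfg file written into the new venv.
    """
    # with_pip=False keeps the sketch fast; pass with_pip=True for a private pip.
    builder = venv.EnvBuilder(system_site_packages=True, with_pip=False)
    builder.create(env_dir)
    return env_dir / "pyvenv.cfg"


# A temporary directory stands in for a real, persistent location.
with tempfile.TemporaryDirectory() as tmp:
    cfg_text = create_overlay(Path(tmp) / "overlay").read_text()

print(cfg_text)
```

The command-line equivalent is python -m venv --system-site-packages /path/to/env. Packages installed after activating such a venv go into the venv's own site-packages directory rather than into the read-only module.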

Warning

A consequence of PYTHONNOUSERSITE=1 is that pip install --user appears to succeed but the installed package cannot be imported:

$ pip install --user humanize
Successfully installed humanize-4.15.0
$ python -c "import humanize"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'humanize'

The files exist on disk under ~/.local/..., but site.ENABLE_USER_SITE is False so Python never adds ~/.local to sys.path.
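
The effect can be confirmed without installing anything. The sketch below starts a child interpreter with and without PYTHONNOUSERSITE and reports site.ENABLE_USER_SITE; it is a diagnostic aid only and changes nothing in your account.

```python
import os
import subprocess
import sys


def user_site_flag(no_user_site: bool) -> str:
    """Return site.ENABLE_USER_SITE as seen by a fresh child interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONNOUSERSITE", None)
    if no_user_site:
        env["PYTHONNOUSERSITE"] = "1"
    probe = "import site; print(site.ENABLE_USER_SITE)"
    done = subprocess.run(
        [sys.executable, "-c", probe],
        env=env, capture_output=True, text=True, check=True,
    )
    return done.stdout.strip()


with_flag = user_site_flag(True)
without_flag = user_site_flag(False)
print("PYTHONNOUSERSITE=1 ->", with_flag)
print("variable unset     ->", without_flag)
```

With the variable set, the child always reports False. With it unset, the result depends on the interpreter: a stock install normally reports True, while a venv created without system site packages still reports False.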

Alternatives

Use these only when the venv overlay above does not fit your case.

unset PYTHONNOUSERSITE for plain pip install --user

The shortest workaround if you want pip install --user to behave as it does on a stock Python install — clear the variable after loading the module:

module load python/miniforge3_pytorch/2.11.0
conda activate base
unset PYTHONNOUSERSITE
pip install --user humanize lz4

This re-enables ~/.local/lib/pythonX.Y/site-packages/ as a user-site directory, so import will find packages installed there.

Warning

~/.local is shared across every Python module and version on the system. A package installed against one module’s Python can silently shadow the system stack when you later load a different module. Use this only if you load a single module and never mix it with others; for any other workflow, prefer the venv overlay above or the redirected-PYTHONUSERBASE recipe below.

conda create --prefix for a self-contained env

Use this when you need a different Python version from the base, or a fully self-contained environment you can move, share, or pin independently of the system module:

module load python/miniforge3_pytorch/2.11.0
conda activate base
conda create --prefix /projects/<account>/$USER/conda/myenv python=3.11 numpy
conda activate /projects/<account>/$USER/conda/myenv

This is roughly 10× slower to create (~50 s vs ~5 s) and 4× heavier on disk per env than the venv overlay, and you do not inherit the base’s PyTorch stack — anything you need must be installed into the new env.

Warning

Vet your ~/.condarc before running conda create --prefix. A leaked pkgs_dirs: entry from a previous project’s build will silently disable conda’s hardlink reuse and inflate every env you create by another factor of four. The next section ships a clean template that prevents this.

pip install --user with a redirected PYTHONUSERBASE

Use this only if you specifically want pip install --user semantics (for example, a workflow you’re porting from another site that expects ~/.local layout). You must both unset PYTHONNOUSERSITE and redirect PYTHONUSERBASE to a per-module-version directory so packages don’t leak across module loads:

module load python/miniforge3_pytorch/2.11.0
conda activate base
unset PYTHONNOUSERSITE
export PYTHONUSERBASE=$HOME/.local/deltaai/pytorch-2.11.0
pip install --user humanize lz4
python -c "import humanize, lz4; print('ok')"

Re-export PYTHONUSERBASE (with a different per-module-version path) in any future shell that loads a different module. Without the per-module isolation, packages built against one module’s Python version may silently break when you load another. Python’s site.py auto-discovers PYTHONUSERBASE, so no PYTHONPATH is needed at runtime.
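
To see where a redirected PYTHONUSERBASE sends pip install --user, you can ask the site module directly. The sketch below runs a child interpreter with the variable set and prints the resulting user base and user site-packages directory; the path is a hypothetical example and is never created.

```python
import os
import subprocess
import sys


def user_site_for(user_base: str) -> tuple[str, str]:
    """Return (user base, user site-packages) for a given PYTHONUSERBASE."""
    env = dict(os.environ)
    env.pop("PYTHONNOUSERSITE", None)
    env["PYTHONUSERBASE"] = user_base
    probe = (
        "import site; "
        "print(site.getuserbase()); "
        "print(site.getusersitepackages())"
    )
    done = subprocess.run(
        [sys.executable, "-c", probe],
        env=env, capture_output=True, text=True, check=True,
    )
    base, site_dir = done.stdout.splitlines()
    return base, site_dir


# Hypothetical per-module-version directory, for illustration only.
demo_base = "/tmp/deltaai-demo/pytorch-2.11.0"
base, site_dir = user_site_for(demo_base)
print("user base:", base)
print("user site:", site_dir)
```

Running the same probe after loading a different module, with a different PYTHONUSERBASE, should show a different site directory, which is what keeps the two sets of packages apart.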

Custom Recipes for Python

Custom recipes to “install foo” with Python are available on the system at /sw/user/python/. The README… files describe the recipes. Created to address past user issues, these recipes can be useful references while you work on your own installations. Topics include, but are not limited to:

  • datascience

  • pytorch.2.5.0

  • tensorflowcpu

  • tensorflow+cuda

  • cuquantum

  • torchgeometric+sparse

  • vLLM

  • triton-lang

PyTorch

Information on how to set up and run PyTorch.

Quantum Simulation Resources

DeltaAI provides GPU-accelerated quantum simulation frameworks optimized for the GH200 Grace Hopper superchip architecture. Each GH200 GPU has 120 GB of HBM3 memory (about 97 GB usable after driver overhead), enabling state vector simulations of up to 32 qubits at complex128 (33 qubits at complex64) on a single GPU. Multi-GPU and multi-node workflows extend this to 37+ qubits.

Note

All quantum modules are conda-based environments. In SLURM batch scripts, include both the module load and conda activate commands to ensure the environment is fully initialized. See the batch script examples in each section below.

NVIDIA CUDA Quantum (CUDA-Q)

CUDA-Q is NVIDIA’s framework for hybrid quantum-classical computing. The DeltaAI module includes a native MPI communication plugin compiled against Cray MPICH for multi-node quantum simulation over HPE Slingshot 11.

Loading CUDA-Q:

$ module load python/cuda_quantum/0.14.0
$ conda activate base
$ python -c "import cudaq; print(cudaq.__version__)"
0.14.0

Simulation backends:

  • nvidia — single-GPU state vector simulation (default)

  • nvidia with option="mgpu,fp64" — multi-GPU distributed state vector via MPI

  • nvidia with option="mqpu" — circuit batching across multiple GPUs (no MPI needed)

  • tensornet — tensor network simulation for large structured circuits

Example: Single-GPU Bell state
import cudaq

cudaq.set_target("nvidia")

@cudaq.kernel
def bell():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])

result = cudaq.sample(bell, shots_count=1000)
print(result)
# Expected: roughly 50% |00> and 50% |11>

Example: Single-GPU SLURM batch script
#!/bin/bash
#SBATCH --account=<account_name>
#SBATCH --partition=ghx4
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --time=00:10:00
#SBATCH --job-name=cudaq-single

module load python/cuda_quantum/0.14.0
conda activate base
python -u my_circuit.py

Example: Multi-GPU distributed state vector (4 GPUs, 34+ qubits)
# multi_gpu_cudaq.py
import cudaq

cudaq.mpi.initialize()
cudaq.set_target("nvidia", option="mgpu,fp64")

N = 34  # ~256 GiB (complex128, from fp64) — ~64 GiB per GPU on 4 GPUs

@cudaq.kernel
def ghz(n: int):
    qubits = cudaq.qvector(n)
    h(qubits[0])
    for i in range(1, n):
        cx(qubits[0], qubits[i])

result = cudaq.sample(ghz, N, shots_count=1000)

if cudaq.mpi.rank() == 0:
    print(f"GHZ({N}) distributed across {cudaq.mpi.num_ranks()} GPUs")

cudaq.mpi.finalize()

Matching SLURM batch script:

#!/bin/bash
#SBATCH --account=<account_name>
#SBATCH --partition=ghx4
#SBATCH --nodes=1
#SBATCH --gpus=4
#SBATCH --ntasks=4
#SBATCH --time=00:30:00
#SBATCH --job-name=cudaq-mgpu

module load python/cuda_quantum/0.14.0
conda activate base

srun python -u multi_gpu_cudaq.py

For multi-node jobs (2+ nodes), change the SLURM directives to --nodes=2 --gpus-per-node=4 --ntasks-per-node=4. The srun command is the same — CUDA-Q’s nvidia-mgpu plugin enumerates the visible GPUs on each node and assigns one per rank internally. Do not pin ranks with CUDA_VISIBLE_DEVICES=$SLURM_LOCALID here — pre-pinning before cudaq.mpi.initialize() prevents the plugin from selecting a GPU and the launch segfaults on all ranks at multi-node scale.


For more information, see the CUDA-Q documentation.

NVIDIA cuQuantum SDK

The NVIDIA cuQuantum SDK provides GPU-accelerated libraries for quantum simulation: cuStateVec (state vector), cuTensorNet (tensor networks), and cuDensityMat (density matrices). For Qiskit-based workflows, see Qiskit Aer — Aer is provided as a standalone module on DeltaAI.

Loading the cuQuantum environment:

$ module load python/miniforge3_cuquantum/26.01.0
$ conda activate base
$ python -c "import cuquantum; print(cuquantum.__version__)"
26.01.0

Available conda sub-environments:

$ conda env list
base                 * /sw/user/python/miniforge3-cuquantum-26.01.0
pennylane-0.44         /sw/user/python/miniforge3-cuquantum-26.01.0/envs/pennylane-0.44

Note

cuQuantum 26.01.0 changed some Python import paths. If upgrading from 24.11.0 or 25.03.0:

  • import custatevec → from cuquantum.bindings import custatevec

  • import cutensornet → from cuquantum.bindings import cutensornet

  • from cuquantum import CircuitToEinsum → from cuquantum.tensornet import CircuitToEinsum

PennyLane

PennyLane is a quantum machine learning framework with auto-differentiation and optimized GPU backends. On DeltaAI, PennyLane is available as a standalone module with source-built Lightning backends compiled against Cray MPICH for multi-node support over Slingshot 11.

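Available Lightning backends (those referenced in this guide): lightning.gpu (GPU state vector, with optional MPI distribution) and lightning.tensor (tensor network).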

Loading PennyLane:

$ module load python/pennylane/0.44
$ conda activate pennylane-0.44
$ python -c "import pennylane as qml; print(qml.__version__)"
0.44.1

Example: Single-GPU Bell state with lightning.gpu
import pennylane as qml
import numpy as np

dev = qml.device("lightning.gpu", wires=2)

@qml.qnode(dev)
def bell():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.probs(wires=[0, 1])

probs = bell()
print(f"|00>={probs[0]:.3f}, |01>={probs[1]:.3f}, "
      f"|10>={probs[2]:.3f}, |11>={probs[3]:.3f}")
# Expected: |00>=0.500, |01>=0.000, |10>=0.000, |11>=0.500

Example: Multi-GPU distributed state vector (4 GPUs, 34+ qubits)
# multi_gpu_pennylane.py
import pennylane as qml
import numpy as np

N = 34  # ~256 GiB (complex128 default) — ~64 GiB per GPU on 4 GPUs
dev = qml.device("lightning.gpu", wires=N, mpi=True)
# For tighter memory budgets, pass c_dtype=np.complex64 — halves
# state vector size at the cost of single-precision amplitudes.

@qml.qnode(dev)
def ghz():
    qml.Hadamard(wires=0)
    for i in range(1, N):
        qml.CNOT(wires=[0, i])
    return qml.probs(wires=range(min(N, 5)))

probs = ghz()
print(f"GHZ({N}) top probabilities: {probs[:3]}")

Matching SLURM batch script:

#!/bin/bash
#SBATCH --account=<account_name>
#SBATCH --partition=ghx4
#SBATCH --nodes=1
#SBATCH --gpus=4
#SBATCH --ntasks=4
#SBATCH --time=00:30:00
#SBATCH --job-name=pl-mgpu

module load python/pennylane/0.44
conda activate pennylane-0.44

srun python -u multi_gpu_pennylane.py

For more information on PennyLane Lightning backends, see the PennyLane Lightning documentation.

Note

Jobs should be submitted from within the active PennyLane environment with the module loaded. See Python Environments with conda for details on conda environments with batch jobs.

Qiskit Aer

Qiskit Aer is the GPU-accelerated circuit simulator for the Qiskit ecosystem. On DeltaAI, Aer 0.17.2 is provided as a standalone module built on the CUDA 13 toolchain with cuStateVec.

Loading Qiskit Aer:

$ module load python/miniforge3_qiskit_aer/2.4.1
$ conda activate base
$ python -c "import qiskit_aer; print(qiskit_aer.__version__)"
0.17.2

The module bundles Qiskit 2.4.1, the Aer GPU backend, a broad slice of the Qiskit ecosystem (qiskit-ibm-runtime, qiskit-algorithms, qiskit-machine-learning, qiskit-optimization, qiskit-experiments, qiskit-dynamics, mthree, qiskit-serverless, and the qiskit-addon-* family), and mpi4py linked to Cray MPICH.

Example: Single-node 4-GPU via cuStateVec blocking (up to ~33 qubits)

The intra-node multi-GPU recipe sets blocking_enable so Aer distributes the state vector across the four GH200 GPUs without MPI:

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

sim = AerSimulator(method='statevector', device='GPU',
                   cuStateVec_enable=True,
                   blocking_enable=True, blocking_qubits=31)

qc = QuantumCircuit(33)
# ... build circuit ...
result = sim.run(qc, shots=1000).result()

The rule of thumb is blocking_qubits = N - 2 — leaves two qubits’ worth of state per shard so the four shards fit across the four GH200 GPUs with workspace to spare. A 33-qubit state vector at complex128 (Aer’s default) is 128 GiB total and completes in ~5 s with this configuration; a 30-qubit circuit completes in ~1.1 s. Lower blocking_qubits to trade speed for memory headroom on the densest circuits.
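
The arithmetic behind the rule of thumb can be written down as a small helper. This is a sketch under one assumption that goes beyond the text: the N - 2 rule for four GPUs is generalized to N - log2(number of GPUs). It needs neither Aer nor a GPU.

```python
import math


def blocking_plan(num_qubits: int, num_gpus: int = 4,
                  bytes_per_amplitude: int = 16) -> tuple[int, float]:
    """Return (blocking_qubits, GiB per shard) for an even split across GPUs.

    Assumption: the 'N - 2 for four GPUs' rule generalizes to
    N - log2(num_gpus); only the four-GPU case comes from the guide.
    """
    if num_gpus < 1 or num_gpus & (num_gpus - 1):
        raise ValueError("num_gpus must be a power of two")
    blocking_qubits = num_qubits - int(math.log2(num_gpus))
    # Each shard holds 2**blocking_qubits amplitudes.
    shard_gib = (2 ** blocking_qubits) * bytes_per_amplitude / 2 ** 30
    return blocking_qubits, shard_gib


for n in (30, 33):
    bq, gib = blocking_plan(n)
    print(f"{n} qubits: blocking_qubits={bq}, {gib:g} GiB per shard")
```

For the 33-qubit case this gives blocking_qubits=31 and 32 GiB per shard, matching the 128 GiB total quoted above.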


Important

Aer’s native MPI-distributed state vector path is broken with the GPU backend on DeltaAI in 0.17.2 — calls to sim.run() segfault when MPI world size ≥ 2. For genuine multi-node distributed state vector simulation, use PennyLane lightning.gpu with mpi=True. For Qiskit-based parameter sweeps across many nodes, use the embarrassingly-parallel mpi4py pattern shown below.

Multi-node pattern: mpi4py parameter sweep

Each MPI rank runs its own independent AerSimulator against a slice of the parameter grid; mpi4py handles work distribution and result gather. Aer’s distribution code is never engaged — the pattern works because the per-rank job is a complete, self-contained simulation.

# mpi_parameter_sweep.py
from mpi4py import MPI
import numpy as np
from qiskit_aer import AerSimulator

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nranks = comm.Get_size()

# Build the (gamma, beta) parameter grid
grid = [(g, b) for g in np.linspace(0.1, 1.5, 10)
               for b in np.linspace(0.1, 1.5, 10)]

# Each rank claims grid[i] where i % nranks == rank
my_indices = list(range(rank, len(grid), nranks))

sim = AerSimulator(method="statevector", device="GPU",
                   cuStateVec_enable=True)

local = []
for idx in my_indices:
    g, b = grid[idx]
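    # build_qaoa_circuit() and expectation() are placeholders for your own
    # circuit builder and post-processing function; they are not defined here.
    qc = build_qaoa_circuit(8, gamma=g, beta=b)
    res = sim.run(qc, shots=2048).result()
    local.append((idx, g, b, expectation(res)))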

# Gather to rank 0 for aggregation
all_results = comm.gather(local, root=0)
if rank == 0:
    flat = [r for chunk in all_results for r in chunk]
    best = max(flat, key=lambda x: x[3])
    print(f"Best (gamma, beta): ({best[1]:.3f}, {best[2]:.3f})")
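
The round-robin slicing used above (grid[i] where i % nranks == rank) can be checked on a laptop without MPI or a GPU. The sketch below reproduces only the index arithmetic for the 100-point grid and an assumed 8 ranks (2 nodes with 4 tasks each), and confirms that every grid point is claimed exactly once.

```python
def indices_for_rank(rank: int, nranks: int, total: int) -> list[int]:
    """Grid indices claimed by one rank under round-robin distribution."""
    return list(range(rank, total, nranks))


total_points = 100   # 10 x 10 (gamma, beta) grid
nranks = 8           # assumed layout: 2 nodes x 4 ranks per node

claimed = [indices_for_rank(r, nranks, total_points) for r in range(nranks)]
sizes = [len(chunk) for chunk in claimed]
flat = sorted(i for chunk in claimed for i in chunk)

print("points per rank:", sizes)
print("each point claimed exactly once:", flat == list(range(total_points)))
```

Because 100 is not a multiple of 8, the first four ranks carry one extra point each; the imbalance never exceeds one work item per rank.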

The matching SLURM script launches one task per GPU. --ntasks-per-node=4 --gpus-per-node=4 places one rank in each of the node’s four NUMA domains (see System Architecture for the GH200 topology), and each rank addresses its local-NUMA GPU via CUDA_VISIBLE_DEVICES=$SLURM_LOCALID. --cpus-per-task=18 claims a quarter of a NUMA’s cores per rank — enough for Aer’s light orchestration-and-MPI CPU side. Raise it (up to 72 per rank) only if a pre- or post-processing step is CPU-heavy.

#!/bin/bash
#SBATCH --account=<account_name>
#SBATCH --partition=ghx4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=18      # 1/4 of a 72-core NUMA per rank
#SBATCH --time=00:15:00
#SBATCH --job-name=aer-mpi-sweep

module load python/miniforge3_qiskit_aer/2.4.1
conda activate base

srun bash -c \
    'CUDA_VISIBLE_DEVICES=$SLURM_LOCALID python -u mpi_parameter_sweep.py'
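
If you prefer to keep the GPU pinning inside the Python script rather than in a bash -c wrapper, the same assignment can be made at the top of the script, before any GPU library is imported. This is a sketch of that alternative for the Aer sweep only; pin_local_gpu is a hypothetical helper name, and the CUDA-Q multi-GPU recipe earlier in this guide should not be pinned this way.

```python
import os


def pin_local_gpu(environ=None):
    """Set CUDA_VISIBLE_DEVICES to this task's SLURM_LOCALID, if present.

    Returns the resulting CUDA_VISIBLE_DEVICES value (None when unset).
    Call this before importing qiskit_aer or any other CUDA-using library.
    """
    env = os.environ if environ is None else environ
    local_id = env.get("SLURM_LOCALID")
    if local_id is not None:
        # Overrides the node-wide device list that SLURM may have exported.
        env["CUDA_VISIBLE_DEVICES"] = local_id
    return env.get("CUDA_VISIBLE_DEVICES")


# Demonstration with a stand-in environment for local task 2 on a 4-GPU node.
fake_env = {"SLURM_LOCALID": "2", "CUDA_VISIBLE_DEVICES": "0,1,2,3"}
print(pin_local_gpu(fake_env))
```

With this in place, the launch line reduces to srun python -u mpi_parameter_sweep.py, assuming the helper is called at the top of that script.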

Multi-GPU quantum scaling

DeltaAI’s pre-installed quantum frameworks support several multi-GPU patterns, each suited to a different workload. This section covers the practical combinations — which framework and backend to choose for distributed state vector circuits, embarrassingly-parallel parameter sweeps, and on-node circuit batching.

State vector memory doubles with each additional qubit. The table below shows the minimum GPU count to hold the state at the default precision (complex128). The Aer blocking pattern shown earlier uses all 4 GPUs of a node by design; PennyLane lightning.gpu and CUDA-Q nvidia mgpu accept any power-of-2 GPU count that fits.

Minimum GPUs to hold the state vector at complex128 (1 node = 4 GPUs)

Qubits   State vector size   Minimum GPUs
30       16 GiB              1
33       128 GiB             2 (1 node)
34       256 GiB             4 (1 node)
35       512 GiB             8 (2 nodes)
36       1 TiB               16 (4 nodes)
37       2 TiB               32 (8 nodes)

Rule of thumb: state vector memory = 2^n × 16 bytes at complex128, where n is the number of qubits. GH200 HBM3 is about 95 GiB usable per GPU after driver overhead, so a single GPU holds a 32-qubit state at complex128 (~64 GiB used, ~30 GiB margin); spread across 4 GPUs the headroom covers 34 qubits comfortably (~64 GiB per GPU, ~30 GiB margin for swap buffers).
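
The rule of thumb is easy to turn into a quick sizing helper. The sketch below is an estimate only: it counts the state vector itself, takes the 95 GiB usable figure quoted above as a default, and rounds the GPU count up to a power of two. Framework workspace and communication buffers are not modeled.

```python
def state_vector_gib(num_qubits: int, bytes_per_amplitude: int = 16) -> float:
    """State vector size in GiB (16 bytes/amplitude = complex128, 8 = complex64)."""
    return (2 ** num_qubits) * bytes_per_amplitude / 2 ** 30


def min_gpus(num_qubits: int, usable_gib: float = 95.0,
             bytes_per_amplitude: int = 16) -> int:
    """Smallest power-of-two GPU count whose per-GPU share fits in usable_gib."""
    total = state_vector_gib(num_qubits, bytes_per_amplitude)
    gpus = 1
    while total / gpus > usable_gib:
        gpus *= 2
    return gpus


for n in (30, 33, 34, 35, 36, 37):
    print(f"{n} qubits: {state_vector_gib(n):7.0f} GiB, min GPUs = {min_gpus(n)}")
```

Passing bytes_per_amplitude=8 gives the complex64 figures, for example min_gpus(33, bytes_per_amplitude=8) for a single-precision 33-qubit state.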

Note

Aer, PennyLane, and CUDA-Q all default to complex128 (double precision). Switching to complex64 (single precision) halves the state vector memory — so a 33-qubit state fits on one GPU and a 35-qubit state fits on 4 — at the cost of single-precision amplitudes. Sampling-heavy and variational workloads usually tolerate single precision; precise expectation values and deep circuits where small-amplitude error accumulates may not. Verify the algorithm in complex128 before dropping to complex64. The knobs are PennyLane c_dtype=np.complex64, Aer precision='single', and CUDA-Q option="mgpu,fp32".

Choose a multi-node pattern based on workload type, not just qubit count:

Multi-node patterns by workload

Workload type: Distributed state vector
  Pattern: One process per GPU; the state is split across ranks via MPI. Cost grows with inter-node MPI traffic, so use this when the circuit doesn’t fit on a single node.
  Framework + backend on DeltaAI: PennyLane lightning.gpu with mpi=True; CUDA-Q nvidia with option="mgpu,fp64"

Workload type: Ensemble / parameter sweep
  Pattern: One independent simulator per MPI rank; mpi4py handles work distribution and result gather. Each rank’s circuit fits on one GPU, so scaling is embarrassingly parallel.
  Framework + backend on DeltaAI: Any single-node simulator. See Qiskit Aer for the worked recipe.

Workload type: Circuit batching
  Pattern: One process; CUDA-Q dispatches independent circuit evaluations across the node’s GPUs without MPI. Useful for variational parameter sweeps within a single node.
  Framework + backend on DeltaAI: CUDA-Q nvidia with option="mqpu"

Workload type: Tensor networks
  Pattern: Single device. Memory is independent of qubit count for low-entanglement circuits, so multi-GPU scaling is often unnecessary for circuits that fit this regime.
  Framework + backend on DeltaAI: PennyLane lightning.tensor; CUDA-Q tensornet

Important

Multi-GPU state vector simulations require a power-of-2 number of GPUs. Start with single-GPU to verify correctness before scaling to multiple GPUs.

Multi-node MPI troubleshooting

Two Cray MPICH environment variables address known multi-node failure modes on the Slingshot 11 / CXI fabric. They are not universal defaults — set them only when the symptom matches.

Important

MPI_Finalize crashes from mismatched CXI counter buffer sizes — set when a framework creates many MPI communicators (e.g. CUDA-Q nvidia-mgpu inside a variational loop) and segfaults at finalization:

$ export MPICH_OFI_CXI_COUNTER_REPORT=0

The DeltaAI CUDA-Q modules set this automatically. PennyLane and Qiskit Aer workflows do not typically need it.

Note

process_vm_readv: Operation not permitted errors when MPI ranks share a node come from Cray MPICH’s Cross-Memory Attach (CMA) intra-node optimization failing a kernel permission check. Disable CMA single-copy mode and fall back to a two-copy intra-node transfer:

$ export MPICH_SMP_SINGLE_COPY_MODE=NONE

Performance impact is negligible for quantum workloads where inter-node bandwidth is the bottleneck.

Open OnDemand JupyterLab with Quantum Environments

  1. (first time only) Create a Jupyter kernelspec for the desired environment:

    $ module load python/pennylane/0.44
    $ conda activate pennylane-0.44
    $ setup-kernel
    

    The setup-kernel command creates a kernelspec for the active conda environment and automatically adds environment variable settings needed to avoid runtime linking errors. See How to Customize JupyterLab with conda Environments for more general information on managing kernelspecs for custom conda environments.

  2. Refer to JupyterLab for instructions on starting a JupyterLab session from the DeltaAI Open OnDemand Dashboard

TensorFlow

Information on how to set up and run TensorFlow.

Containers

See Containers.

Jupyter Notebooks

Warning

This section is under construction.

Note

The DeltaAI Open OnDemand (OOD) dashboard provides an easy method to start a Jupyter notebook; this is the recommended method.

Go to OOD Jupyter interactive app for instructions on how to start an OOD JupyterLab session.

You can also customize your OOD JupyterLab environment; see How to Customize JupyterLab with conda Environments.

Do not run Jupyter on the shared login nodes. Instead, follow these steps to attach a Jupyter notebook running on a compute node to your local web browser:

How to Run Jupyter on a Compute Node

The Jupyter notebook executables are in your $PATH after loading a conda-based python module (the steps below use python/miniforge3_pytorch). If you run into problems from a previously saved Jupyter session (for example, you see paths where you do not have write permission), you may remove this file to get a fresh start: $HOME/.jupyter/lab/workspaces/default-*.

Follow these steps to run Jupyter on a compute node (CPU or GPU):

  1. On your local machine/laptop, open a terminal.

  2. SSH into DeltaAI. (Replace <my_deltaai_username> with your DeltaAI login username.)

    ssh <my_deltaai_username>@gh-login.delta.ncsa.illinois.edu
    
  3. Enter your NCSA password and complete the Duo MFA. Note, the terminal will not show your password (or placeholder symbols such as asterisks [*]) as you type.

    Warning

    If there is a conda environment active when you log into DeltaAI, deactivate it before you continue. You will know you have an active conda environment if your terminal prompt has an environment name in parentheses prepended to it, like these examples:

    (base) [<gh-login_username>@gh-login01 ~]$
    
    (mynewenv) [<gh-login_username>@gh-login01 ~]$
    

    Run conda deactivate until there is no longer a name in parentheses prepended to your terminal prompt. When you don’t have any conda environment active, your prompt will look like this:

    [<gh-login_username>@gh-login01 ~]$
    
  4. Load a python module that provides Jupyter. To see the available python modules, run module avail python. This example uses python/miniforge3_pytorch.

    module load python/miniforge3_pytorch
    
  5. Verify the module is loaded.

    module list
    
  6. Verify a jupyter-notebook is in your $PATH.

    which jupyter-notebook
    
  7. Generate a MYPORT number and copy it to a notepad (you will use it in steps 9 and 12).

    MYPORT=$(($(($RANDOM % 10000))+49152)); echo $MYPORT
    
  8. Find the account_name that you are going to use and copy it to a notepad (you will use it in step 9); your accounts are listed under Project when you run the accounts command.

    accounts
    
  9. Run the following srun command, with these replacements:

    • Replace <account_name> with the account you are going to use, which you found and copied in step 8.

    • Replace <$MYPORT> with the $MYPORT number you generated in step 7.

    • Modify the --partition, --gpus, --time, and --mem options and/or add other options to meet your needs.

    srun --account=<account_name> --partition=ghx4 --gpus=1 --time=00:30:00 --mem=32g jupyter-notebook --no-browser --port=<$MYPORT> --ip=0.0.0.0
    
  10. Copy the last 5 lines returned beginning with: “To access the notebook, open this file in a browser…” to a notepad (you will use this information in steps 12 and 14). (It may take a few minutes for these lines to be returned.)

    Note these two things about the URLs you copied:

    • The first URL begins with http://<ghXXX>.delta...; <ghXXX> is the internal hostname, which you will use in step 12.

    • The second URL begins with http://127.0..., you will use this entire URL in step 14.

  11. Open a second terminal on your local machine/laptop.

  12. Run the following ssh command, with these replacements:

    • Replace <my_deltaai_username> with your DeltaAI login username.

    • Replace <$MYPORT> with the $MYPORT number you generated in step 7.

    • Replace <ghXXX> with internal hostname you copied in step 10.

    ssh -l <my_deltaai_username> -L 127.0.0.1:<$MYPORT>:<ghXXX>.delta.ncsa.illinois.edu:<$MYPORT> gh-login.delta.ncsa.illinois.edu
    
  13. Enter your NCSA password and complete the Duo MFA. Note, the terminal will not show your password (or placeholder symbols such as asterisks [*]) as you type.

  14. Copy and paste the entire second URL from step 10 (begins with http://127.0...) into your browser. You will be connected to the Jupyter instance running on your DeltaAI compute node.

    Jupyter screenshot
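
The values collected in the steps above (port, username, internal hostname) combine into the tunnel command in a fixed way. The sketch below assembles that command as a string so the pieces are easy to see; it does not run ssh. The username and node name are hypothetical placeholders, and the port range mirrors the $RANDOM expression from step 7.

```python
import random
import shlex

LOGIN_HOST = "gh-login.delta.ncsa.illinois.edu"


def pick_port() -> int:
    """Random port in 49152..59151, the same range as the shell one-liner."""
    return random.randrange(49152, 59152)


def tunnel_command(username: str, compute_host: str, port: int) -> str:
    """Build the ssh port-forwarding command used in step 12."""
    forward = f"127.0.0.1:{port}:{compute_host}:{port}"
    return shlex.join(["ssh", "-l", username, "-L", forward, LOGIN_HOST])


# Hypothetical username and node name, for illustration only.
example = tunnel_command("my_user", "gh001.delta.ncsa.illinois.edu", 50000)
print(example)
print("a fresh port:", pick_port())
```

Keeping the same port number on both sides of the forward is what lets you paste the 127.0.0.1 URL printed by Jupyter into your local browser unchanged.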

How to Run Jupyter on a Compute Node, in an NGC Container

Follow these steps to run Jupyter on a compute node, in an NGC container:

  1. On your local machine/laptop, open a terminal.

  2. SSH into DeltaAI. (Replace <my_deltaai_username> with your DeltaAI login username.)

    ssh <my_deltaai_username>@gh-login.delta.ncsa.illinois.edu
    
  3. Enter your NCSA password and complete the Duo MFA. Note, the terminal will not show your password (or placeholder symbols such as asterisks [*]) as you type.

  4. Generate a $MYPORT number and copy it to a notepad (you will use it in steps 6, 8, and 14).

    MYPORT=$(($(($RANDOM % 10000))+49152)); echo $MYPORT
    
  5. Find the account_name that you are going to use and copy it to a notepad (you will use it in step 6); your accounts are listed under Project when you run the accounts command.

    accounts
    
  6. Run the following srun command, with these replacements:

    • Replace <account_name> with the account you are going to use, which you found and copied in step 5.

    • Replace <project_path> with the name of your projects folder (in two places).

    • Replace <$MYPORT> with the MYPORT number you generated in step 4.

    • Modify the --partition, --gpus, --time, --mem, and --gpus-per-node options and/or add other options to meet your needs.

    srun --account=<account_name> --partition=ghx4-interactive --gpus=1 --time=00:30:00 --mem=64g --gpus-per-node=1 apptainer run --nv --bind /projects/<project_path> /sw/user/NGC_containers/pytorch_24.07-py3.sif jupyter-notebook --notebook-dir /projects/<project_path> --no-browser --port=<$MYPORT> --ip=0.0.0.0
    
  7. Copy the last 2 lines returned (beginning with “Or copy and paste this URL…”) to a notepad. (It may take a few minutes for these lines to be returned.)

  8. Modify the URL you copied in step 7 by changing hostname:8888 to 127.0.0.1:<$MYPORT>. You will use the modified URL in step 16. (Replace <$MYPORT> with the $MYPORT number you generated in step 4.)

  9. Open a second terminal.

  10. SSH into DeltaAI. (Replace <my_deltaai_username> with your DeltaAI login username.)

    ssh <my_deltaai_username>@gh-login.delta.ncsa.illinois.edu
    
  11. Enter your NCSA password and complete the Duo MFA. Note, the terminal will not show your password (or placeholder symbols such as asterisks [*]) as you type.

  12. Find the internal hostname for your job and copy it to a notepad (you will use it in step 14).

    squeue -u $USER
    

    The value returned under NODELIST is the internal hostname for your GPU job (ghXXX). You can now close this terminal.

  13. Open a third terminal.

  14. Run the following ssh command, with these replacements:

    • Replace <my_deltaai_username> with your DeltaAI login username.

    • Replace <$MYPORT> with the $MYPORT number you generated in step 4.

    • Replace <ghXXX> with internal hostname you copied in step 12.

    ssh -l <my_deltaai_username> -L 127.0.0.1:<$MYPORT>:<ghXXX>.delta.internal.ncsa.edu:<$MYPORT> gh-login.delta.ncsa.illinois.edu
    
  15. Enter your NCSA password and complete the Duo MFA. Note, the terminal will not show your password (or placeholder symbols such as asterisks [*]) as you type.

  16. Copy and paste the entire modified URL (beginning with http://127.0...) from step 8 into your browser. You will be connected to the Jupyter instance running on your GPU node of DeltaAI.

    Jupyter screenshot
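
The URL edit in step 8 is a simple host-and-port substitution that leaves the path and token untouched. As a sketch, the same rewrite can be done with the standard library; the URL and token below are made-up examples, not real output.

```python
from urllib.parse import urlsplit, urlunsplit


def localize_url(url: str, port: int) -> str:
    """Replace the host:port part of a Jupyter URL with 127.0.0.1:<port>."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=f"127.0.0.1:{port}"))


# Made-up URL and token, for illustration only.
copied = "http://hostname:8888/?token=abc123"
print(localize_url(copied, 50000))
```

Only the network location changes, so the token that Jupyter printed remains valid for the tunneled connection.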

List of Installed Software (CPU & GPU)

See: module avail.