Containers
Containerization is a modern software packaging and execution technology that allows scripts and executables to be distributed not only with libraries and other dependencies but with a complete Linux operating system environment. Unlike virtual machines, which run a separate kernel on virtual processors, containerized applications share the host kernel and therefore incur practically no overhead.
The Hydro cluster supports containers via Apptainer (formerly Singularity), which is like Docker but specialized for traditional HPC environments. Apptainer distinguishes itself in that root/sudo authorization is not required to either run or (as of version 1.1) build containers (technical details in Apptainer Without Setuid - Dave Dykstra).
Apptainer 1.2 is installed on all Hydro login and compute nodes at /usr/bin/apptainer.
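You can confirm which version is available on a node with:
apptainer --version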
When interpreting the Apptainer documentation, it is occasionally helpful to know that Apptainer on Hydro runs in non-suid mode.
See the Apptainer v1.2.0 release notes for information on the changes from Apptainer 1.1. One notable improvement is that a $PWD under /projects is now bind-mounted by default. However, a $PWD under $HOME will be bind-mounted even if --no-home or --no-mount home is specified, so --no-mount home,cwd or --contain must be used instead.
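For example, to run with a current working directory under $HOME without mounting it into the container (image.sif is only a placeholder name):
apptainer run --no-mount home,cwd image.sif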
Using Docker Images with Apptainer
Option 1 - Just run it:
apptainer run docker://rockylinux:8
Images are cached in $APPTAINER_CACHEDIR if set, or in $HOME/.apptainer/cache by default.
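If you want the cache somewhere other than your home directory, set the variable before pulling (the path below is only illustrative):
export APPTAINER_CACHEDIR=/tmp/$USER/apptainer-cache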
Option 2 - Download to Singularity Image Format (SIF) file and run:
apptainer pull docker://rockylinux:8
apptainer run rockylinux_8.sif
A SIF file can also be run directly (assuming execute permission):
./rockylinux_8.sif
Option 3 - Download to local sandbox directory and modify:
apptainer build --sandbox /tmp/rocky docker://rockylinux:8
apptainer exec --fakeroot --writable /tmp/rocky yum install -y which
apptainer run --fakeroot --writable /tmp/rocky
You can test the sandbox as a normal user in read-only mode:
apptainer run /tmp/rocky
The Lustre home and projects filesystems lack xattr support, which results in a long stream of error messages from apptainer build and causes yum install transaction failures. It is therefore necessary to use a writable local filesystem (/tmp) for sandboxes, and then convert the image to a SIF file on a cross-node filesystem for future use:
apptainer build --fakeroot newrocky.sif /tmp/rocky
Option 4 - Convert Dockerfile to Apptainer definition file and build:
Singularity Python provides a recipe converter from Dockerfile format to Apptainer definition file format. The converter greatly simplifies the process but isn’t perfect, particularly when files are copied using relative paths.
pip3 install spython --user
spython recipe Dockerfile image.def
apptainer build image.sif image.def
Interacting with Host Filesystems
Apptainer will bind-mount $HOME, $PWD, and /tmp into the container by default. Additional directories may be mounted with --bind src[:dest[:ro]], and the default mounts may be suppressed with --no-mount home,cwd,tmp or --contain.
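For example, to add a read-only mount of a project directory (the host path is only illustrative):
apptainer run --bind /projects/myproject:/data:ro rockylinux_8.sif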
Note that --no-mount home or --no-home will only disable mounting of the home directory if it is not also the current working directory.
The caller’s current user and group will appear unchanged, but all other users and groups will appear as nobody. (With the --fakeroot option, $HOME will be mounted as /root and the caller’s user and group will be mapped to root.)
Regardless of apparent user and group, processes inside a container have the caller’s full read and write capabilities on mounted host filesystems.
See the Apptainer user guide - Bind Paths and Mounts for details.
Mounting Images of Many-File Datasets
Shared network filesystems, such as Lustre used for home and projects, incur much higher latencies opening and closing files than local filesystems do. For this reason, workflows that process many small files can run orders of magnitude slower on a cluster than on a desktop workstation.
As described in the Apptainer user guide - Image Mounts, Apptainer can bind-mount image files in standard ext3 and squashfs formats as well as its own SIF format. An image file can contain millions of tiny files while providing the simplicity and performance of a single large file. Each image file can safely be mounted either read-write by a single container or read-only by many containers (but not both at the same time).
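As a hedged sketch, a dataset directory could be packed into a squashfs image once and then mounted read-only inside a container (names and paths are illustrative, and mksquashfs from squashfs-tools is assumed to be available; see the Image Mounts documentation for the full set of options):
mksquashfs mydataset/ mydataset.sqsh
apptainer exec --bind mydataset.sqsh:/data:image-src=/ rockylinux_8.sif ls /data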
Running with GPU Acceleration
Apptainer GPU support is described in detail in the Apptainer user guide - GPU Support, but adding --nv should just work, assuming that GPUs were correctly requested in the Slurm submission options. Devices visible with nvidia-smi outside a container should be visible inside a container launched with --nv.
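For example, a quick check from inside a job that has requested GPUs (the image name is only a placeholder):
apptainer exec --nv rockylinux_8.sif nvidia-smi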
Images based on Alpine Linux may not work correctly with --nv (reporting nvidia-smi: not found). If this happens, try an image based on another Linux distribution such as Ubuntu.
The NVIDIA HPC SDK container distribution includes directions for running with Singularity that can be used as-is with Apptainer (/usr/bin/singularity is a symbolic link to apptainer).
Note that by default Apptainer passes through most environment variables, including CC, CXX, FC, and F77 from the gcc module and MPICC, MPICXX, MPIF77, and MPIF90 from the openmpi module, which will mislead cmake and configure scripts into attempting to use compilers in /sw/spack/... that are not available in the container. This can be prevented by either running module unload gcc openmpi or running Apptainer with the --cleanenv option.
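For example, either of the following avoids the problem (rockylinux_8.sif is only a placeholder image):
module unload gcc openmpi
apptainer shell rockylinux_8.sif
or
apptainer shell --cleanenv rockylinux_8.sif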
Running on Multiple Nodes with MPI
The many limitations and pitfalls of combining containers and MPI are detailed in the Apptainer user guide - MPI. To summarize, the MPI library used inside the container must be compatible with both the host mpiexec or srun program used to launch the container and with the host high-speed network. Images based on the latest OpenMPI release seem likely to work.
The NVIDIA GPU Cloud (NGC) HPC benchmark HPL image can be launched within a Slurm job by:
srun --mpi=pmi2 --cpu-bind=none apptainer run --nv NGC/hpc-benchmarks\:21.4-hpl hpl.sh ...
The job script sets all the node counts, task counts, and so on, but the hpl.sh script uses numactl, so both CPU and GPU binding must be disabled. The --mpi=pmi2 option overrides Hydro’s default pmix; if there is a failure, the pmi signal handling doesn’t work and the run hangs rather than exiting.
The Extreme-scale Scientific Software Stack (E4S) image just works out of the box. The image is 40 GB, so the box is pretty big, but spack list shows over 6,000 packages that you can spack load (and in some cases module load) to run directly or to build into your own program on a host filesystem.
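As a sketch of exploring the stack interactively (hdf5 is only an illustrative package; the commands after apptainer shell are typed at the container prompt):
apptainer shell e4s-cuda-x86_64-22.08.sif
spack list | wc -l
spack load hdf5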
MPI applications can be launched inside the container by:
mpiexec ... apptainer exec e4s-cuda-x86_64-22.08.sif myprog ...
While the --cleanenv option can prevent interaction with the Hydro module system when building software, in a parallel job it blocks environment variables needed by MPI, resulting in many independent processes rather than a single unified MPI launch.
Accessing Hydro Modules in a Container
The following Apptainer definition file will build an image that is compatible with the Hydro base OS and modules, including the MPI library, if launched with the --bind and --env options shown in the %help section.
The definition file can be extended to yum install additional packages to augment the Hydro software stack when building and running software in a container.
Bootstrap: docker
From: rockylinux:8
%post
# for Lmod
yum install -y lua
yum install -y epel-release
/usr/bin/crb enable
yum repolist
yum install -y Lmod
# useful
yum install -y which
yum install -y make
yum install -y findutils
yum install -y glibc-headers
yum install -y glibc-devel
yum install -y tcl-devel
# for MPI
yum install -y hwloc-libs
yum install -y ucx
yum install -y libevent
# for GDAL
yum install -y libtiff
yum install -y libpng
%help
Enables host modules and MPI in container.
Recommended apptainer launch options are:
--bind /sw \
--bind /usr/lib64/liblustreapi.so.1 \
--bind /usr/lib64/libpmix.so.2 \
--bind /usr/lib64/pmix \
--env PREPEND_PATH="$PATH" \
--env LD_LIBRARY_PATH="$LD_LIBRARY_PATH"
Should work with GPUs if --nv added.
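As a sketch, assuming the definition file above is saved as hydro-modules.def (the filenames are illustrative), the image can be built and launched with the recommended options like this:
apptainer build hydro-modules.sif hydro-modules.def
apptainer shell \
  --bind /sw \
  --bind /usr/lib64/liblustreapi.so.1 \
  --bind /usr/lib64/libpmix.so.2 \
  --bind /usr/lib64/pmix \
  --env PREPEND_PATH="$PATH" \
  --env LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
  hydro-modules.sif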