TensorFlow on DeltaAI

Summary

  • The supported way to run TensorFlow is an NGC container such as tensorflow_24.09-tf2-py3.sif (in /sw/user/NGC_containers).

  • Power users will hit errors or failed installs when trying to build their own environments beyond the container.

    • To work around this, pip install --user into $HOME or a $PYTHONUSERBASE directory (see below).

  • jupyter-notebook is in the container.

  • Remember to add the --nv flag to the srun apptainer command line when using any NGC container.

Run TensorFlow

Warning

TensorFlow on DeltaAI must use the NGC container. NVIDIA has told us that it is not possible to get a GPU-enabled TensorFlow by other means (pip or conda installs), and DeltaAI admins have confirmed that locally. If you install TensorFlow on your own, it fails at runtime with this error:

> python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
<jemalloc>: Unsupported system page size
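
For contrast, the same GPU check can be run through the NGC container under srun. This is a minimal sketch, not an official recipe: the account and partition values are placeholders you must replace, and the resource numbers (one GPU, 64 cores, ten minutes) are assumptions chosen for a quick test.

```shell
#!/bin/sh
# Placeholders (assumptions): substitute your own Slurm account and partition.
ACCOUNT="${ACCOUNT:-your_account}"
PARTITION="${PARTITION:-your_partition}"
SIF=/sw/user/NGC_containers/tensorflow_24.09-tf2-py3.sif

# The same one-line GPU visibility check as the failing command above.
CHECK='import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'

if command -v srun >/dev/null 2>&1; then
    # --nv makes the host NVIDIA driver and GPUs available inside the container.
    srun --account="$ACCOUNT" --partition="$PARTITION" \
         --nodes=1 --gpus-per-node=1 --cpus-per-task=64 --time=00:10:00 \
         apptainer exec --nv "$SIF" python3 -c "$CHECK"
else
    echo "srun not found; run this from a DeltaAI login node" >&2
fi
```

If the container and --nv are working, the check should print a non-empty device list rather than the jemalloc error.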

Customization

The container does not support Python venv (it is not installed), and conda is not available inside the container. Instead, set the PYTHONUSERBASE environment variable to a (possibly shared) path where additions to the TensorFlow container's Python will be installed. If you are using a Jupyter notebook, choose "restart kernel" from the menu to make your changes visible to Jupyter. See also: PYTHONUSERBASE.

Installing from within the Container

arnoldg@gh001:~> export PYTHONUSERBASE=/projects/bbka/arnoldg/tensorflow_modules
arnoldg@gh001:~> apptainer shell --bind /projects /sw/user/NGC_containers/tensorflow_24.09-tf2-py3.sif
Apptainer> pip install --user matplotlib
...
Successfully installed contourpy-1.2.1 cycler-0.12.1 fonttools-4.53.1 kiwisolver-1.4.5 matplotlib-3.9.0 pillow-10.4.0
Apptainer> python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Could not open PYTHONSTARTUP
FileNotFoundError: [Errno 2] No such file or directory: '/etc/pythonstart'
>>> import matplotlib
>>> exit()
Apptainer> echo $PYTHONUSERBASE
/projects/bbka/arnoldg/tensorflow_modules
Apptainer> ls $PYTHONUSERBASE/lib/python3.10/site-packages/
PIL                        fontTools                   mpl_toolkits
__pycache__                fonttools-4.53.1.dist-info  pillow-10.4.0.dist-info
contourpy                  kiwisolver                  pillow.libs
contourpy-1.2.1.dist-info  kiwisolver-1.4.5.dist-info  pylab.py
cycler                     matplotlib
cycler-0.12.1.dist-info    matplotlib-3.9.0.dist-info
Apptainer>
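
The interactive session above can also be run non-interactively, which is handy in job scripts. The following is a sketch under the assumption that apptainer passes the host's PYTHONUSERBASE into the container, as it does in the session above. The project path is the one from that session and should be replaced with your own.

```shell
#!/bin/sh
# Path taken from the session above; point it at your own project directory.
export PYTHONUSERBASE="${PYTHONUSERBASE:-/projects/bbka/arnoldg/tensorflow_modules}"
SIF=/sw/user/NGC_containers/tensorflow_24.09-tf2-py3.sif

if command -v apptainer >/dev/null 2>&1 && [ -f "$SIF" ]; then
    # Install into $PYTHONUSERBASE using the container's pip.
    apptainer exec --bind /projects "$SIF" pip install --user matplotlib
    # Confirm the import resolves to a file under $PYTHONUSERBASE.
    apptainer exec --bind /projects "$SIF" \
        python3 -c 'import matplotlib; print(matplotlib.__file__)'
else
    echo "apptainer or $SIF not found; run this on DeltaAI" >&2
fi
```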

Package Install Location

arnoldg@gh001:~/.local/lib/python3.10/site-packages> pwd
/u/arnoldg/.local/lib/python3.10/site-packages
arnoldg@gh001:~/.local/lib/python3.10/site-packages> ls
contourpy                  fontTools                   matplotlib                  pillow-10.4.0.dist-info
contourpy-1.2.1.dist-info  fonttools-4.53.1.dist-info  matplotlib-3.9.1.dist-info  pillow.libs
cycler                     kiwisolver                  mpl_toolkits                __pycache__
cycler-0.12.1.dist-info    kiwisolver-1.4.5.dist-info  PIL                         pylab.py
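
The listing above is under ~/.local, which is Python's default user base on Linux when PYTHONUSERBASE is not set. One way to check where pip install --user will write is to ask Python's site module. In this sketch the /tmp path is a hypothetical value used only for illustration.

```shell
#!/bin/sh
# With PYTHONUSERBASE unset, Python reports its default user base.
( unset PYTHONUSERBASE; python3 -c 'import site; print(site.getuserbase())' )

# Hypothetical path for illustration: the variable overrides the default.
DEMO_BASE=/tmp/tf_userbase_demo
PYTHONUSERBASE="$DEMO_BASE" python3 -c 'import site; print(site.getuserbase())'

# The user site-packages directory sits below the user base.
PYTHONUSERBASE="$DEMO_BASE" python3 -c 'import site; print(site.getusersitepackages())'
```

Run the same commands inside the container (apptainer shell or apptainer exec) to see the paths that the container's Python will actually use.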

Runtime Items of Note

Request plenty of CPU cores when running this container (for example, --cpus-per-task=64). It takes quite a few ARM cores to keep the H100 GPUs working at peak.
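
A batch job that follows this advice might look like the sketch below. The account, partition, time limit, and train.py script name are all hypothetical placeholders, and only --cpus-per-task=64, --nv, and the container path come from this page.

```shell
#!/bin/bash
#SBATCH --job-name=tf-ngc
# Placeholders (assumptions): replace the account and partition with your own.
#SBATCH --account=your_account
#SBATCH --partition=your_partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
# CPU cores that feed the GPU, per the note above.
#SBATCH --cpus-per-task=64
#SBATCH --time=00:30:00

SIF=/sw/user/NGC_containers/tensorflow_24.09-tf2-py3.sif
TRAIN_SCRIPT="${TRAIN_SCRIPT:-train.py}"   # hypothetical script name

# Slurm sets SLURM_CPUS_PER_TASK inside the job; fall back to 64 elsewhere.
CPUS="${SLURM_CPUS_PER_TASK:-64}"
echo "Running $TRAIN_SCRIPT with $CPUS CPU cores per task"

# Launch only inside a Slurm job; pass the core count on to srun explicitly.
if [ -n "$SLURM_JOB_ID" ]; then
    srun --cpus-per-task="$CPUS" \
         apptainer exec --nv --bind /projects "$SIF" python3 "$TRAIN_SCRIPT"
fi
```

Submit it with sbatch. Outside a Slurm job the script only prints the settings it would use.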