Advanced Apptainer usage
Running GPU programs: using CUDA
Bind-mounting GPU drivers
Apptainer supports NVIDIA GPUs by bind-mounting the GPU drivers and the base CUDA libraries into the container. The --nv flag does this transparently for the user, e.g.
apptainer pull tensorflow.sif docker://tensorflow/tensorflow
ls -l tensorflow.sif # 415M
salloc --gres=gpu:p100:1 --cpus-per-task=8 --mem=40Gb --time=2:0:0 --account=...
apptainer exec --nv -B /scratch/${USER}:/scratch tensorflow.sif python my-tf.py
Use --nv to expose the NVIDIA hardware devices to the container.
Pre-built NVIDIA containers
NVIDIA NGC (NVIDIA’s hub for GPU-optimized software) provides prebuilt containers for a large number of HPC applications. Try searching for TensorFlow, GAMESS (quantum chemistry), GROMACS and NAMD (molecular dynamics), VMD, ParaView, NVIDIA IndeX (visualization). Their GPU-accelerated containers are quite large, so it might take a while to build them, e.g.
apptainer pull tensorflow-22.06-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:22.06-tf1-py3
ls -l tensorflow-22.06-tf1-py3.sif # 5.9G (very large!)
Demo: Chapel GPU container
Recently, I built a container for using GPUs from the Chapel programming language. There are some issues compiling Chapel with LLVM and GPU support on InfiniBand clusters, so I bypassed them by creating a container that can be used on any of our clusters, whether it runs on an InfiniBand, Ethernet, or OmniPath interconnect.
- the container can be used only on one node (no multi-node parallelism), but it does support multiple GPUs
- installed Chapel into an overlay image (we’ll study overlays below)
Here is how I would use it on Cedar:
cd ~/scratch
salloc --time=0:30:0 --nodes=1 --cpus-per-task=1 --mem-per-cpu=3600 --gpus-per-node=v100l:1 \
--account=cc-debug --reservation=asasfu_756
nvidia-smi # verify that we can see the GPU on the node
git clone ~/chapelBare $SLURM_TMPDIR/
module load apptainer
export SRC=/project/6003910/razoumov/apptainerImages/chapelGPU20240826
apptainer shell --nv -B $SLURM_TMPDIR --overlay $SRC/extra.img:ro $SRC/almalinux.sif
nvidia-smi # verify that we can see the GPU inside the container
source /extra/c4/chapel-2.1.0/util/setchplenv.bash
export CHPL_GPU=nvidia
export CHPL_CUDA_PATH=/usr/local/cuda-12.4
export PATH=$PATH:/usr/local/cuda-12.4/bin
cd $SLURM_TMPDIR/gpu
chpl --fast probeGPU.chpl -L/usr/local/cuda-12.4/targets/x86_64-linux/lib/stubs
./probeGPU
Let me know if you are interested in playing with it on a machine with an NVIDIA GPU, and I can share both container files with you.
Running MPI programs from within a container
MPI (message passing interface) is the industry standard for distributed-memory parallel programming. There are several implementations: OpenMPI, MPICH, and a few others.
MPI libraries on HPC systems usually depend on various lower-level runtime libraries – interconnect, RDMA (Remote Direct Memory Access), PMI (process management interface) and others – that vary from one HPC cluster to another, so they are hard to containerize. Thus, no generic --mpi flag could be implemented for containers that would work across the network on different HPC clusters.
The official Apptainer documentation provides a good overview of running MPI codes inside containers. There are 3 possible modes of running MPI programs with Apptainer:
1. Rely on MPI inside the container
In the MPI-inside-the-container mode you would start a single Apptainer process on the host:
apptainer exec -B ... --pwd ... container.sif mpirun -np $SLURM_NTASKS ./mpicode
The cgroup limits from the Slurm job are passed into the container, which sets the number of CPU cores available inside the container. In this setup the command mpirun uses all CPU cores available to the job.
- limited to a single node
- no need to adapt the container's MPI to the host; just install SSH into the container (see the sketch after this list)
- you can build a generic container that will work across multiple HPC clusters (each with a different setup)
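For illustration, a minimal definition file for this self-contained mode might look like the sketch below; the package names are assumptions for an Ubuntu base image, so adjust them to whatever distribution you bootstrap from:

Bootstrap: docker
From: ubuntu:22.04

%post
    apt-get update -y
    # MPI plus an SSH client inside the container (package names are assumptions)
    apt-get install -y --no-install-recommends \
        build-essential openmpi-bin libopenmpi-dev openssh-client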
2. Hybrid mode
Hybrid mode uses the host's MPI to spawn MPI processes, and the MPI inside the container to compile the code and provide the runtime MPI libraries. In this mode you would start a separate Apptainer process for each MPI rank:
mpirun -np $SLURM_NTASKS apptainer exec -B ... --pwd ... container.sif ./mpicode
- can span multiple nodes
- the container's MPI should be configured to support the same process management mechanism and version (e.g. PMI2 / PMIx) as the host – not that difficult with a little bit of technical knowledge (reach out to us for help); see the example after this list
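For example, on a Slurm cluster you could let srun do the process management, telling it which PMI flavour to use; this assumes the container's MPI was built with matching PMI2 support:

srun --mpi=pmi2 apptainer exec -B /scratch container.sif ./mpicode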
3. Bind mode
In bind mode you bind-mount the host's MPI libraries and drivers into the container and use exclusively those, i.e. there is no MPI inside the container. The MPI code needs to be compiled (when building the container) with a version of MPI similar to the host's – typically that MPI resides on the build node used to create the container, but is not installed inside the container itself.
I have zero experience with this mode, so I won’t talk about it here in detail.
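For completeness, a bind-mode launch might look something like the following sketch; the MPI path is a made-up example, and the actual list of libraries and drivers to bind is host-specific:

export HOSTMPI=/opt/openmpi   # hypothetical location of the host's MPI stack
mpirun -np $SLURM_NTASKS apptainer exec -B $HOSTMPI:$HOSTMPI \
    --env LD_LIBRARY_PATH=$HOSTMPI/lib container.sif ./mpicode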
Example: hybrid-mode MPI
For this course I have already built an MPI container that can talk to the MPI on the training cluster. For those interested, you can find the detailed instructions here, but in a nutshell I created a definition file that:
- bootstraps from docker://ubuntu:22.04
- installs the necessary fabric and PMI packages, the Slurm client, and a few other requirements
and then used it to build mpi.sif, which I copied over to the training cluster into /project/def-sponsor00/shared.
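A stripped-down sketch of such a definition file might look like this (the package names are assumptions for Ubuntu 22.04; the real file in the detailed instructions has more):

Bootstrap: docker
From: ubuntu:22.04

%post
    apt-get update -y
    # fabric, PMI and Slurm client packages (names are assumptions)
    apt-get install -y --no-install-recommends \
        build-essential openmpi-bin libopenmpi-dev \
        libpmi2-0-dev slurm-client

You would then build it with something like apptainer build mpi.sif mpi.def (requires root or --fakeroot). On the training cluster you can simply use the prebuilt image: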
cd ~/tmp
module load openmpi apptainer
unzip /project/def-sponsor00/shared/introHPC.zip codes/distributedPi.c
cd codes
mkdir -p ~/.openmpi
echo "btl_vader_single_copy_mechanism=none" >> ~/.openmpi/mca-params.conf
export PMIX_MCA_psec=native # allow mpirun to use host's PMI
export CONTAINER=/project/def-sponsor00/shared/mpi.sif
apptainer exec $CONTAINER mpicc -O2 distributedPi.c -o distributedPi
salloc --ntasks=4 --time=0:5:0 --mem-per-cpu=1200
mpirun -np $SLURM_NTASKS ./distributedPi # error: compiled for Ubuntu, cannot run on Rocky Linux
mpirun -np $SLURM_NTASKS apptainer exec $CONTAINER ./distributedPi
Example: WRF container with self-contained MPICH
These instructions describe building a WRF Apptainer image following this build script. This container is large (8.1GB compressed SIF file, 47GB uncompressed sandbox) and includes everything but the kitchen sink, among other things multiple Perl and Python 3 libraries and 3rd-party packages. It was created for a support ticket, but what’s important for us is that it also installs MPICH entirely inside the container, not relying on the host’s OpenMPI. This means that we’ll be limited to MPI runs on a single node.
To run an MPI code inside this container, it is important to pass -e to Apptainer to avoid loading MPI from the host:
cd ~/scratch
module load apptainer/1.2.4
salloc --time=1:0:0 --ntasks=4 --mem-per-cpu=3600 --account=def-razoumov-ac
export APPTAINERENV_NTASKS=$SLURM_NTASKS
apptainer shell -e --pwd $PWD wrf.sif
export PATH=/data/WRF/Libs/MPICH/bin:$PATH
cat << EOF > distributedPi.c
#include <stdio.h>
#include <math.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
  double total, h, sum, x;
  long long int i, n = 1e10;
  int rank, numprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  h = 1./n;
  sum = 0.;
  if (rank == 0)
    printf("Calculating PI with %d processes\n", numprocs);
  printf("process %d started\n", rank);
  for (i = rank+1; i <= n; i += numprocs) {
    x = h * (i - 0.5);   // calculate at the center of the interval
    sum += 4.0 / (1.0 + pow(x, 2));
  }
  sum *= h;
  MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("%.17g\n", total);
  MPI_Finalize();
  return 0;
}
EOF
mpicc distributedPi.c -o mpi
mpirun -np $NTASKS ./mpi
Overlays
In the container world, an overlay image is a file formatted as a filesystem. To the host filesystem it is a single file. When you mount it into a container, the container will see a filesystem with many files.
An overlay mounted into an immutable SIF image lets you store files without rebuilding the image. For example, you can store your computation results, or compile/install software into an overlay.
An overlay can be:
- a standalone writable ext3 filesystem image (most useful),
- a sandbox directory,
- a writable ext3 image embedded into the SIF file.
If you write millions of files, do not store them on a cluster filesystem – instead, use an Apptainer overlay file for that. Everything inside the overlay will appear as a single file to the cluster filesystem.
The direct apptainer overlay command requires Singularity 3.8 / Apptainer 1.0 or later and a relatively recent set of filesystem tools, e.g. it won’t work in a CentOS 7 VM. It should work on a VM or a cluster running Rocky Linux 8.5 or later.
cd ~/tmp
apptainer pull ubuntu.sif docker://ubuntu:latest
module load apptainer
salloc --time=0:30:0 --mem-per-cpu=3600
apptainer overlay create --size 512 small.img # create a 0.5GB overlay image file
apptainer shell --overlay small.img ubuntu.sif
Inside the container any newly-created top-level directory will go into the overlay filesystem:
Apptainer> df -kh # the overlay should be mounted inside the container
Apptainer> mkdir -p /data # by default this will go into the overlay image
Apptainer> cd /data
Apptainer> df -kh . # using overlay; check for available space
Apptainer> for num in $(seq -w 00 19); do
echo $num
# generate a binary file (1-33)MB in size
dd if=/dev/urandom of=test"$num" bs=1024 count=$(( RANDOM + 1024 ))
done
Apptainer> df -kh . # should take ~300-400 MB
If you exit the container and then mount the overlay again, your files will be there:
apptainer shell --overlay small.img ubuntu.sif
Apptainer> ls /data # here is your data
You can also create a new overlay image with a directory inside with something like:
apptainer overlay create --create-dir /data --size 512 overlay.img # create an overlay with a directory
If you want to mount the overlay in the read-only mode:
apptainer shell --overlay small.img:ro ubuntu.sif
Apptainer> touch /data/test.txt # error: read-only file system
You can mount many read-only overlays into the same container at the same time, but only one writable overlay.
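For example (conda.img and results.img are just hypothetical overlay files here; only the last one is writable):

apptainer shell --overlay small.img:ro --overlay conda.img:ro --overlay results.img ubuntu.sif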
To see the help page on overlays (these two commands are equivalent):
apptainer help overlay create
apptainer overlay create --help
Sparse overlay images
Sparse images use disk space more efficiently when the blocks allocated to them are mostly empty. As you add more data to a sparse image, it can grow (but not shrink!) in size. Let’s create a sparse overlay image:
apptainer overlay create --size 512 --sparse sparse.img
ls -l sparse.img # its apparent size is 512MB
du -h --apparent-size sparse.img # same
du -h sparse.img # its actual size is much smaller (17MB)
Let’s mount it and fill it with some data, this time creating fewer files:
apptainer shell --overlay sparse.img ubuntu.sif
Apptainer> mkdir -p /data && cd /data
Apptainer> for num in $(seq -w 0 4); do
echo $num
# generate a binary file (1-33)MB in size
dd if=/dev/urandom of=test"$num" bs=1024 count=$(( RANDOM + 1024 ))
done
Apptainer> df -kh . # should take ~75-100 MB, pay attention to "Used"
du -h sparse.img # shows actual usage
Be careful with sparse images: not all tools (e.g. backup/restore, scp, sftp, gunzip) recognize sparse files ⇒ this can potentially lead to data loss and other problems.
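If you do need to copy a sparse image around, some standard tools can preserve or restore the holes, e.g.:

cp --sparse=always sparse.img copy.img    # re-sparsify while copying
tar -Scf sparse.tar sparse.img            # -S stores sparse files efficiently
rsync --sparse sparse.img dest:~/tmp/     # recreate holes on the receiving side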
Example: installing Conda into an overlay
Installing native Anaconda on HPC clusters is a bad idea for a number of reasons. Instead of Conda, we recommend using virtualenv together with our pre-compiled Python wheels to install Python packages into your own virtual environments.
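For reference, that recommended native workflow looks something like this (the module version is an assumption):

module load python/3.11
virtualenv --no-download ~/env1
source ~/env1/bin/activate
pip install --no-index numpy   # --no-index installs from the pre-built wheels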
One of the reasons we do not recommend Conda is that it creates a large number of files in your directories. You can alleviate this problem by hiding Conda files inside an overlay image:
- takes a couple of minutes, results in 22k+ files that are hidden from the host
- no need for root, as you don’t modify the container image
- still might not be the most efficient use of resources (non-optimized binaries)
Here is one way you could install Conda into an overlay image:
cd ~/tmp
apptainer pull ubuntu.sif docker://ubuntu:latest
apptainer overlay create --size 1200 conda.img # create a 1200M overlay image
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
apptainer shell --overlay conda.img -B /home ubuntu.sif
Apptainer> mkdir /conda && cd /conda
Apptainer> df -kh .
Apptainer> bash /home/${USER}/tmp/miniconda.sh
- agree to the license
- use /conda/miniconda3 as the installation path
- answer no when asked to initialize Miniconda3
Apptainer> find /conda/miniconda3/ -type f | wc -l # 22,637 files
Apptainer> df -kh . # uses ~695M once finished, but more was used during installation
These 22k+ files appear as a single file to the Lustre metadata server (which is great!).
apptainer shell ubuntu.sif
Apptainer> ls /conda # no such file or directory
apptainer shell --overlay conda.img ubuntu.sif
Apptainer> source /conda/miniconda3/bin/activate
(base) Apptainer> type python # /conda/miniconda3/bin/python
(base) Apptainer> python # works
If you want to install large Python packages, you probably want to resize the image:
e2fsck -f conda.img # check your overlay's filesystem first (required step)
resize2fs -p conda.img 2G # resize your overlay
ls -l conda.img # should be 2GB
Next, mount the resized overlay into the container, and make sure to pass the -C flag to force writing config files locally (and not to the host):
apptainer shell -C --overlay conda.img ubuntu.sif # /home/$USER won't be available
Apptainer> cd /conda/miniconda3
Apptainer> source bin/activate
(base) Apptainer> conda install numpy
(base) Apptainer> df -kh . # so far used 1.8G out of 2.0G
(base) Apptainer> python
>>> import numpy as np
>>> np.pi
Running container instances
You can also run backgrounded processes within your container while not being inside the container. You can start/terminate these with instance start / instance stop. All these processes will terminate once your job ends.
module load apptainer
salloc --cpus-per-task=1 --time=0:30:0 --mem-per-cpu=3600
apptainer instance start ubuntu.sif test01 # start a container instance test01
apptainer shell instance://test01 # start an interactive shell in that instance
bash -c 'for i in {1..60}; do sleep 1; echo $i; done' > dump.txt & # start a 60-sec background process
exit # and then exit; the instance and the process are still running
apptainer exec instance://test01 tail -3 dump.txt # check on the process in that instance
apptainer exec instance://test01 tail -3 dump.txt # and again
apptainer shell instance://test01 # poke around the instance
apptainer instance list
apptainer instance stop test01
Best practices on production clusters
Do not build containers on networked filesystems
Don’t use /home, /scratch, or /project to build a container – instead, always use a local disk, e.g. /localscratch on login nodes or $SLURM_TMPDIR inside a Slurm job. After having built the container, you can move it to a regular filesystem.
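On a login node this might look like the following (the directory layout under /localscratch and the file names are assumptions):

mkdir -p /localscratch/$USER/build && cd /localscratch/$USER/build
apptainer build image.sif image.def
mv image.sif ~/scratch/   # move the finished SIF to a regular filesystem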
The importance of temp space when running large workflows
By default, Apptainer allocates some temporary space for its internal use in /tmp, which is often in RAM and is very limited. When it becomes full, Apptainer will stop working, so you might want to give it another, larger temporary space via the -W flag. In practice, this means doing something like:
- on your own computer or on a production cluster’s login node:
mkdir /localscratch/tmp
apptainer shell/exec/run ... -W /localscratch/tmp <image.sif>
- inside a Slurm job:
mkdir $SLURM_TMPDIR/tmp
apptainer shell/exec/run ... -W $SLURM_TMPDIR/tmp <image.sif>
Regard /tmp inside the container as temporary space: any files you put there will disappear the next time you start the container.
You can use an environment variable in lieu of -W:
export APPTAINER_TMPDIR=$SLURM_TMPDIR/tmp
apptainer shell/exec/run ... <image.sif>
Sample job submission script
Of all the clusters in the Alliance, only Cedar has Internet access from compute nodes – this might limit your options for where to build a container. You can move your SIF file to other clusters after having built it.
#!/bin/bash
#SBATCH --time=...
#SBATCH --mem=...
#SBATCH --account=def-...
cd $SLURM_TMPDIR
mkdir -p tmp cache
export APPTAINER_TMPDIR=${PWD}/tmp
export APPTAINER_CACHEDIR=${PWD}/cache # replaces default `$HOME/.apptainer/cache`
<build the container in this directory> # run on Cedar if docker:// access is needed
<run your workflow inside the container>
<copy out your results>
Placeholder: running multi-locale Chapel from a container
- Useful if Chapel is not installed natively at your HPC centre.
- Somewhat tricky for multi-locale Chapel due to its dependence on the cluster’s parallel launcher and interconnect.
- Piece of cake for single-locale Chapel and for emulated multi-locale Chapel.