A couple of side training projects that use Chapel
September 10th, 2025
These sessions are not focused on teaching Chapel itself, but rather on providing an opportunity to introduce the audience – academic researchers from diverse fields and our HPC users – to Chapel while covering other topics.
Generating a large ensemble of solutions for model training
The goal of this 4-hour ML course is to train a generative AI model on an ensemble of solutions from a 2D advection solver, and then use the model to predict solutions based on initial conditions. I am using a Chapel code to generate the solutions for model training.
Consider the following acoustic wave equation with \(c=1\) (speed of sound), written as a first-order system for velocity and pressure: \[ \begin{cases} \partial\vec{v}/\partial t = -\nabla p \\ \partial p/\partial t = -\nabla\cdot\vec{v} \end{cases} \]
Writing a full 3D solver is straightforward, but to reduce the model training time, we will implement it in 2D.
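On the \(n \times n\) interior mesh with grid spacing \(h\) and time step \(\Delta t\), each step applies explicit finite-difference updates (a compact sketch of the scheme used in the code that follows, with all three fields stored at the same grid points): \[ \begin{aligned} v_x^{i,j} &\leftarrow v_x^{i,j} - \frac{\Delta t}{h}\left(p^{i,j}-p^{i-1,j}\right), \\ v_y^{i,j} &\leftarrow v_y^{i,j} - \frac{\Delta t}{h}\left(p^{i,j}-p^{i,j-1}\right), \\ p^{i,j} &\leftarrow p^{i,j} - \frac{\Delta t}{h}\left(v_x^{i+1,j}-v_x^{i,j}+v_y^{i,j+1}-v_y^{i,j}\right), \end{aligned} \] followed by periodic boundary updates in the ghost cells.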
Here is a multi-threaded implementation of this solver in Chapel, acoustic2D.chpl:

use Image, Math, IO, sciplot, Time;

config const n = 501, nt = 350, nout = 5; // resolution, max time steps, plotting frequency
config const ANIMATION = true;

var h = 1.0 / (n-1);
const mesh = {1..n, 1..n};
const largerMesh = {0..n+1, 0..n+1};
const a = 0.1; // 0.1 - thick front, no post-wave oscillations; 0.5 - narrow front, large oscillations
var Vx: [largerMesh] real;
var Vy: [largerMesh] real;
var P: [largerMesh] real = [(i,j) in largerMesh] exp(-(a/h)**2*(((j-1)*h-0.5)**2));
var colour: [1..n, 1..n] 3*int;
var cmap = readColourmap('viridis.csv'); // cmap.domain is {1..256, 1..3}

if ANIMATION then plotPressure(0);

var dt = h / 1.6;
var watch: stopwatch;
watch.start();
for count in 1..nt {
  forall (i,j) in mesh {
    Vx[i,j] -= dt * (P[i,j]-P[i-1,j]) / h;
    Vy[i,j] -= dt * (P[i,j]-P[i,j-1]) / h;
  }
  forall (i,j) in mesh do
    P[i,j] -= dt * (Vx[i+1,j] - Vx[i,j] + Vy[i,j+1] - Vy[i,j]) / h;
  periodic(P);
  periodic(Vx);
  periodic(Vy);
  if count%nout == 0 && ANIMATION then plotPressure(count/nout);
}
watch.stop();
writeln('Simulation took ', watch.elapsed(), ' seconds');

proc periodic(A) {
  A[0,1..n] = A[n,1..n]; A[n+1,1..n] = A[1,1..n]; A[1..n,0] = A[1..n,n]; A[1..n,n+1] = A[1..n,1];
}

proc plotPressure(counter) {
  var smin = min reduce(P);
  var smax = max reduce(P);
  for i in 1..n {
    for j in 1..n {
      var idx = ((P[i,j]-smin)/(smax-smin)*255):int + 1; // scale to 1..256
      colour[i,j] = ((cmap[idx,1]*255):int, (cmap[idx,2]*255):int, (cmap[idx,3]*255):int);
    }
  }
  var pixels = colorToPixel(colour); // array of pixels
  var filename: string;
  if counter >= 1000 then filename = counter:string;
  else if counter >= 100 then filename = "0"+counter:string;
  else if counter >= 10 then filename = "00"+counter:string;
  else filename = "000"+counter:string;
  writeln("writing ", filename);
  writeImage("frame"+filename+".png", imageType.png, pixels);
}
It uses the following sciplot.chpl module, which reads a colour map from a CSV file:
use IO;
use List;

proc readColourmap(filename: string) {
  var reader = open(filename, ioMode.r).reader();
  var line: string;
  if (!reader.readLine(line)) then // skip the header
    halt("ERROR: file appears to be empty");
  var dataRows: list(string); // a list of lines from the file
  while (reader.readLine(line)) do // read all lines into the list
    dataRows.pushBack(line);
  var cmap: [1..dataRows.size, 1..3] real;
  for (i, row) in zip(1..dataRows.size, dataRows) {
    var c1 = row.find(','):int; // position of the 1st comma in the line
    var c2 = row.rfind(','):int; // position of the 2nd comma in the line
    cmap[i,1] = row[0..c1-1]:real;
    cmap[i,2] = row[c1+1..c2-1]:real;
    cmap[i,3] = row[c2+1..]:real;
  }
  reader.close();
  return cmap;
}
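For reference, readColourmap expects a header line followed by one row of three comma-separated values per colour map entry, with r, g, b in [0, 1] (plotPressure later scales them by 255). The values below are illustrative, not copied from the actual viridis.csv:

r,g,b
0.267,0.005,0.329
0.269,0.010,0.335
...
0.993,0.906,0.144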
To run it:
chpl --fast acoustic2D.chpl
./acoustic2D --nt=5000
For a single point source:
For two point sources:
Some representative solutions
Here are solutions for point sources at the same fixed time:
Since advection proceeds in the direction of the pressure gradient, a 1D initial condition produces a flat 1D wave:
Here is a combination of 0D (point-source) and 1D features:
The plan is to feed thousands of these pairs of images – the initial conditions and the solution at the same fixed time – to a deep neural network, to train it to produce the solution from arbitrary initial conditions.
I could also demo a \(4001^2\) solution with 10,000 time steps (clip at ~/training/jax/large.mkv).
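One way to generate such an ensemble (a hypothetical sketch, not the actual course scripts; the config constants x0 and y0 do not exist in acoustic2D.chpl above) is to expose the source position on the command line and run the solver many times with different values:

use Math;
config const n = 501;                 // grid resolution, as in acoustic2D.chpl
config const x0 = 0.5, y0 = 0.5;      // point-source position in [0,1] x [0,1] (hypothetical)
const h = 1.0 / (n-1), a = 0.1;
const largerMesh = {0..n+1, 0..n+1};
// 2D Gaussian point source centred at (x0, y0), replacing the fixed initial condition above
var P: [largerMesh] real = [(i,j) in largerMesh]
  exp(-(a/h)**2 * (((i-1)*h - x0)**2 + ((j-1)*h - y0)**2));
writeln("initialized a point source at (", x0, ", ", y0, ") on a ", n, "x", n, " grid");

Each run, e.g. ./acoustic2D --x0=0.3 --y0=0.7, would then save the initial field and the frame at the fixed output time as one training pair.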
Benchmarking Chapel vs. Julia
Out of curiosity, I implemented exactly the same solver in Julia using ParallelStencil.jl, a state-of-the-art Julia package that can run in parallel on either multi-core CPUs (on top of Base.Threads) or GPUs (on top of CUDA.jl, AMDGPU.jl, or Metal.jl). I ran both codes on the same machine (an M1 MacBook Pro) for the same problem size (\(501^2\), 5000 time steps), with animation turned off in both cases.
| code | 1 CPU thread | 8 CPU threads | Metal |
|---|---|---|---|
| Chapel | 1.716s | 0.471s | |
| Julia | 7.29s | 2.60s | 23.59s (problem too small) |
For the record, in Julia you do see a speedup on the M1 GPU in 3D at \(512^3\): with 9000 time steps, going from 8 CPU threads to Metal reduces the runtime from 39m to 11m40s.
GPU efficiency on NVIDIA’s H100 cards
These GPUs can be challenging to utilize effectively: most off-the-shelf software packages achieve well below 75% utilization on a full GPU. I plan to use a simple Chapel code to demonstrate profiling steps. The challenge is that my current code is too efficient, maintaining 100% GPU usage. I’ve manually degraded its efficiency for demonstration purposes, but I’d prefer to start with a less efficient version and then optimize it, so perhaps I need to consider a different numerical problem.
Prime factorization of an integer number is finding all its prime factors. For example, the prime factors of 60 are 2, 2, 3, and 5. Let’s write a function that takes an integer number and returns the sum of all its prime factors. For example, for 60 it will return 2+2+3+5 = 12, for 91 it will return 7+13 = 20, for 101 it will return 101, and so on.
Since 1 has no prime factors, we start computing from 2 and then apply this function to all integers in the range 2..n, where n is a large number. We do all computations separately from scratch for each number, i.e. we do not cache our results (caching can significantly speed up the calculations, but the point here is to focus on brute-force computing).
Here is the GPU version of the code, primesGPU.chpl (the CPU version is very similar, with obvious simplifications):
use Time;
var watch: stopwatch;

proc primeFactorizationSum(n: int) {
  var num = n, total = 0, count = 0;
  while num % 2 == 0 {
    num /= 2;
    count += 1;
  }
  for j in 1..count do total += 2;
  for i in 3..(sqrt(n:real)):int by 2 {
    count = 0;
    while num % i == 0 {
      num /= i;
      count += 1;
    }
    for j in 1..count do total += i;
  }
  if num > 2 then total += num;
  return total;
}

config const a = 2, n = 10;

on here.gpus[0] {
  var A: [a..n] int;
  watch.start();
  @assertOnGpu foreach i in a..n do
    A[i] = primeFactorizationSum(i);
  watch.stop(); writeln('It took ', watch.elapsed(), ' seconds');
  var lastFewDigits = if n > 5 then n-4..n else a..n; // last 5 or fewer digits
  writeln("A = ", A[lastFewDigits]);
}
Running it on a full H100 card on the cluster:
module load gcc/12.3 cuda/12.2 chapel-ucx-cuda/2.4.0
chpl --fast primesGPU.chpl
salloc --time=... --mem-per-cpu=3600 --gpus-per-node=h100:1 --account=...
./primesGPU -nl 1 --n=1_000_000
| n | single-core CPU time (sec) | GPU time (sec) | speedup factor | GPU memory (MiB) |
|---|---|---|---|---|
| 1e6 | 0.5330 | 0.000717 | 743 | |
| 10e6 | 16.49 | 0.017904 | 921 | |
| 100e6 | 516.8 | 0.5456 | 947 | |
| 1e9 | | 16.91 | | |
| 2e9 | | 48.10 | | |
| 3e9 | | 88.31 | | |
| 5e9 | | 270.8 | | 38670 |
| 10e9 | | >1375 | | 76816 |
Running nvidia-smi on the GPU node, we see GPU utilization at 100% and power consumption of 263W / 316W / 318W / 267W (at different stages of the same job) out of max 700W.
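To watch these numbers continuously during a run, one possible invocation (the query fields here are just one reasonable choice) is:
nvidia-smi --query-gpu=utilization.gpu,power.draw,memory.used --format=csv -l 2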
Moving some of the calculations to the host
Divide the entire range 2..n into 100 groups. Next, divide each group into 10 items (each still a range), of which the first 9 are processed on the GPU and the last one serially on the CPU – a nice demo of Amdahl's Law.
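With one item in ten handled serially, Amdahl's Law caps the overall speedup over a purely serial run at \(1/s\), no matter how fast the GPU is: \[ S(p) = \frac{1}{s + (1-s)/p} \;\le\; \frac{1}{s} = 10 \qquad (s = 0.1), \] so the GPU ends up waiting on the CPU for most of each group.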
use Time;
import RangeChunk.chunks;
var watch: stopwatch;

proc primeFactorizationSum(n: int) {
  var num = n, total = 0, count = 0;
  while num % 2 == 0 {
    num /= 2;
    count += 1;
  }
  for j in 1..count do total += 2;
  for i in 3..(sqrt(n:real)):int by 2 {
    count = 0;
    while num % i == 0 {
      num /= i;
      count += 1;
    }
    for j in 1..count do total += i;
  }
  if num > 2 then total += num;
  return total;
}

config const a = 2, n = 5_000_000_000;

on here.gpus[0] {
  var A: [a..n] int;
  watch.start();
  for group in chunks(a..n, 100) {
    for item in chunks(group, 10) {
      if item.last < group.last then
        @assertOnGpu foreach i in item do // first 9 items in each group are processed on the GPU
          A[i] = primeFactorizationSum(i);
      else
        for i in item do // last item in each group is processed on the CPU
          A[i] = primeFactorizationSum(i);
    }
  }
  watch.stop(); writeln('It took ', watch.elapsed(), ' seconds');
  var lastFewDigits = if n > 5 then n-4..n else a..n; // last 5 or fewer digits
  writeln("A = ", A[lastFewDigits]);
}
On a full H100, this runs at 10% GPU efficiency, consuming 112W out of max 700W.
In the actual workshop, in addition to nvidia-smi, I am hoping to use NVIDIA's Nsight profilers (if those work with Chapel-compiled binaries) and our cluster's monitoring portal (not yet in production). I may also combine this with MIG partitions and MPS scheduling.
How do I go in the opposite direction, starting from an inefficient code? I need a simple numerical problem for which a naive implementation results in a poorly performing GPU code that is simple to improve. Here are a couple of things to try:
1. the 3D version of the acoustic wave solver should run inefficiently for smaller problems on a GPU
2. the 2D acoustic wave solver for very large problems
3. a Julia set computation should suffer from load imbalance, but I am not sure if it will show up when running on a GPU (see the sketch after this list)
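For option 3, here is a minimal sketch of a GPU Julia set kernel in Chapel (hypothetical code I have not benchmarked; the resolution, iteration cap, and the parameter c are illustrative). The per-pixel while loop exits early outside the set and runs the full maxIter iterations inside it, which is where the load imbalance would come from:

use Time;
var watch: stopwatch;

config const n = 4096, maxIter = 10_000;
config const cRe = -0.8, cIm = 0.156;   // Julia set parameter c (illustrative value)

on here.gpus[0] {
  var iters: [1..n, 1..n] int;
  watch.start();
  @assertOnGpu foreach (i,j) in {1..n, 1..n} {
    // map pixel (i,j) to a point z in the square [-1.5, 1.5] x [-1.5, 1.5]
    var zRe = -1.5 + 3.0*(j-1)/(n-1), zIm = -1.5 + 3.0*(i-1)/(n-1);
    var k = 0;
    while zRe*zRe + zIm*zIm < 4.0 && k < maxIter {
      const t = zRe*zRe - zIm*zIm + cRe;
      zIm = 2.0*zRe*zIm + cIm;
      zRe = t;
      k += 1;
    }
    iters[i,j] = k;   // early exit outside the set, maxIter iterations inside it
  }
  watch.stop();
  writeln('Julia set took ', watch.elapsed(), ' seconds');
}

Neighbouring pixels can need very different iteration counts, so some GPU threads finish long before others in the same block; whether that actually hurts utilization on an H100 is what the exercise would measure.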