# Parallel Julia set

## Julia set on distributed domains
Copy `juliaSetParallel.chpl` into `juliaSetDistributed.chpl` and start modifying it:

- Load `BlockDist` by adding `use BlockDist;` at the top of the file.
- Replace

```chpl
var stability: [1..n,1..n] int;
```

with

```chpl
const mesh: domain(2) = {1..n, 1..n};
const distributedMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = mesh;
var stability: [distributedMesh] int;
```
- Look at the loop variables: currently we have

```chpl
forall i in 1..n {
  var y = 2*(i-0.5)/n - 1;
  for j in 1..n {
    var point = 2*(j-0.5)/n - 1 + y*1i; // rescale to -1:1 in the complex plane
    stability[i,j] = pixel(point);
  }
}
```
In the previous, shared-memory version of the code this fragment gave you a parallel loop running on multiple cores of the same node. If you run this loop now, it will run entirely on the first node (locale 0)!

In the distributed version of the code you want to loop in parallel over all elements of the distributed mesh `distributedMesh` (or, equivalently, over all elements of the distributed array `stability`). This sends the computation for each block to the locale holding that block (the ownership sketch below demonstrates this):
```chpl
forall (i,j) in distributedMesh {
  var y = 2*(i-0.5)/n - 1;
  var point = 2*(j-0.5)/n - 1 + y*1i; // rescale to -1:1 in the complex plane
  stability[i,j] = pixel(point);
}
```
or, equivalently:
```chpl
forall (i,j) in stability.domain {
  var y = 2*(i-0.5)/n - 1;
  var point = 2*(j-0.5)/n - 1 + y*1i; // rescale to -1:1 in the complex plane
  stability[i,j] = pixel(point);
}
```
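To convince yourself that the iterations really run where the data live, you can fill a distributed array with the ID of the locale that computes each element. This is a minimal standalone sketch (the array name `owner` and the size `n = 8` are just for illustration):

```chpl
use BlockDist;

config const n = 8;
const mesh: domain(2) = {1..n, 1..n};
const distributedMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = mesh;

var owner: [distributedMesh] int;

// each iteration executes on the locale that owns element (i,j),
// so `here.id` records that locale's ID
forall (i,j) in distributedMesh do
  owner[i,j] = here.id;

writeln(owner);
```

Running it on 4 locales should print a 2D array whose blocks carry the IDs 0 through 3 — the block distribution made visible.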
Compile and run a larger problem (\(8000^2\)) across several nodes:
```sh
#!/bin/bash
# this is distributed.sh
#SBATCH --time=0:5:0        # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --nodes=4
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3600  # in MB
#SBATCH --output=solution.out
echo Running on $SLURM_NNODES nodes
./juliaSetDistributed --n=8000 -nl $SLURM_NNODES
```

Set up the multi-locale Chapel environment, compile, and submit the job:

```sh
source /project/def-sponsor00/shared/syncHPC/startMultiLocale.sh
chpl --fast juliaSetDistributed.chpl
sbatch distributed.sh
```

Here are my timings on the training cluster (even over a slow interconnect!):
| `--nodes` | 1 | 2 | 4 | 4 |
|---|---|---|---|---|
| `--cpus-per-task` | 1 | 1 | 1 | 8 |
| wallclock runtime (sec) | 36.56 | 17.91 | 9.51 | 0.985 |
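For reference, one way to produce such timings is to wrap the compute loop in a `stopwatch` from Chapel's Time module — a sketch assuming the declarations from earlier in this section (this may not be exactly how the numbers above were measured):

```chpl
use Time;

var watch: stopwatch;
watch.start();
forall (i,j) in distributedMesh {
  var y = 2*(i-0.5)/n - 1;
  var point = 2*(j-0.5)/n - 1 + y*1i; // rescale to -1:1 in the complex plane
  stability[i,j] = pixel(point);
}
watch.stop();
writeln('It took ', watch.elapsed(), ' seconds');
```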
They don’t call it “embarrassingly parallel” for nothing! There is some overhead at the start and at the end of computing each block, but this overhead is much smaller than the computation itself, so we see almost perfect speedup.
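As a quick check: going from one to four single-core nodes gives a speedup of \(36.56 / 9.51 \approx 3.84\) out of an ideal 4, i.e. about 96% parallel efficiency.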
Here we have an example of a hybrid parallel code, utilizing multiple processes (one per locale) and multiple threads on each locale when available.
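To see this process-plus-threads structure at runtime, you can ask each locale what it can do — a minimal sketch using Chapel's built-in `Locales` array:

```chpl
// run one `on` block per locale; each locale reports its hostname
// and how many tasks it can execute in parallel
for loc in Locales do
  on loc do
    writeln('locale ', here.id, ' (', here.name, ') can run up to ',
            here.maxTaskPar, ' parallel tasks');
```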