# Parallel Julia set

## Julia set on distributed domains
Copy `juliaSetParallel.chpl` into `juliaSetDistributed.chpl` and start modifying it:

- Load `BlockDist` by adding `use BlockDist;` at the top of the file.
- Replace

```chpl
var stability: [1..n,1..n] int;
```

with

```chpl
const mesh: domain(2) = {1..n, 1..n};
const distributedMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = mesh;
var stability: [distributedMesh] int;
```
- Look at the loop variables: currently we have

```chpl
forall i in 1..n {
  var y = 2*(i-0.5)/n - 1;
  for j in 1..n {
    var point = 2*(j-0.5)/n - 1 + y*1i; // rescale to -1:1 in the complex plane
    stability[i,j] = pixel(point);
  }
}
```
In the previous, shared-memory version of the code this fragment gave you a parallel loop running on multiple cores of the same node. If you run this loop now, it will run entirely on the first node (locale 0)!

In the distributed version of the code you want to loop in parallel over all elements of the distributed mesh `distributedMesh` (or, equivalently, over all elements of the distributed array `stability`). This sends the computation for each block to the locale holding that block (the ownership sketch below demonstrates this):
```chpl
forall (i,j) in distributedMesh {
  var y = 2*(i-0.5)/n - 1;
  var point = 2*(j-0.5)/n - 1 + y*1i; // rescale to -1:1 in the complex plane
  stability[i,j] = pixel(point);
}
```
or, equivalently:
```chpl
forall (i,j) in stability.domain {
  var y = 2*(i-0.5)/n - 1;
  var point = 2*(j-0.5)/n - 1 + y*1i; // rescale to -1:1 in the complex plane
  stability[i,j] = pixel(point);
}
```
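To convince yourself that the iterations really run where the data live, you can fill a distributed array with the ID of the locale that computes each element. This is a minimal standalone sketch (the array name `owner` and the size `n = 8` are just for illustration):

```chpl
use BlockDist;

config const n = 8;
const mesh: domain(2) = {1..n, 1..n};
const distributedMesh: domain(2) dmapped new blockDist(boundingBox=mesh) = mesh;

var owner: [distributedMesh] int;

// each iteration executes on the locale that owns element (i,j),
// so `here.id` records that locale's ID
forall (i,j) in distributedMesh do
  owner[i,j] = here.id;

writeln(owner);
```

Running it on 4 locales should print a 2D array whose blocks carry the IDs 0 through 3 — the block distribution made visible.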
Compile and run a larger problem (\(8000^2\)) across several nodes:
```sh
#!/bin/bash
# this is distributed.sh
#SBATCH --time=0:5:0        # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --nodes=4
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3600  # in MB
#SBATCH --output=solution.out
echo Running on $SLURM_NNODES nodes
./juliaSetDistributed --n=8000 -nl $SLURM_NNODES
```

Set up the multi-locale Chapel environment, compile, and submit the job:

```sh
source /project/def-sponsor00/shared/syncHPC/startMultiLocale.sh
chpl --fast juliaSetDistributed.chpl
sbatch distributed.sh
```

Here are my timings on the training cluster (even over a slow interconnect!):
| `--nodes` | 1 | 2 | 4 | 4 |
|---|---|---|---|---|
| `--cpus-per-task` | 1 | 1 | 1 | 8 |
| wallclock runtime (sec) | 36.56 | 17.91 | 9.51 | 0.985 |
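For reference, one way to produce such timings is to wrap the compute loop in a `stopwatch` from Chapel's Time module — a sketch assuming the declarations from earlier in this section (this may not be exactly how the numbers above were measured):

```chpl
use Time;

var watch: stopwatch;
watch.start();
forall (i,j) in distributedMesh {
  var y = 2*(i-0.5)/n - 1;
  var point = 2*(j-0.5)/n - 1 + y*1i; // rescale to -1:1 in the complex plane
  stability[i,j] = pixel(point);
}
watch.stop();
writeln('It took ', watch.elapsed(), ' seconds');
```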
They don’t call it “embarrassingly parallel” for nothing! There is some overhead at the start and at the end of computing each block, but this overhead is much smaller than the computation itself, so we see almost perfect speedup.
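As a quick check: going from one to four single-core nodes gives a speedup of \(36.56 / 9.51 \approx 3.84\) out of an ideal 4, i.e. about 96% parallel efficiency.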
Here we have an example of a hybrid parallel code, utilizing multiple processes (one per locale) and multiple threads on each locale when available.
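To see this process-plus-threads structure at runtime, you can ask each locale what it can do — a minimal sketch using Chapel's built-in `Locales` array:

```chpl
// run one `on` block per locale; each locale reports its hostname
// and how many tasks it can execute in parallel
for loc in Locales do
  on loc do
    writeln('locale ', here.id, ' (', here.name, ') can run up to ',
            here.maxTaskPar, ' parallel tasks');
```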