Julia set on a single locale

Recall the serial code juliaSetSerial.chpl (without output):

use Time;

config const c = 0.355 + 0.355i;

proc pixel(z0) {
  var z = z0*1.2;   // zoom out
  for i in 1..255 {
    z = z*z + c;
    if abs(z) >= 4 then
      return i;
  }
  return 255;
}

config const n = 2_000;   // vertical and horizontal size of our image
var y: real;
var point: complex;
var watch: stopwatch;

writeln("Computing ", n, "x", n, " Julia set ...");
var stability: [1..n,1..n] int;
watch.start();
for i in 1..n {
  y = 2*(i-0.5)/n - 1;
  for j in 1..n {
    point = 2*(j-0.5)/n - 1 + y*1i;   // rescale to -1:1 in the complex plane
    stability[i,j] = pixel(point);
  }
}
watch.stop();
writeln('It took ', watch.elapsed(), ' seconds');

Now let’s parallelize this code with forall in shared memory (single locale). Copy juliaSetSerial.chpl into juliaSetParallel.chpl and start modifying it:

  1. For the outer loop, replace for with forall. This will produce an error about the scope of variables y and point:
error: cannot assign to const variable
note: The shadow variable 'y' is constant due to task intents in this loop
error: cannot assign to const variable
note: The shadow variable 'point' is constant due to task intents in this loop
Discussion

Why do you think this message was produced? How do we solve this problem? Hint: each task (thread) needs its own separate copy of these two variables.

  2. What do we do next?
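The idea of per-task variables can be tried out on a toy loop first. This is a minimal sketch, not the Julia set solution itself: the names `m`, `squares`, `cubes` and `t` are made up for illustration. It shows two ways to give each task its own temporary: declaring it inside the loop body, or declaring a task-private variable in a `with` clause.

```chapel
config const m = 8;          // hypothetical problem size for this illustration
var squares: [1..m] int;
var cubes: [1..m] int;

// Option A: declare the temporary inside the loop body,
// so every iteration gets its own fresh variable
forall i in 1..m {
  var t = i * i;
  squares[i] = t;
}

// Option B: declare a task-private variable in the 'with' clause,
// so every task created by the forall gets its own copy of t
forall i in 1..m with (var t: int) {
  t = i * i * i;
  cubes[i] = t;
}

writeln(squares);
writeln(cubes);
```

Writing to `squares[i]` and `cubes[i]` is safe here because each iteration touches a different array element; only the scalar temporary needs to be private.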

Compile and run the code on several CPU cores on 1 node:

#!/bin/bash
# this is shared.sh
#SBATCH --time=0:5:0         # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --mem-per-cpu=3600   # in MB
#SBATCH --cpus-per-task=4
#SBATCH --output=solution.out
./juliaSetParallel

module load chapel-multicore/2.4.0
chpl --fast juliaSetParallel.chpl
sbatch shared.sh
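Before timing anything, it can help to check how much parallelism Chapel actually sees inside the job. The short program below is a sketch for that purpose; whether the reported numbers match `--cpus-per-task` depends on how the cluster confines jobs to cores, which is an assumption about the local Slurm setup rather than something guaranteed.

```chapel
// Report the parallel resources visible to Chapel on the current locale
writeln("locale name: ", here.name);
writeln("processing units: ", here.numPUs());
writeln("max tasks running in parallel: ", here.maxTaskPar);
```

Compile it with `chpl` and run it from a job script like shared.sh, in place of `./juliaSetParallel`.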

Once you have the working shared-memory parallel code, study its performance.

Here are my timings on the training cluster:

ncores                    1      2      4      8
wallclock runtime (sec)   1.181  0.568  0.307  0.197

Discussion

Why do you think the speedup is not linear in the number of cores (only ~6X on 8 cores)?
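To put numbers on this, the timings above can be converted into speedup (the 1-core runtime divided by the n-core runtime) and parallel efficiency (speedup divided by n). The sketch below simply hardcodes the four measurements from the table; the variable names are illustrative.

```chapel
// Measured wallclock runtimes from the table above
const ncores  = [1, 2, 4, 8];
const runtime = [1.181, 0.568, 0.307, 0.197];   // seconds
const t1 = runtime[0];                          // 1-core reference time

for (p, t) in zip(ncores, runtime) {
  const speedup = t1 / t;             // ideal value: p
  const efficiency = speedup / p;     // ideal value: 1.0
  writeln(p, " cores: speedup ", speedup, ", efficiency ", efficiency);
}
```

For these timings the speedup on 8 cores comes out at about 6.0, i.e. an efficiency of roughly 0.75, while 2 and 4 cores stay close to ideal.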