Parallelizing N-body

In this section I will describe a project that you can work on at home: the direct N-body solver.

Imagine that you place \(N\) identical particles randomly into a unit cube, with zero initial velocities. Then you turn on gravity, so that the particles start attracting each other. There are no boundary conditions: these are the only particles in the Universe, and they can fly to \(\infty\).

Question

What do you expect these particles will do?

We will adopt the following numerical method (summarized in the formulas right after this list, and implemented in full in the serial code below):

- the gravitational force on each particle is computed by direct summation over all other particles, an \(O(N^2)\) double loop, with a softening length added to the pairwise separations to avoid singularities during close encounters;
- positions and velocities are advanced with a second-order, Verlet-type update;
- all particles share a single adaptive timestep, set from the shortest pairwise approach time (separation over relative speed), scaled by a Courant-like factor and capped at \(10^{-3}\).
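
Reading this off the serial code below (with \(G = m = 1\); the notation is mine, not the code's), the acceleration of particle \(i\) and the shared timestep are

\[
\mathbf{a}_i = -\sum_{j \neq i} \frac{\mathbf{x}_i - \mathbf{x}_j}{\left(|\mathbf{x}_i - \mathbf{x}_j|^2 + \epsilon^2\right)^{3/2}},
\qquad
\Delta t = \min\!\left(C\,\min_{i \neq j} \sqrt{\frac{|\mathbf{x}_i - \mathbf{x}_j|^2 + \epsilon^2}{|\mathbf{v}_i - \mathbf{v}_j|^2}},\; 10^{-3}\right),
\]

where \(\epsilon\) is the softening length (softeningLength in the code) and \(C\) is the Courant-like factor (courant).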

In a real simulation, you would replace:

- the \(O(N^2)\) direct summation with a faster approximate force calculation, e.g. a tree code or a particle-mesh method;
- the single shared timestep with individual per-particle timesteps, and the crude softening with a more careful treatment of close encounters.

Expected solutions:

[animation clips showing the evolution of the particle system]

In these clips the time axis is not physical time but the time step number. Consequently, the animations slow down whenever any two particles come close to each other and the adaptive timestep shrinks.

Below you will find the serial code nbodySerial.jl. I removed all parts related to plotting the results, since plotting is slow in Julia and you would need to install the Plots package (which takes a while, with many dependencies!).
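
The script uses the ProgressMeter package to display a progress bar; if it is not already installed, adding it is a one-liner (standard Pkg usage, nothing specific to this course):

import Pkg; Pkg.add("ProgressMeter")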

using ProgressMeter

npart = 20               # number of particles
niter = Int(1e5)         # number of time steps
courant = 1e-3           # Courant-like factor for the adaptive timestep
softeningLength = 0.01   # gravitational softening length

x = rand(npart, 3);   # initial positions, uniform in the unit cube [0,1]^3
v = zeros(npart, 3);  # initial velocities are zero

soft = softeningLength^2;   # softening enters as epsilon^2 added to the squared separation

println("Computing ...");
force = zeros(Float32, npart, 3);
oldforce = zeros(Float32, npart, 3);
@showprogress for iter = 1:niter
    tmin = 1.e10
    # force evaluation by direct summation over all pairs: O(npart^2) work per step
    for i = 1:npart
        force[i,:] .= 0.
        for j = 1:npart
            if i != j
                distSquared = sum((x[i,:] .- x[j,:]).^2) + soft;
                force[i,:] -= (x[i,:] .- x[j,:]) / distSquared^1.5;
                # shortest pairwise approach time sets the adaptive timestep
                tmin = min(tmin, sqrt(distSquared / sum((v[i,:] .- v[j,:]).^2)));
            end
        end
    end
    dt = min(tmin*courant, 0.001);   # limit the initial step (velocities start at zero)
    # advance positions and velocities with a second-order (Verlet-type) update
    for i = 1:npart
        x[i,:] .+= v[i,:] .* dt .+ 0.5 .* oldforce[i,:] .* dt^2;
        v[i,:] .+= 0.5 .* (oldforce[i,:] .+ force[i,:]) .* dt;
        oldforce[i,:] .= force[i,:];
    end
end

Try running this code with julia nbodySerial.jl; the main loop takes ~6 minutes on the training cluster. Obviously, the most CPU-intensive part is the force evaluation – this is what you want to accelerate.
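
If you want to confirm this before parallelizing, Julia's built-in Profile standard library is enough. A minimal sketch (not part of the exercise; reduce niter in the script first, e.g. to \(10^3\), so that the profiled run stays short):

using Profile
Profile.clear()
@profile include("nbodySerial.jl")   # run the (shortened) serial code under the sampling profiler
Profile.print(maxdepth = 12)         # most samples should fall inside the pairwise force loop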

There are many small arrays in the code – let's use SharedArrays and fill them in parallel. For example, you would replace

force = zeros(Float32, npart, 3);

with

force = SharedArray{Float32}(npart,3);
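
Note that SharedArray lives in the SharedArrays standard library; the usual pattern is to bring it into scope on every process with @everywhere (a sketch, assuming the workers were created with the -p flag):

using Distributed                  # loaded automatically with julia -p, shown here for completeness
@everywhere using SharedArrays     # make SharedArray available on the control process and on all workers

You would do the same kind of conversion for any other array (e.g. x and v) that the workers need to read or write.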

When updating shared arrays, you have a choice: either update array[localindices(array)] on each worker, or use a parallel for loop with a reduction. I suggest the latter. What do you want to reduce? Hint: what else do you compute, besides the force, inside that loop? For the syntax, check the parallelFor.jl code from an earlier section.
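
To give an idea of the shape of that loop (a sketch, not the full solution): it assumes x, v and force have already been turned into SharedArrays, moves the per-particle force calculation into a hypothetical helper forceAndTmin!, and lets the parallel loop reduce the per-particle approach times with (min):

using Distributed
@everywhere using SharedArrays

# hypothetical helper: accumulate the force on particle i and return its
# shortest approach time to any other particle (same arithmetic as the serial code)
@everywhere function forceAndTmin!(force, x, v, soft, i, npart)
    force[i,:] .= 0.
    tmin = 1.e10
    for j = 1:npart
        if i != j
            distSquared = sum((x[i,:] .- x[j,:]).^2) + soft
            force[i,:] -= (x[i,:] .- x[j,:]) / distSquared^1.5
            tmin = min(tmin, sqrt(distSquared / sum((v[i,:] .- v[j,:]).^2)))
        end
    end
    return tmin
end

# inside the main time loop: the iterations are split across the workers, each
# worker writes its rows of the shared force array, and the last expression of
# every iteration (the per-particle tmin) is reduced with min
tmin = @distributed (min) for i = 1:npart
    forceAndTmin!(force, x, v, soft, i, npart)
end

The force itself is communicated through the shared array, so the only quantity that needs to go through the reduction is tmin.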

Results

With the default 20 particles and \(10^5\) steps the code runs slower in parallel on the training cluster:

Code                                                 Time
julia nbodySerial.jl (serial runtime)                340s
julia -p 1 nbodyDistributedShared.jl (2 processes)   358s

This is the same problem we discussed in the Chapel course: with a fine-grained parallel code, the communication overhead dominates the runtime. As we increase the problem size, we should see the benefit of parallelization, e.g. with 1000 particles and 10 steps:

Code                                                 Time
julia nbodySerial.jl (serial runtime)                83.7s
julia -p 1 nbodyDistributedShared.jl (2 processes)   47.9s

Here is what I got on Cedar with 100 particles and \(10^3\) steps:

Run        Time
serial     1m23s
2 cores    46s
4 cores    29s
8 cores    22s
16 cores   18s
32 cores   19s
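
For reference, these timings follow the same counting as the tables above (julia -p 1 means two processes in total), so e.g. the 8-core run would be launched roughly as

julia -p 7 nbodyDistributedShared.jl    # 1 control process + 7 workers = 8 processes

inside a job that has requested the matching number of cores from the scheduler.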