Parallelizing N-body

In this section I will describe a project that you can work on at home: the direct N-body solver.

Imagine that you place \(N\) identical particles randomly into a unit cube, with zero initial velocities. Then you turn on gravity, so that the particles start attracting each other. There are no boundary conditions: these are the only particles in the Universe, and they can fly to \(\infty\).

Question

What do you expect these particles will do?

We will adopt the following numerical method (summarized in the formulas right after this list, and implemented in full in the serial code below):

- the gravitational force on each particle is computed by direct summation over all other particles, an \(O(N^2)\) double loop, with a softening length added to the pairwise separations to avoid singularities during close encounters;
- positions and velocities are advanced with a second-order, Verlet-type update;
- all particles share a single adaptive timestep, set from the shortest pairwise approach time (separation over relative speed), scaled by a Courant-like factor and capped at \(10^{-3}\).
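
Reading this off the serial code below (with \(G = m = 1\); the notation is mine, not the code's), the acceleration of particle \(i\) and the shared timestep are

\[
\mathbf{a}_i = -\sum_{j \neq i} \frac{\mathbf{x}_i - \mathbf{x}_j}{\left(|\mathbf{x}_i - \mathbf{x}_j|^2 + \epsilon^2\right)^{3/2}},
\qquad
\Delta t = \min\!\left(C\,\min_{i \neq j} \sqrt{\frac{|\mathbf{x}_i - \mathbf{x}_j|^2 + \epsilon^2}{|\mathbf{v}_i - \mathbf{v}_j|^2}},\; 10^{-3}\right),
\]

where \(\epsilon\) is the softening length (softeningLength in the code) and \(C\) is the Courant-like factor (courant).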

In a real simulation, you would replace:

- the \(O(N^2)\) direct summation with a faster approximate force calculation, e.g. a tree code or a particle-mesh method;
- the single shared timestep with individual per-particle timesteps, and the crude softening with a more careful treatment of close encounters.

Expected solutions:

[animation clips showing the evolution of the particle system]

In these clips the time axis is not physical time but the time step number. Consequently, the animations slow down whenever any two particles come close to each other and the adaptive timestep shrinks.

Below you will find the serial code nbodySerial.jl. I removed all parts related to plotting the results, since plotting is slow in Julia and you would need to install the Plots package (which takes a while, with many dependencies!).
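
The script uses the ProgressMeter package to display a progress bar; if it is not already installed, adding it is a one-liner (standard Pkg usage, nothing specific to this course):

import Pkg; Pkg.add("ProgressMeter")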

using ProgressMeter

npart = 20               # number of particles
niter = Int(1e5)         # number of time steps
courant = 1e-3           # Courant-like factor for the adaptive timestep
softeningLength = 0.01   # gravitational softening length

x = rand(npart, 3);   # initial positions, uniform in the unit cube [0,1]^3
v = zeros(npart, 3);  # initial velocities are zero

soft = softeningLength^2;   # softening enters as epsilon^2 added to the squared separation

println("Computing ...");
force = zeros(Float32, npart, 3);
oldforce = zeros(Float32, npart, 3);
@showprogress for iter = 1:niter
    tmin = 1.e10
    # force evaluation by direct summation over all pairs: O(npart^2) work per step
    for i = 1:npart
        force[i,:] .= 0.
        for j = 1:npart
            if i != j
                distSquared = sum((x[i,:] .- x[j,:]).^2) + soft;
                force[i,:] -= (x[i,:] .- x[j,:]) / distSquared^1.5;
                # shortest pairwise approach time sets the adaptive timestep
                tmin = min(tmin, sqrt(distSquared / sum((v[i,:] .- v[j,:]).^2)));
            end
        end
    end
    dt = min(tmin*courant, 0.001);   # limit the initial step (velocities start at zero)
    # advance positions and velocities with a second-order (Verlet-type) update
    for i = 1:npart
        x[i,:] .+= v[i,:] .* dt .+ 0.5 .* oldforce[i,:] .* dt^2;
        v[i,:] .+= 0.5 .* (oldforce[i,:] .+ force[i,:]) .* dt;
        oldforce[i,:] .= force[i,:];
    end
end

Try running this code with julia nbodySerial.jl; the main loop takes ~6 minutes on the training cluster. Obviously, the most CPU-intensive part is the force evaluation – this is what you want to accelerate.
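
If you want to confirm this before parallelizing, Julia's built-in Profile standard library is enough. A minimal sketch (not part of the exercise; reduce niter in the script first, e.g. to \(10^3\), so that the profiled run stays short):

using Profile
Profile.clear()
@profile include("nbodySerial.jl")   # run the (shortened) serial code under the sampling profiler
Profile.print(maxdepth = 12)         # most samples should fall inside the pairwise force loop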

There are many small arrays in the code – let's use SharedArrays and fill them in parallel. For example, you would replace

force = zeros(Float32, npart, 3);

with

force = SharedArray{Float32}(npart,3);
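
Note that SharedArray lives in the SharedArrays standard library; the usual pattern is to bring it into scope on every process with @everywhere (a sketch, assuming the workers were created with the -p flag):

using Distributed                  # loaded automatically with julia -p, shown here for completeness
@everywhere using SharedArrays     # make SharedArray available on the control process and on all workers

You would do the same kind of conversion for any other array (e.g. x and v) that the workers need to read or write.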

When updating shared arrays, you have a choice: either update array[localindices(array)] on each worker, or use a parallel for loop with a reduction. I suggest the latter. What do you want to reduce? Hint: what else do you compute, besides the force, inside that loop? For the syntax, check the parallelFor.jl code from an earlier section.
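
To give an idea of the shape of that loop (a sketch, not the full solution): it assumes x, v and force have already been turned into SharedArrays, moves the per-particle force calculation into a hypothetical helper forceAndTmin!, and lets the parallel loop reduce the per-particle approach times with (min):

using Distributed
@everywhere using SharedArrays

# hypothetical helper: accumulate the force on particle i and return its
# shortest approach time to any other particle (same arithmetic as the serial code)
@everywhere function forceAndTmin!(force, x, v, soft, i, npart)
    force[i,:] .= 0.
    tmin = 1.e10
    for j = 1:npart
        if i != j
            distSquared = sum((x[i,:] .- x[j,:]).^2) + soft
            force[i,:] -= (x[i,:] .- x[j,:]) / distSquared^1.5
            tmin = min(tmin, sqrt(distSquared / sum((v[i,:] .- v[j,:]).^2)))
        end
    end
    return tmin
end

# inside the main time loop: the iterations are split across the workers, each
# worker writes its rows of the shared force array, and the last expression of
# every iteration (the per-particle tmin) is reduced with min
tmin = @distributed (min) for i = 1:npart
    forceAndTmin!(force, x, v, soft, i, npart)
end

The force itself is communicated through the shared array, so the only quantity that needs to go through the reduction is tmin.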

Results

With the default 20 particles and \(10^5\) steps the code runs slower in parallel on the training cluster:

Code                                                 Time
julia nbodySerial.jl (serial runtime)                340s
julia -p 1 nbodyDistributedShared.jl (2 processes)   358s

This is the same problem we discussed in the Chapel course: with a fine-grained parallel code, the communication overhead dominates the runtime. As we increase the problem size, we should see the benefit of parallelization, e.g. with 1000 particles and 10 steps:

Code                                                 Time
julia nbodySerial.jl (serial runtime)                83.7s
julia -p 1 nbodyDistributedShared.jl (2 processes)   47.9s

Here is what I got on Cedar with 100 particles and \(10^3\) steps:

Run        Time
serial     1m23s
2 cores    46s
4 cores    29s
8 cores    22s
16 cores   18s
32 cores   19s
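
For reference, these timings follow the same counting as the tables above (julia -p 1 means two processes in total), so e.g. the 8-core run would be launched roughly as

julia -p 7 nbodyDistributedShared.jl    # 1 control process + 7 workers = 8 processes

inside a job that has requested the matching number of cores from the scheduler.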