Parallelizing the iterative additive Schwarz method
In this section we will implement the iterative additive Schwarz method (ASM) [1] in Julia, starting with a serial version. We will then parallelize it with DistributedArrays.jl.
We will be solving the 1D Poisson problem
\[-\frac{d^2u}{dx^2}=f(x) \quad\textrm{on}\quad 0\le x\le 1, \qquad u(0)=u(1)=0.\]
Discretizing this equation on a uniform grid of \(m\) points with spacing \(h=1/(m-1)\), we have
\[-u_{i-1}+2u_i-u_{i+1}=h^2f_i, \quad i=2,\dots,m-1, \qquad u_1=u_m=0,\]
or in matrix notation \(AU=F\), where
\[A=\begin{pmatrix}1&&&&\\-1&2&-1&&\\&\ddots&\ddots&\ddots&\\&&-1&2&-1\\&&&&1\end{pmatrix},\qquad
U=\begin{pmatrix}u_1\\u_2\\\vdots\\u_{m-1}\\u_m\end{pmatrix},\qquad
F=\begin{pmatrix}0\\h^2f_2\\\vdots\\h^2f_{m-1}\\0\end{pmatrix}.\]
Let’s break our grid into two domains \(\Omega=\Omega_1\bigcup\Omega_2\), where we are looking for the solution \(U_1\) on \(\Omega_1\) (the first \(m_s\) points) and \(U_2\) on \(\Omega_2\) (the remaining \(m-m_s\) points).
In matrix notation the solution on each subdomain can be written as \(U_i=R_iU\), where the restriction operator \(R_1\) is an \(m_s\times m\) matrix consisting of two parts of sizes \(m_s\times m_s\) and \(m_s\times(m-m_s)\), and \(R_2\) is an \((m-m_s)\times m\) matrix consisting of two parts of sizes \((m-m_s)\times m_s\) and \((m-m_s)\times(m-m_s)\), respectively:
\[R_1=\begin{pmatrix}I_{m_s}&0\end{pmatrix},\qquad R_2=\begin{pmatrix}0&I_{m-m_s}\end{pmatrix}.\]
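For example, with \(m=5\) and \(m_s=3\) these operators are simply
\[R_1=\begin{pmatrix}1&0&0&0&0\\0&1&0&0&0\\0&0&1&0&0\end{pmatrix},\qquad R_2=\begin{pmatrix}0&0&0&1&0\\0&0&0&0&1\end{pmatrix},\]
so that \(R_1U=(u_1,u_2,u_3)^T\) and \(R_2U=(u_4,u_5)^T\).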
The iterative additive Schwarz method (eq. 1.30 of [2]) lets you compute the next iteration of the solution as
\[U^{n+1}=U^n+M^{-1}_{\textrm{ASM}}\left(F-AU^n\right),\]
where the matrix
\[M^{-1}_{\textrm{ASM}}=\sum_i R_i^T\left(R_iAR_i^T\right)^{-1}R_i\]
is called the ASM preconditioner.
Serial additive Schwarz method with explicit matrices
Let’s generalize our computation to three 1D domains. For now we are writing serial code, so all arrays are local. Also, in this first version we will define all matrices explicitly, even though \(A\) is sparse and the \(R_i\) are Boolean matrices. All these shortcomings will be corrected in the parallel code.
For the matrix inversion \((R_iAR_i^T)^{-1}\) we will use Julia’s built-in inv() function, but for larger calculations you might want to replace this with your own (more efficient) code.
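For instance, here is a minimal sketch (not part of the code below, and assuming the same R1, A, F, U defined there) of how the explicit inverse for the first domain could be replaced by an LU factorization from the LinearAlgebra standard library, computed once and reused in every iteration:

using LinearAlgebra: lu

A1 = R1 * A * transpose(R1)     # local block for domain 1
lu1 = lu(A1)                    # factorize once, outside the iteration loop

# inside the iteration loop: apply the factorization instead of inv()
y = lu1 \ (R1 * (F - A*U))      # solves A1 * y = R1*(F - A*U)
dU1 = transpose(R1) * y         # domain-1 contribution to the update dU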
For the right-hand side \(F\), we’ll use a constant piece in the middle, mimicking a uniform bar with empty spaces on each end.
m = 21              # total number of points
N = 3               # number of domains
ms = zeros(Int, N)  # number of points in each domain
ms[1:2] = [7,7]     # number of points in the first two domains, respectively
ms[3] = m - sum(ms)
h = 1.0 / (m-1)     # grid spacing

U = zeros(Float64, m, 1);   # 0th iteration

F = zeros(Float64, m, 1);   # RHS
F[trunc(Int,m/2)-3:trunc(Int,m/2)+3] .= h^2 * 1.0;

A = zeros(Float64, m, m);
A[1,1], A[m,m] = 1, 1
for i in 2:m-1
    A[i,i-1] = -1
    A[i,i] = 2
    A[i,i+1] = -1
end

R1 = zeros(Int32, ms[1], m);
for j in 1:ms[1]
    R1[j,j] = 1
end
R2 = zeros(Int32, ms[2], m);
for j in 1:ms[2]
    R2[j,ms[1]+j] = 1
end
R3 = zeros(Int32, ms[3], m);
for j in 1:ms[3]
    R3[j,ms[1]+ms[2]+j] = 1
end

# compute three terms in the ASM preconditioner
M = transpose(R1) * inv(R1*A*transpose(R1)) * R1;
M += transpose(R2) * inv(R2*A*transpose(R2)) * R2;
M += transpose(R3) * inv(R3*A*transpose(R3)) * R3;

using LinearAlgebra: norm
for iter in 1:50
    dU = M * (F-A*U)
    global U += dU
    println(norm(dU))
end

exact = A \ F;   # Julia's left division, i.e. A^{-1}*F
using Plots
plot(1:m, exact, label = "exact")
plot!(1:m, U, label="approximate")   # overplot
savefig("serial.png")
Serial additive Schwarz method with structures
We will generalize our code to an arbitrary number of domains. Instead of using R1, R2, R3, we will define a structure domainType that will contain the restriction operator for each domain. It could contain other variables and matrices, but for now we don’t need any other elements in it.
As in the previous version of the code, here we are still storing all sparse and Boolean matrices as dense matrices, which is OK only for testing purposes (for real problems these matrices will be large).
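As an aside, here is a minimal sketch (not used in the code below) of how \(A\) could be stored as a sparse matrix with the SparseArrays standard library, using the same \(m\) as below:

using SparseArrays
Asp = spdiagm(-1 => fill(-1.0, m-1), 0 => fill(2.0, m), 1 => fill(-1.0, m-1))
Asp[1,1] = 1.0; Asp[1,2] = 0.0      # boundary row enforcing u_1 = 0
Asp[m,m] = 1.0; Asp[m,m-1] = 0.0    # boundary row enforcing u_m = 0
dropzeros!(Asp)                     # drop the explicitly stored zeros

With that aside, here is the code for this version: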
m = 21              # total number of points
N = 3               # number of domains
ms = zeros(Int, N)  # number of points in each domain
ms[1:2] = [7,7]     # number of points in the first two domains, respectively
ms[3] = m - sum(ms)
h = 1.0 / (m-1)     # grid spacing

U = zeros(Float64, m, 1);   # 0th iteration; will be a distributed 2D array

F = zeros(Float64, m, 1);   # RHS; will be a distributed 2D array
F[trunc(Int,m/2)-3:trunc(Int,m/2)+3] .= h^2 * 1.0;

A = zeros(Float64, m, m);
A[1,1], A[m,m] = 1, 1
for i in 2:m-1
    A[i,i-1] = -1
    A[i,i] = 2
    A[i,i+1] = -1
end

struct domainType
    R::Array{Int32}
end

domain = Array{domainType, 1}(undef, 3)   # 3-element 1D array of domains
M = zeros(Float64, m, m);

offset = [0, ms[1], ms[1]+ms[2]];
for i in 1:3
    domain[i] = domainType(zeros(Int32, ms[i], m))   # construct a new object of domainType
    for j in 1:ms[i]
        domain[i].R[j,offset[i]+j] = 1
    end
    global M += transpose(domain[i].R) * inv(domain[i].R*A*transpose(domain[i].R)) * domain[i].R;
end

using LinearAlgebra: norm
for iter in 1:50
    dU = M * (F-A*U)
    global U += dU
    println(norm(dU))
end

exact = A \ F;
using Plots
plot(1:m, exact, label = "exact")
plot!(1:m, U, label="approximate")
savefig("serial.png")
Parallel additive Schwarz method
Imagine that now we are dealing with a very large problem that we break into pieces, with each piece processed by one worker. Now \(U\) and \(F\) will be 1D distributed arrays split between workers, implemented with DistributedArrays.jl.
Ideally, we would like to partition domain into subdomains with DistributedArrays.jl and then, on each processor, use a combination of sparse and Boolean (coded explicitly with indices) matrices to solve the problem. However, at this point DistributedArrays.jl does not seem to support distributed arrays of structures.
At the heart of our implementation is the distributed preconditioner matrix pre. DistributedArrays.jl does not seem to support an uneven distribution of an array across processes, and therefore we are limited to having an equal number of points in all subdomains.
\(A\) is a sparse matrix. While Julia supports sparse matrices, we will instead code it algebraically with indices, so that we don’t have to worry about its distribution among processes.
\(R_i\) is Boolean. The matrix \(R_iAR_i^T\) applies the domain restriction operator \(R_i\) to both the rows and the columns of \(A\), and the result is coded algebraically with the function fillPre(). It is then inverted locally in each process with invertPre().
\(R_i^T\left(R_iAR_i^T\right)^{-1}R_i\) takes the result of this inversion and puts it as a dense block into the \(m\times m\) ASM preconditioner \(M^{-1}_\textrm{ASM}\). Each process computes its own dense block and stores it locally inside the distributed preconditioner matrix pre.
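Since our subdomains do not overlap, the assembled preconditioner is block-diagonal, with one dense block per domain:
\[M^{-1}_{\textrm{ASM}}=\begin{pmatrix}\left(R_1AR_1^T\right)^{-1}&&\\&\left(R_2AR_2^T\right)^{-1}&\\&&\left(R_3AR_3^T\right)^{-1}\end{pmatrix}.\]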
Next, we start iterations. In computeUpdate() we compute \(F-AU^n\) as a 1D distributed array tmp and multiply the preconditioner pre by tmp. Since pre is block-diagonal, this multiplication can be done separately in each process. Finally, we call addUpdate() to update the distributed solution U, again separately in each process.
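In other words, in each iteration worker \(i\) computes locally
\[dU_i=\left(R_iAR_i^T\right)^{-1}R_i\left(F-AU^n\right),\qquad U_i^{n+1}=U_i^n+dU_i,\]
and the only inter-process communication is reading the few neighbouring values of \(U^n\) needed to form the local part of \(F-AU^n\).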
Big assumptions:
1. All distributed arrays are partitioned in exactly the same way, i.e. the same range of indices is assigned to each worker for the 1D arrays U, tmp, dU, F.
2. The 2D array pre is partitioned purely along the second dimension (columns), i.e. each worker gets an \(m_s\times m_s\) array, and the range of column indices on each worker is exactly the same as for the 1D arrays above.
You can force (2) by replacing the line

pre = dzeros(Float64, ms, m);

with the following block:

d1 = @spawnat 2 zeros(Float64, ms, ms);
d2 = @spawnat 3 zeros(Float64, ms, ms);
d3 = @spawnat 4 zeros(Float64, ms, ms);
pre = DArray([d1 d2 d3]);
Here is the entire parallel code:
N = 3    # number of domains and processes
m = 21   # total number of points; must be a multiple of N
@assert m%N == 0 "m must be a multiple of N"
ms = round(Int, m/N)   # the size of each subdomain
h = 1.0 / (m-1)        # grid spacing

using Distributed
addprocs(N)
@everywhere using DistributedArrays

U = dzeros(Float64, m, 1);     # 0th iteration; a distributed 2D array
tmp = dzeros(Float64, m, 1);   # work area array
dU = dzeros(Float64, m, 1);    # update array

F = dzeros(Float64, m, 1);     # RHS; a distributed 2D array
@everywhere function fillF(data, m, h)
    rows = localindices(data)[1]
    for iGlobal in rows
        iLoc = iGlobal - rows.start + 1
        if iGlobal >= trunc(Int,m/2)-3 && iGlobal <= trunc(Int,m/2)+3
            data.localpart[iLoc] = h^2 * 1.0;
        end
    end
end
for w in workers()
    @spawnat w fillF(F, m, h)
end

pre = dzeros(Float64, ms, m);

@everywhere function fillPre(data, rank, ms, N)
    if rank == 1
        data.localpart[1,1] = 1
        for iLoc in 2:ms       # main diagonal
            data.localpart[iLoc,iLoc] = 2
        end
        for iLoc in 3:ms       # above main diagonal
            data.localpart[iLoc-1,iLoc] = -1
        end
        for iLoc in 1:ms-1     # below main diagonal
            data.localpart[iLoc+1,iLoc] = -1
        end
    end
    if rank > 1 && rank < N
        for iLoc in 1:ms       # main diagonal
            data.localpart[iLoc,iLoc] = 2
        end
        for iLoc in 2:ms       # above main diagonal
            data.localpart[iLoc-1,iLoc] = -1
        end
        for iLoc in 1:ms-1     # below main diagonal
            data.localpart[iLoc+1,iLoc] = -1
        end
    end
    if rank == N
        data.localpart[ms,ms] = 1
        for iLoc in 1:ms-1     # main diagonal
            data.localpart[iLoc,iLoc] = 2
        end
        for iLoc in 2:ms       # above main diagonal
            data.localpart[iLoc-1,iLoc] = -1
        end
        for iLoc in 1:ms-2     # below main diagonal
            data.localpart[iLoc+1,iLoc] = -1
        end
    end
end
for (rank,w) in enumerate(workers())
    @spawnat w fillPre(pre, rank, ms, N)
end

@everywhere function invertPre(data)
    data.localpart = inv(data.localpart)
    # println(data.localpart)
end
for w in workers()
    @spawnat w invertPre(pre)
end

@everywhere function computeUpdate(data, F, U, ms, rank, N, tmp, dU)
    # (1) compute tmp = (F-A*U)
    if rank == 1
        tmp.localpart[1] = F.localpart[1] - U[1]
        # for rank==1 we always have iGlobal = iLoc
        for iLoc in 2:ms
            tmp.localpart[iLoc] = F.localpart[iLoc] + U[iLoc-1] - 2*U[iLoc] + U[iLoc+1]   # last one has U[ms+1] => domains communicate
        end
    end
    if rank > 1 && rank < N
        iGlobal = (rank-1)*ms
        for iLoc in 1:ms
            iGlobal += 1
            tmp.localpart[iLoc] = F.localpart[iLoc] + U[iGlobal-1] - 2*U[iGlobal] + U[iGlobal+1]
        end
    end
    if rank == N
        iGlobal = (rank-1)*ms
        for iLoc in 1:ms-1
            iGlobal += 1
            tmp.localpart[iLoc] = F.localpart[iLoc] + U[iGlobal-1] - 2*U[iGlobal] + U[iGlobal+1]
        end
        tmp.localpart[ms] = F.localpart[ms] - U[rank*ms]
    end
    # (2) compute pre*tmp
    dU.localpart = data.localpart*tmp.localpart
    if rank == 1
        println(norm(dU.localpart))
    end
end
@everywhere function addUpdate(U, dU)
    U.localpart += dU.localpart
end

@everywhere using LinearAlgebra: norm
for iter in 1:50
    @sync for (rank,w) in enumerate(workers())
        @spawnat w computeUpdate(pre, F, U, ms, rank, N, tmp, dU)
    end
    @sync for w in workers()
        @spawnat w addUpdate(U, dU)
    end
end

using Plots
Ulocal = zeros(Float64, m, 1);
Ulocal .= U
plot(1:m, Ulocal, label="approximate")
savefig("parallel.png")
Copying into the local array Ulocal is necessary since the plot() function won’t take the distributed array U as an argument.
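When you are done, you may want to release the distributed resources; a minimal sketch, assuming d_closeall() is available in your version of DistributedArrays.jl:

d_closeall()         # assumed DistributedArrays call: free all distributed arrays
rmprocs(workers())   # from Distributed: shut down the worker processes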