Tags: parallel-processing, multiprocessing, julia, matrix-inverse

Parallelise backslash matrix inversion using @distributed


I'm solving a PDE using an implicit scheme, which at every time step I can divide into two matrices that are then connected by a boundary condition (also applied at every time step). I'm trying to speed up the process by using multiprocessing to invert both matrices at the same time.

Here's what this looks like in a minimal (non-PDE-solving) example:

using Distributed
using LinearAlgebra

function backslash(N, T, b, exec)
    A = zeros(N,N)
    α = 0.1
    for i in 1:N, j in 1:N
        abs(i-j)<=1 && (A[i,j]+=-α)
        i==j && (A[i,j]+=3*α+1)
    end

    A = Tridiagonal(A)
    a = zeros(N, 4, T)

    if exec == "parallel"
        for i = 1:T
            @distributed for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    elseif exec == "single"
        for i = 1:T
            for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    end
    return a
end

b = rand(1000, 1000)

a_single = @time backslash(1000, 1000, b, "single");
a_parallel = @time backslash(1000, 1000, b, "parallel");

a_single == a_parallel

Here comes the problem: the last line evaluates to true, with a 6-fold speed-up, yet only a 2-fold speed-up should be possible. What am I getting wrong?


Solution

    1. You are measuring compile time
    2. Your @distributed loop returns immediately, without waiting for the workers to finish
    3. Your @distributed loop does not collect the results back to the master process

    Hence, after adding workers, you get:

    julia> addprocs(2);
    
    julia> sum(backslash(1000, 1000, b, "single")), sum(backslash(1000, 1000, b, "parallel"))
    (999810.3418359067, 0.0)
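
    (If no workers had been added, the loop body would have run asynchronously on the master process itself, which would explain why your comparison still returned true and why @time reported an impossible speed-up: it returned before the asynchronous work had finished.) To see points 2 and 3 in isolation, here is a minimal sketch (the variable names are illustrative): without a reducer, @distributed returns a Task immediately instead of blocking, and each worker mutates its own serialized copy of the captured array.

    using Distributed
    addprocs(2)

    x = zeros(4)
    t = @distributed for j = 1:4   # returns a Task right away (point 2)
        x[j] = j                   # each worker writes to its own copy of x (point 3)
    end
    wait(t)                        # even after the tasks have finished...
    x                              # ...x on the master process is still all zeros

    As for point 1, call each function once before timing it, or use @btime from BenchmarkTools.jl, so that compilation is excluded from the measurement.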
    

    So, in order to make your code work, you need to collect the results from the distributed loop, which can be done with a reducer:

    function backslash2(N, T, b, exec)
        A = zeros(N,N)
        α = 0.1
        for i in 1:N, j in 1:N
            abs(i-j)<=1 && (A[i,j]+=-α)
            i==j && (A[i,j]+=3*α+1)
        end
    
        A = Tridiagonal(A)
        a = zeros(N, 4, T)
    
        if exec == :parallel
            for i = 1:T
               
                aj = @distributed (append!) for j = 1:2
                    [A\b[:, i]]
                end
                # you could consider using SharedArrays instead
                a[:, 1, i] .= aj[1]
                a[:, 2, i] .= aj[2]
            end
        elseif exec == :single
            for i = 1:T
                for j = 1:2
                    a[:, j, i] = A\b[:, i]
                end
            end
        end
        return a
    end
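
    The (append!) reducer concatenates the value of each iteration's last expression, and the partial results are reduced in iteration order, so aj[1] holds the solution for j = 1 and aj[2] the one for j = 2. A tiny self-contained illustration of the same pattern:

    using Distributed
    addprocs(2)

    # each iteration returns a one-element vector; (append!) concatenates
    # them in iteration order, so aj == [10, 20]
    aj = @distributed (append!) for j = 1:2
        [10j]
    end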
    

    Now you have equal results:

    julia>  sum(backslash2(1000, 1000, b, :single)) == sum(backslash2(1000, 1000, b, :parallel))
    true
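
    As the comment in backslash2 suggests, a SharedArray would avoid the collection step entirely: its memory is visible to all workers on the same machine, so each worker can write its columns directly. A sketch of this variant (the name backslash_shared is illustrative, and the matrix is built directly in Tridiagonal form):

    using Distributed
    addprocs(2)
    @everywhere using LinearAlgebra, SharedArrays

    function backslash_shared(N, T, b)
        α = 0.1
        # same tridiagonal matrix as before: 2α+1 on the diagonal, -α off it
        A = Tridiagonal(fill(-α, N-1), fill(2α + 1, N), fill(-α, N-1))
        a = SharedArray{Float64}(N, 4, T)    # visible to all local workers
        for i = 1:T
            @sync @distributed for j = 1:2   # @sync blocks until the workers finish
                a[:, j, i] = A \ b[:, i]     # written directly into shared memory
            end
        end
        return Array(a)                      # convert back to a regular Array
    end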
    

    However, @distributed is very inefficient for loop bodies that take only a few milliseconds to execute: dispatching a distributed job itself costs a few milliseconds each time, and since you run the loop 1000 times, in this example the @distributed version will take many times longer than the serial one.

    Perhaps your production task takes longer per iteration, in which case the overhead can pay off. Otherwise, consider Threads.@threads instead, as sketched below.
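
    Threads share memory, so the results can be written straight into a with no collection step, and the dispatch overhead is far lower than for distributed jobs. A sketch of a threaded variant (the name backslash3 is illustrative; Julia must be started with at least two threads, e.g. julia -t 2):

    using LinearAlgebra

    function backslash3(N, T, b)
        α = 0.1
        A = Tridiagonal(fill(-α, N-1), fill(2α + 1, N), fill(-α, N-1))
        a = zeros(N, 4, T)
        for i = 1:T
            # each thread writes into its own column slice of a
            Threads.@threads for j = 1:2
                a[:, j, i] = A \ b[:, i]
            end
        end
        return a
    end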

    Last but not least, BLAS might be configured to be multi-threaded, in which case parallelizing yourself on a single machine might bring no additional benefit (it depends on the scenario). You can inspect and control this as shown below.
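
    A quick check (BLAS.get_num_threads requires a reasonably recent Julia version):

    using LinearAlgebra

    BLAS.get_num_threads()   # how many threads BLAS currently uses
    BLAS.set_num_threads(1)  # force single-threaded BLAS before adding your own parallelism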