Tags: parallel-processing, multiprocessing, julia, matrix-inverse

Parallelise backslash matrix inversion using @distributed


I'm solving a PDE using an implicit scheme, which at every time step I can divide into two matrices that are then connected by a boundary condition (also applied at every time step). I'm trying to speed up the process by using multiprocessing to invert both matrices at the same time.

Here's what this looks like in a minimal (non-PDE-solving) example:

using Distributed
using LinearAlgebra

function backslash(N, T, b, exec)
    A = zeros(N,N)
    α = 0.1
    for i in 1:N, j in 1:N
        abs(i-j)<=1 && (A[i,j]+=-α)
        i==j && (A[i,j]+=3*α+1)
    end

    A = Tridiagonal(A)
    a = zeros(N, 4, T)

    if exec == "parallel"
        for i = 1:T
            @distributed for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    elseif exec == "single"
        for i = 1:T
            for j = 1:2
                a[:, j, i] = A\b[:, i]
            end
        end
    end
    return a
end

b = rand(1000, 1000)

a_single = @time backslash(1000, 1000, b, "single");
a_parallel = @time backslash(1000, 1000, b, "parallel");

a_single == a_parallel

Here comes the problem: the last line evaluates to true, with a 6-fold speed-up, yet only a 2-fold speed-up should be possible. What am I getting wrong?


Solution

    1. You are measuring compile time
    2. Your @distributed loop returns immediately, without waiting for the workers to finish
    3. Your @distributed loop does not collect the results back to the master process

    Hence, after adding workers, you get:

    julia> addprocs(2);
    
    julia> sum(backslash(1000, 1000, b, "single")), sum(backslash(1000, 1000, b, "parallel"))
    (999810.3418359067, 0.0)
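
    (If no workers had been added, the loop body would have run asynchronously on the master process itself, which would explain why your comparison still returned true and why @time reported an impossible speed-up: it returned before the asynchronous work had finished.) To see points 2 and 3 in isolation, here is a minimal sketch (the variable names are illustrative): without a reducer, @distributed returns a Task immediately instead of blocking, and each worker mutates its own serialized copy of the captured array.

    using Distributed
    addprocs(2)

    x = zeros(4)
    t = @distributed for j = 1:4   # returns a Task right away (point 2)
        x[j] = j                   # each worker writes to its own copy of x (point 3)
    end
    wait(t)                        # even after the tasks have finished...
    x                              # ...x on the master process is still all zeros

    As for point 1, call each function once before timing it, or use @btime from BenchmarkTools.jl, so that compilation is excluded from the measurement.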
    

    So, in order to make your code work, you need to collect the results from the distributed loop, which can be done with a reducer:

    function backslash2(N, T, b, exec)
        A = zeros(N,N)
        α = 0.1
        for i in 1:N, j in 1:N
            abs(i-j)<=1 && (A[i,j]+=-α)
            i==j && (A[i,j]+=3*α+1)
        end
    
        A = Tridiagonal(A)
        a = zeros(N, 4, T)
    
        if exec == :parallel
            for i = 1:T
               
                aj = @distributed (append!) for j = 1:2
                    [A\b[:, i]]
                end
                # you could consider using SharedArrays instead
                a[:, 1, i] .= aj[1]
                a[:, 2, i] .= aj[2]
            end
        elseif exec == :single
            for i = 1:T
                for j = 1:2
                    a[:, j, i] = A\b[:, i]
                end
            end
        end
        return a
    end
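
    The (append!) reducer concatenates the value of each iteration's last expression, and the partial results are reduced in iteration order, so aj[1] holds the solution for j = 1 and aj[2] the one for j = 2. A tiny self-contained illustration of the same pattern:

    using Distributed
    addprocs(2)

    # each iteration returns a one-element vector; (append!) concatenates
    # them in iteration order, so aj == [10, 20]
    aj = @distributed (append!) for j = 1:2
        [10j]
    end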
    

    Now you have equal results:

    julia>  sum(backslash2(1000, 1000, b, :single)) == sum(backslash2(1000, 1000, b, :parallel))
    true
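
    As the comment in backslash2 suggests, a SharedArray would avoid the collection step entirely: its memory is visible to all workers on the same machine, so each worker can write its columns directly. A sketch of this variant (the name backslash_shared is illustrative, and the matrix is built directly in Tridiagonal form):

    using Distributed
    addprocs(2)
    @everywhere using LinearAlgebra, SharedArrays

    function backslash_shared(N, T, b)
        α = 0.1
        # same tridiagonal matrix as before: 2α+1 on the diagonal, -α off it
        A = Tridiagonal(fill(-α, N-1), fill(2α + 1, N), fill(-α, N-1))
        a = SharedArray{Float64}(N, 4, T)    # visible to all local workers
        for i = 1:T
            @sync @distributed for j = 1:2   # @sync blocks until the workers finish
                a[:, j, i] = A \ b[:, i]     # written directly into shared memory
            end
        end
        return Array(a)                      # convert back to a regular Array
    end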
    

    However, @distributed is very inefficient for loop bodies that take only a few milliseconds to execute: dispatching a distributed job itself costs a few milliseconds each time, and since you run the loop 1000 times, in this example the @distributed version will take many times longer than the serial one.

    Perhaps your production task takes longer per iteration, in which case the overhead can pay off. Otherwise, consider Threads.@threads instead, as sketched below.
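
    Threads share memory, so the results can be written straight into a with no collection step, and the dispatch overhead is far lower than for distributed jobs. A sketch of a threaded variant (the name backslash3 is illustrative; Julia must be started with at least two threads, e.g. julia -t 2):

    using LinearAlgebra

    function backslash3(N, T, b)
        α = 0.1
        A = Tridiagonal(fill(-α, N-1), fill(2α + 1, N), fill(-α, N-1))
        a = zeros(N, 4, T)
        for i = 1:T
            # each thread writes into its own column slice of a
            Threads.@threads for j = 1:2
                a[:, j, i] = A \ b[:, i]
            end
        end
        return a
    end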

    Last but not least, BLAS might be configured to be multi-threaded, in which case parallelizing yourself on a single machine might bring no additional benefit (it depends on the scenario). You can inspect and control this as shown below.
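
    A quick check (BLAS.get_num_threads requires a reasonably recent Julia version):

    using LinearAlgebra

    BLAS.get_num_threads()   # how many threads BLAS currently uses
    BLAS.set_num_threads(1)  # force single-threaded BLAS before adding your own parallelism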