multithreading, julia, linear-algebra, blas, arbitrary-precision

How to force Julia to use multiple threads for matrix multiplication?


I want to find powers of a relatively small matrix whose entries are rational numbers of type Rational{BigInt}. By default, Julia uses only a single thread for such computations. I want to check whether multithreaded matrix multiplication would yield a performance gain. How do I do this?

Below is an example of raising a 32×32 matrix to the fourth power. If I run it on an i7-12700K, it uses only one thread:

using Random
using LinearAlgebra

Random.seed!(42)

M = BigInt.(rand(Int128, 32, 32)) .// BigInt.(rand(Int128, 32, 32));

BLAS.set_num_threads(8)

@time M^4;

The output is:

19.976082 seconds (1.24 M allocations: 910.103 MiB, 0.19% gc time)

With plain Float64 and big matrices, I can see that Julia correctly uses multiple threads:

N = rand(2^14,2^14)

@time N^4;

32.764584 seconds (1.71 M allocations: 4.113 GiB, 0.08% gc time, 1.14% compilation time)

Solution

  • As noted in the comments above, BLAS isn't involved in this at all: BLAS routines only handle hardware floating-point element types, so BLAS.set_num_threads does nothing for a Rational{BigInt} matrix, and Julia falls back to its generic, single-threaded matrix multiplication.
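
    One quick way to confirm this (a sketch, not part of the original answer) is to check the element types BLAS actually supports; the LinearAlgebra.BlasFloat union is part of the standard library:

    using LinearAlgebra

    # The only element types the BLAS wrappers handle:
    LinearAlgebra.BlasFloat                      # Union of Float32, Float64 and their complex versions

    # Rational{BigInt} is not among them, so BLAS.set_num_threads has no effect
    # on M * M; it goes through the generic fallback instead:
    Rational{BigInt} <: LinearAlgebra.BlasFloat  # false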

    Since I have it, here's a very simple multi-threaded function:

    julia> M3 = @time M^3;
      8.113582 seconds (1.24 M allocations: 644.222 MiB, 0.60% gc time)
    
    julia> function mul_th(A::AbstractMatrix, B::AbstractMatrix)
             C = similar(A, size(A,1), size(B,2))
             size(A,2) == size(B,1) || error("sizes don't match up")
             Threads.@threads for i in axes(A,1)
               for j in axes(B,2)
                 acc = zero(eltype(C))
                 for k in axes(A,2)
                   acc += A[i,k] * B[k,j]
                 end
                 C[i,j] = acc
               end
             end
             C
           end;
    
    julia> M3 == @time mul_th(mul_th(M, M), M)
      2.313267 seconds (1.24 M allocations: 639.237 MiB, 2.29% gc time, 5.94% compilation time)
    true
    
    julia> Threads.nthreads()  # running e.g. julia -t4
    4
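
    With mul_th in hand, the original M^4 can be computed with two threaded multiplications by squaring twice, the same pattern Base.power_by_squaring uses. A small usage sketch:

    M2 = mul_th(M, M)
    M4 = mul_th(M2, M2)   # (M^2)^2 == M^4
    M4 == M^4             # true, since the rational arithmetic is exact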
    

    Various packages can write this loop for you, e.g. using Einsum; mul(A,B) = @vielsum C[i,k] := A[i,j] * B[j,k], or else using Tullio; mul(A,B) = @tullio C[i,k] := A[i,j] * B[j,k] threads=10.
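
    For instance, a self-contained sketch of the Tullio version (this assumes the Tullio package is installed and that Julia was started with threads, e.g. julia -t4):

    using Tullio

    # @tullio writes the triple loop and splits it over Julia's threads;
    # the threads= keyword (as in the one-liner above) tunes when it does so.
    mul_tullio(A, B) = @tullio C[i,k] := A[i,j] * B[j,k] threads=10

    M4_tullio = mul_tullio(mul_tullio(mul_tullio(M, M), M), M)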

    Higher powers are much slower, because the numerators and denominators of the entries grow enormously with every multiplication, so each BigInt operation gets more expensive (see the digit-count sketch at the end):

    julia> M2 = @time M^2;
      0.133534 seconds (621.57 k allocations: 51.243 MiB, 3.22% gc time)
    
    julia> M3 = @time M^3;
      8.084701 seconds (1.24 M allocations: 644.222 MiB, 0.64% gc time)
    
    julia> M4 = @time M^4;  # uses Base.power_by_squaring
     20.915199 seconds (1.24 M allocations: 910.664 MiB, 0.84% gc time)
    
    julia> @time M2 * M2;  # all the time is here:
     20.659935 seconds (621.57 k allocations: 859.421 MiB, 0.69% gc time)
    
    julia> using Statistics; mean(x -> abs(x.den), M)
    6.27823462640995725259881990669421930274423828125e+37
    
    julia> mean(x -> abs(x.den), M2)
    4.845326324048551470412760353413448348641588891008324543404627136353750441508056e+2349
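
    To make that growth concrete, here's an illustrative check (maxdigits is a hypothetical helper, not from the original answer) of how many decimal digits the entries reach:

    # Largest number of decimal digits in any numerator or denominator:
    maxdigits(A) = maximum(x -> max(ndigits(numerator(x)), ndigits(denominator(x))), A)

    maxdigits(M)    # tens of digits for the original entries
    maxdigits(M2)   # thousands of digits after a single multiplication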