Julia broadcasting twice as slow as Matlab

Tags: performance, matlab, julia, overhead, broadcasting


I am trying to familiarize myself with Julia in order to migrate from Matlab. So far so good, until I started using broadcasting to port a specific function, which runs roughly twice as slow as its Matlab counterpart.

function features(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
    X = X.-mid
    H = 4.0.*hyper.+maximum(abs.(X))
    X = (X.+H)./(2.0.*H)
    w = transpose(1:M)
    S = (sqrt.(2.0.*pi).*hyper).*exp.((-0.5.*hyper.^2).*((pi.*w./(2.0.*H)).^2))
    f = H.^(-0.5).*sin.(pi.*X.*w).*sqrt.(S)
end

Any help would be appreciated!


Solution

  • Firstly, your use of broadcasting isn't optimal: you are using it too much in some places, and not enough in others ;)

    Secondly, almost all the runtime (99.9%) is happening in the broadcasted sin expression, so the effort should be focused there.

    And thirdly, you shouldn't really expect Julia to outperform Matlab in a case like this. This is exactly what Matlab has been optimized for: straightforward elementwise calls into optimized C/Fortran routines. Also, Matlab is multi-threaded by default, implicitly running the elementwise calls in parallel, while Julia requires you to be explicit about multi-threading.

    For now, a factor of 2 difference doesn't seem unreasonable.
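
    A note on that threading point: Julia only uses multiple threads if you start it with them enabled, e.g. julia --threads=8 (or the -t shorthand, or the JULIA_NUM_THREADS environment variable). You can check what you got with:

    jl> Threads.nthreads()  # number of threads this session was started with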

    Still, let's make an effort. Here are a few comments first:

    X = X .- mid
    

    You are missing the opportunity to make this operation in-place. Use

    X .= X .- mid
    

    instead. This saves the allocation of an intermediate array. (Note that it also mutates the array the caller passed in, so only do this if that is acceptable.)
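
    You can check the allocation behavior yourself with BenchmarkTools (a quick sketch; try it on your machine):

    jl> Y = rand(10000);

    jl> @btime $Y .- 0.4;         # allocates a fresh output array on every call

    jl> @btime $Y .= $Y .- 0.4;   # writes into Y's existing memory, no allocation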

    H = 4.0.*hyper.+maximum(abs.(X))
    

    Broadcasting over scalars (hyper is a scalar) is pointless at best and wasteful at worst, and abs.(X) creates an unnecessary temporary array. Instead, use the method of maximum that takes a function as its first argument, which is more efficient:

    H = 4 * hyper + maximum(abs, X)
    

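    The function-argument form applies abs on the fly while scanning, so no temporary array is ever built. For example:

    jl> maximum(abs, [-3.0, 2.0])  # same result as maximum(abs.([-3.0, 2.0])), minus the temporary
    3.0
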
    Here are some more unnecessary dots:

    S = (sqrt.(2.0.*pi).*hyper).*exp.((-0.5.*hyper.^2).*((pi.*w./(2.0.*H)).^2))
    

    Again, avoid broadcasting over scalars, and use integer literals in most places instead of floats (they promote to Float64 as needed, and keep the code generic):

    S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
    

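    In case the 2H and 2pi forms look suspicious: numeric literal coefficients bind tighter than division in Julia, so pi/2H really means pi/(2*H):

    jl> h = 3.0;

    jl> pi/2h == pi/(2*h)
    true
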
    Note that x^(-0.5) is much slower than 1/sqrt(x), so

    f = H.^(-0.5).*sin.(pi.*X.*w).*sqrt.(S)
    

    should be

    f = sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
    

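    If you want to verify that difference on your own machine:

    jl> x = rand() + 1;

    jl> @btime $x^(-0.5);

    jl> @btime 1/sqrt($x);
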
    Let's put this together:

    function features2(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
        X .= X .- mid
        H = 4 * hyper + maximum(abs, X)
        X .= (X .+ H) ./ (2 * H)
        w = 1:M
        S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
        f = sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
        return f
    end
    

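    First, a quick sanity check with some throwaway data (passing copies, since features2 overwrites its input):

    jl> Xt = rand(100);

    jl> features(copy(Xt), 10, 0.3, 0.4) ≈ features2(copy(Xt), 10, 0.3, 0.4)  # should be true up to roundoff
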
    Benchmarks:

    jl> X = rand(10000);
    
    jl> M = 100;
    
    jl> hyper = rand();
    
    jl> mid = 0.4;
    
    jl> @btime features($X, $M, $hyper, $mid);
      17.339 ms (9 allocations: 7.86 MiB)
    
    jl> @btime features2($X, $M, $hyper, $mid);
      17.173 ms (4 allocations: 7.63 MiB)
    

    That's not much of a speedup. Fewer allocations, though. The problem is that runtime is dominated to an enormous degree by the sin broadcast.

    Let's try multithreading. I have 8 cores, so I'm using 8 threads:

    function features3(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
        X .= X .- mid
        H = 4 * hyper + maximum(abs, X)
        X .= (X .+ H) ./ (2 * H)
        w = transpose(1:M)
        S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
        f = similar(X, length(X), M)
        temp = sqrt.(S) ./ sqrt(H)
        Threads.@threads for j in axes(f, 2)
            wj = w[j]
            tempj = temp[j]
            for i in axes(f, 1)
            @inbounds f[i, j] = tempj * sin(pi * X[i] * wj)
            end
        end
        return f
    end
    

    Benchmark:

    jl> @btime features3($X, $M, $hyper, $mid);
      1.919 ms (45 allocations: 7.63 MiB)
    

    That's a lot better, 9 times faster with a loop and explicit threading.

    But there are still some options left, for example the amazing LoopVectorization.jl. You need a fairly recent version of the package, and installation can run into compatibility issues depending on which other packages you have. LoopVectorization has two macros of particular interest here, @avx and @avxt: the former does a lot of work to vectorize your code (in the SIMD sense) on a single thread, while the latter does the same but multi-threaded.

    using LoopVectorization
    
    function features4(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
        X .= X .- mid
        H = 4 * hyper + maximum(abs, X)
        X .= (X .+ H) ./ (2 * H)
        w = collect(1:M)  # I have to use collect here due to some issue with LoopVectorization
        S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
        f = @avx sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
        return f
    end
    
    function features4t(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
        X .= X .- mid
        H = 4 * hyper + maximum(abs, X)
        X .= (X .+ H) ./ (2 * H)
        w = collect(1:M)  # I have to use collect here due to some issue with LoopVectorization
        S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
        f = @avxt sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
        return f
    end
    

    The only difference between these functions is @avx vs @avxt.
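
    One caveat: newer versions of LoopVectorization rename these macros to @turbo and @tturbo, so there the threaded line would read:

    f = @tturbo sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))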

    Benchmarks:

    jl> @btime features4($X, $M, $hyper, $mid);
      2.695 ms (5 allocations: 7.63 MiB)
    

    A very nice speedup for the single threaded case.

    jl> @btime features4t($X, $M, $hyper, $mid);
      431.700 μs (5 allocations: 7.63 MiB)
    

    The multi-threaded @avxt version is 40x as fast as the original code on my laptop. Not bad?