
How to speed up multiple broadcasts in Julia


This Julia function seems to be quite inefficient (an order of magnitude slower than the equivalent Pythran / C++ code, even after the Julia warmup)...

function my_multi_broadcast(a)
    10 * (2*a.^2 + 4*a.^3) + 2 ./ a
end

arr = ones(1000, 1000)
my_multi_broadcast(arr)
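
For reference, such timings can be measured with BenchmarkTools (the same tool used in the answer below), for example:

using BenchmarkTools

@btime my_multi_broadcast($arr);  # interpolate the global with $ to benchmark the call itself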

I guess it is only that I don't write it correctly... How can one speed up such "multi broadcasts" in Julia? I guess/hope I don't need to expand the loops explicitly...

Edit after the first answer

Thank you! With my setup, the Pythran solutions (in place and out of place) are still 1.5 to 2 times faster (without OpenMP). Is there a way to activate SIMD instructions in Julia? Or another way to speed up such CPU computations?

The Python code:

from transonic import jit

@jit
def broadcast(a):
    return 10 * (2*a**2 + 4*a**3) + 2 / a

@jit
def broadcast_inplace(a):
    a[:] = 10 * (2*a**2 + 4*a**3) + 2 / a

Edit after the @simd suggestion

It seems that @simd does not work out of the box, i.e. just by adding it at the beginning of the line.

ERROR: LoadError: LoadError: Base.SimdLoop.SimdError("for loop expected")
Stacktrace:
 [1] compile(::Expr, ::Bool) at ./simdloop.jl:54
 [2] @simd(::LineNumberNode, ::Module, ::Any) at ./simdloop.jl:126
 [3] include at ./boot.jl:317 [inlined]
 [4] include_relative(::Module, ::String) at ./loading.jl:1044
 [5] include(::Module, ::String) at ./sysimg.jl:29
 [6] exec_options(::Base.JLOptions) at ./client.jl:231
 [7] _start() at ./client.jl:425

I guess that one would have to write the loops out explicitly, but then the code (i) becomes much less readable and (ii) is no longer independent of the dimension.
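
For what it is worth, here is a rough sketch (function name made up, not from the original post) of what such an explicit loop could look like; @inbounds and @simd ask the compiler to vectorize it, and linear indexing via eachindex keeps it independent of the number of dimensions:

function my_multi_broadcast_loop!(out, a)
    # eachindex(out, a) checks that the arrays have matching indices and
    # yields linear indices, so the same code works for vectors, matrices,
    # and higher-dimensional arrays.
    @inbounds @simd for i in eachindex(out, a)
        x = a[i]
        out[i] = 10 * (2 * x^2 + 4 * x^3) + 2 / x
    end
    return out
end

my_multi_broadcast_loop!(similar(arr), arr)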

It seems that we have a case where simple Python/NumPy code accelerated with Pythran runs faster than what we get with Julia (unless there is a way to accelerate this further in Julia; a future Julia version may also solve this). Interesting...


Solution

  • Broadcast all operations like this:

    julia> function my_multi_broadcast2(a)
               @. 10 * (2*a^2 + 4*a^3) + 2 / a
           end
    my_multi_broadcast2 (generic function with 1 method)
    

    The difference is that in 10 * (2*a.^2 + 4*a.^3) + 2 ./ a you do not actually take advantage of broadcast fusion, since the * and the two + are not broadcast; each of these non-dotted operations allocates a separate intermediate array.

    Writing @. 10 * (2*a^2 + 4*a^3) + 2 / a is equivalent to 10 .* (2 .* a.^2 .+ 4 .* a.^3) .+ 2 ./ a.
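
    One way to double-check this equivalence (not part of the original answer) is to expand the macro; it prints the expression with every operator in its dotted, broadcasting form, matching the line above:

    julia> @macroexpand @. 10 * (2*a^2 + 4*a^3) + 2 / a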

    And here is the performance comparison:

    julia> @btime my_multi_broadcast($arr);
      58.146 ms (18 allocations: 61.04 MiB)
    
    julia> @btime my_multi_broadcast2($arr);
      5.982 ms (4 allocations: 7.63 MiB)
    

    How does it compare to Pythran / C++, given that we get roughly a 10x speedup?

    Finally, note that you could also mutate arr in place by writing:

    julia> function my_multi_broadcast3(a)
               @. a = 10 * (2*a^2 + 4*a^3) + 2 / a
           end
    my_multi_broadcast3 (generic function with 1 method)
    
    julia> @btime my_multi_broadcast3($arr);
      1.840 ms (0 allocations: 0 bytes)
    

    which is faster still and does zero allocations (I do not know whether you want to modify arr in place or create a new array, so I show both approaches).
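
    As a quick sanity check (not part of the original answer), all three versions compute the same values, for example:

    julia> arr = ones(1000, 1000);

    julia> my_multi_broadcast(arr) ≈ my_multi_broadcast2(arr)
    true

    julia> my_multi_broadcast3(copy(arr)) ≈ my_multi_broadcast2(arr)
    true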