This Julia function seems to be quite inefficient (an order of magnitude slower than the equivalent Pythran / C++ code, even after the Julia warmup)...
function my_multi_broadcast(a)
10 * (2*a.^2 + 4*a.^3) + 2 ./ a
end
arr = ones(1000, 1000)
my_multi_broadcast(arr)
I guess it is only that I don't write it correctly... How can one speedup such "multi broadcasts" in Julia? I guess/hope I don't need to expend the loops...
Thank you! With my setup, the Pythran solutions (in place and out of place) are still 1.5 to 2 times faster (without OpenMP). Is there a way to activate SIMD instructions in Julia? Or another way to speed up such CPU computations?
The Python code:
from transonic import jit
@jit
def broadcast(a):
return 10 * (2*a**2 + 4*a**3) + 2 / a
@jit
def broadcast_inplace(a):
a[:] = 10 * (2*a**2 + 4*a**3) + 2 / a
@simd
suggestionIt seems that @simd
does not work out of the box, i.e. just by adding it at the beginning of the line.
ERROR: LoadError: LoadError: Base.SimdLoop.SimdError("for loop expected")
Stacktrace:
[1] compile(::Expr, ::Bool) at ./simdloop.jl:54
[2] @simd(::LineNumberNode, ::Module, ::Any) at ./simdloop.jl:126
[3] include at ./boot.jl:317 [inlined]
[4] include_relative(::Module, ::String) at ./loading.jl:1044
[5] include(::Module, ::String) at ./sysimg.jl:29
[6] exec_options(::Base.JLOptions) at ./client.jl:231
[7] _start() at ./client.jl:425
I guess that one would have to expand the for loops, but then the code (i) becomes much less readable and (ii) is no longer independent of the dimension.
It seems that we have a case for which simple Python/Numpy code can get accelerated with Pythran faster than what we get with Julia (except if there is a way to accelerate this in Julia? and a future Julia version may solve this). Interesting...
Broadcast all operations like this:
julia> function my_multi_broadcast2(a)
@. 10 * (2*a^2 + 4*a^3) + 2 / a
end
my_multi_broadcast2 (generic function with 1 method)
The difference is that in 10 * (2*a.^2 + 4*a.^3) + 2 ./ a
you actually do not take advantage of broadcast fusion as *
and two +
are not broadcasted.
Writing @. 10 * (2*a^2 + 4*a^3) + 2 / a
is equivalent to 10 .* (2 .* a.^2 .+ 4 .* a.^3) .+ 2 ./ a
.
And here is the comparison of performance
julia> @btime my_multi_broadcast($arr);
58.146 ms (18 allocations: 61.04 MiB)
julia> @btime my_multi_broadcast2($arr);
5.982 ms (4 allocations: 7.63 MiB)
How does it compare to Pythran / C++, as we get roughly 10x speedup?
Finally note that if you could mutate arr
in place by writing:
julia> function my_multi_broadcast3(a)
@. a = 10 * (2*a^2 + 4*a^3) + 2 / a
end
my_multi_broadcast3 (generic function with 1 method)
julia> @btime my_multi_broadcast3($arr);
1.840 ms (0 allocations: 0 bytes)
which is yet faster and does zero allocations (I do not know if you want to modify arr
in place or create a new array so I show both approaches).