I am trying to familiarize myself with Julia in order to migrate from Matlab, so far so good until I started using broadcasting to port a specific function which performs more or less twice as slow as Matlab.
function features(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
X = X.-mid
H = 4.0.*hyper.+maximum(abs.(X))
X = (X.+H)./(2.0.*H)
w = transpose(1:M)
S = (sqrt.(2.0.*pi).*hyper).*exp.((-0.5.*hyper.^2).*((pi.*w./(2.0.*H)).^2))
f = H.^(-0.5).*sin.(pi.*X.*w).*sqrt.(S)
end
Any help would be appreciated!
Firstly, your use of broadcasting isn't optimal. You are using it too much, and not enough ;)
Secondly, almost all the runtime (99.9%) is happening in the broadcasted sin
expression, so the effort should be focused there.
And thirdly, you shouldn't really expect Julia to outperform Matlab in a case like this. This is exactly what Matlab has been optimized for: straightforward elementwise calls into optimized C/Fortran routines. Also, Matlab is multi-threaded by default, implicitly running the elementwise calls in parallel, while Julia requires you to be explicit about multi-threading.
For now, a factor of 2 difference doesn't seem unreasonable.
Still, let's make an effort. Here are a few comments first:
X = X .- mid
You are missing out on making the operation in-place, use
X .= X .- mid
instead. This saves an allocation of an intermediate array.
H = 4.0.*hyper.+maximum(abs.(X))
Broadcasting over scalars (hyper
) is futile, and at worst, wasteful. And abs.(X)
creates an unnecessary temporary array. Instead use the version of maximum
with a function input, which is more efficient:
H = 4 * hyper + maximum(abs, X)
Here are some more unnecessary dots:
S = (sqrt.(2.0.*pi).*hyper).*exp.((-0.5.*hyper.^2).*((pi.*w./(2.0.*H)).^2))
Avoid broadcasting over scalars again and use integers most places instead of floats:
S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
Note that x^(-0.5)
is much slower than 1/sqrt(x)
, so
f = H.^(-0.5).*sin.(pi.*X.*w).*sqrt.(S)
should be
f = sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
Let's put this together:
function features2(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
X .= X .- mid
H = 4 * hyper + maximum(abs, X)
X .= (X .+ H) ./ (2 * H)
w = 1:M
S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
f = sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
return f
end
Benchmarks:
jl> X = rand(10000);
jl> M = 100;
jl> hyper = rand();
jl> mid = 0.4;
jl> @btime features($X, $M, $hyper, $mid);
17.339 ms (9 allocations: 7.86 MiB)
jl> @btime features2($X, $M, $hyper, $mid);
17.173 ms (4 allocations: 7.63 MiB)
That's not much of a speedup. Fewer allocations, though. The problem is that runtime is dominated to an enormous degree by the sin
broadcast.
Let's try multithreading. I have 8 cores, so I'm using 8 threads:
function features3(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
X .= X .- mid
H = 4 * hyper + maximum(abs, X)
X .= (X .+ H) ./ (2 * H)
w = transpose(1:M)
S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
f = similar(X, length(X), M)
temp = sqrt.(S) ./ sqrt(H)
Threads.@threads for j in axes(f, 2)
wj = w[j]
tempj = temp[j]
for i in axes(f, 1)
@inbounds f[i, j] = tempj * sin(pi * X[i] * w[j])
end
end
return f
end
Bencmark:
jl> @btime features3($X, $M, $hyper, $mid);
1.919 ms (45 allocations: 7.63 MiB)
That's a lot better, 9 times faster with a loop and explicit threading.
But there are still some options left: for example LoopVectorization.jl. You can install this amazing package, but you need a new version, there could be some installation issues, depending on what other packages you have. LoopVectorization has two macros of particular interest, @avx
and @avxt
, the former does a lot of work to vectorize (in a simd sense) your code, single-threaded, while latter does the same, but multi-threaded.
using LoopVectorization
function features4(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
X .= X .- mid
H = 4 * hyper + maximum(abs, X)
X .= (X .+ H) ./ (2 * H)
w = collect(1:M) # I have to use collect here due to some issue with LoopVectorization
S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
f = @avx sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
return f
end
function features4t(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
X .= X .- mid
H = 4 * hyper + maximum(abs, X)
X .= (X .+ H) ./ (2 * H)
w = collect(1:M) # I have to use collect here due to some issue with LoopVectorization
S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
f = @avxt sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
return f
end
The only difference between these functions is @avx
vs @avxt
.
Benchmarks:
jl> @btime features4($X, $M, $hyper, $mid);
2.695 ms (5 allocations: 7.63 MiB)
A very nice speedup for the single threaded case.
jl> @btime features4t($X, $M, $hyper, $mid);
431.700 μs (5 allocations: 7.63 MiB)
The multi-threaded avx code is 40x as fast as the original code on my laptop. Not bad?