I do time-consuming simulations involving the following (simplified) code:
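K=10^5; % large number (rows)
L=1000; % smaller number (columns)
a=rand(K,L);
b=rand(K,L);
c=rand(L,L);
d=zeros(K,L,L); % one K-by-L result per value of m
parfor m=1:L-1
    e=zeros(K,L); % columns 1..m stay zero
    for n=m:L-1
        % recurrence over columns; (n==m) adds 1 on the first step only
        e(:,n+1)=e(:,n)+(n==m)+e(:,n).*a(:,n).*b(:,n)+a(:,1:n)*c(n,1:n)';
    end
    d(:,:,m)=e;
end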
Does anyone know how to speed up this simple code running in parallel (with parfor)?
Since every worker needs the full matrices a, b, and c, there is a large parallel overhead.
The overhead is smaller if I send each worker only the part of b it needs (the inner loop starts at m, so columns before m are never read), but I don't think that makes the code much faster.
Because of the large overhead, parfor is slower than the serial for-loop. As the number of parfor iterations grows (that is, as L increases), a, b, and c grow too, and so does the overhead. Therefore, I do not expect the parfor loop to become faster even for large values of L. Or does anyone see it differently?
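To put rough numbers on that overhead, here is a small back-of-the-envelope sketch. It assumes double precision (8 bytes per element, the default for rand and zeros) and assumes that a, b, and c are treated as broadcast variables while d is a sliced output; the variable names gbA, gbC, gbSlice, and gbD are made up for this illustration.

```matlab
% Rough memory estimate for the sizes in the question (assumption: doubles).
K = 1e5;
L = 1000;
bytesPerDouble = 8;

gbA     = K*L*bytesPerDouble   / 1e9;  % a (and likewise b), sent to each worker
gbC     = L*L*bytesPerDouble   / 1e9;  % c, sent to each worker
gbSlice = K*L*bytesPerDouble   / 1e9;  % one d(:,:,m) slice returned per iteration
gbD     = K*L*L*bytesPerDouble / 1e9;  % the full result array d

fprintf('a or b: %g GB, c: %g GB, one slice of d: %g GB, d: %g GB\n', ...
        gbA, gbC, gbSlice, gbD);
```

With these sizes, a and b are about 0.8 GB each, every iteration returns another 0.8 GB slice, and the full array d would need about 800 GB, so data transfer and memory are likely to dominate regardless of how the arithmetic is distributed.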
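There may be a performance gain from pre-computation. The terms a(:,n).*b(:,n) and a(:,1:n)*c(n,1:n)' do not depend on m, so they can be computed once before the loops instead of being recomputed in every inner iteration: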
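tc = tril(c);   % keep only c(n,1:n) in each row n
ac = a * tc.';  % ac(:,n) equals a(:,1:n)*c(n,1:n).'
ab = a .* b;    % ab(:,n) equals a(:,n).*b(:,n)
for m=1:L-1
    e = zeros(K,L);
    for n=m:L-1
        e(:, n + 1) = e(:, n) + (n==m) + e(:, n) .* ab(:, n) + ac(:, n);
    end
    d(:,:,m) = e;
end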
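Before timing anything, it may be worth confirming on a small problem that the pre-computed version gives the same result as the original loop. The following is a minimal sketch with deliberately small, assumed sizes (K = 50, L = 8); the names dRef, dPre, maxDiff, and relDiff are made up for this check.

```matlab
% Small sizes so the check runs in a moment (not the real K and L).
K = 50;
L = 8;
a = rand(K,L);
b = rand(K,L);
c = rand(L,L);

% Reference: the original inner-loop expression.
dRef = zeros(K,L,L);
for m = 1:L-1
    e = zeros(K,L);
    for n = m:L-1
        e(:,n+1) = e(:,n) + (n==m) + e(:,n).*a(:,n).*b(:,n) + a(:,1:n)*c(n,1:n)';
    end
    dRef(:,:,m) = e;
end

% Pre-computed version: ab and ac are built once, outside both loops.
tc = tril(c);
ac = a * tc.';
ab = a .* b;
dPre = zeros(K,L,L);
for m = 1:L-1
    e = zeros(K,L);
    for n = m:L-1
        e(:,n+1) = e(:,n) + (n==m) + e(:,n).*ab(:,n) + ac(:,n);
    end
    dPre(:,:,m) = e;
end

% The two results should agree up to floating-point rounding.
maxDiff = max(abs(dRef(:) - dPre(:)));
relDiff = maxDiff / max(abs(dRef(:)));
fprintf('max abs diff = %g, relative diff = %g\n', maxDiff, relDiff);
```

The difference is not expected to be exactly zero, because the matrix product a * tc.' may sum its terms in a different order than a(:,1:n)*c(n,1:n)', but it should be on the order of rounding error.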