matlab optimization parallel-processing parfor

Saving time and memory using parfor?

Consider prova.mat in MATLAB, obtained in the following way:

for w=1:100
    for p=1:9
        A{p}=randn(100,1);
    end
    baseA_.A=A;

    % Dynamic field name instead of eval: creates baseA.A1, baseA.A2, ...
    baseA.(sprintf('A%d', w)) = baseA_;

end

save('prova.mat','-v7.3', 'baseA')

To give an idea of the actual dimensions of my data, the 1x9 cell in A1 is composed of the following 9 arrays: 904x5, 913x5, 1722x5, 4136x5, 9180x5, 3174x5, 5970x5, 4455x5, 340068x5. The other Aj's have a similar composition.

Consider the following code:

clear all
load prova
tic
parfor w=1:100
       indA=sprintf('A%d', w);
       Aarr=baseA.(indA).A;
       Boot=[];
       for p=1:9
           C=randn(100,1).*Aarr{p};
           Boot=[Boot; C];  
       end
       D{w}=Boot;
end
toc

If I run the parfor loop with 4 local workers on my MacBook Pro, it takes 1.2 sec. If I replace parfor with for, it takes 0.01 sec.

With my actual data, the difference in time is 31 sec versus 7 sec [the creation of the matrix C is also more complicated].

If I have understood correctly, the problem is that the computer has to send baseA to each local worker, and this takes time and memory.

Could you suggest a solution that makes parfor faster than for here? I thought that saving all cells in baseA was a way to save time by loading once at the beginning, but maybe I'm wrong.

Solution

General information

Many MATLAB functions have implicit multithreading built in. When a loop body is dominated by such functions, a parfor loop is no more efficient than a serial for loop, because all cores are already in use. In that case parfor is actually a detriment: it adds allocation overhead while being no more parallel than the function you are calling.
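As a rough illustration of implicit multithreading, the sketch below times a large matrix multiplication with MATLAB limited to one computational thread, and then again with the previous thread limit restored. The matrix size is an arbitrary choice and the timings depend entirely on your machine; maxNumCompThreads is used here only to make the effect visible.

```matlab
% Sketch: show that a built-in operation may already use several cores.
% The matrix size (2000) is an arbitrary choice for illustration.
A = rand(2000);

% Limit MATLAB to one computational thread; the call returns the old limit.
nPrev = maxNumCompThreads(1);
tic
B1 = A*A;
tSingle = toc;

% Restore the previous limit and repeat the same operation.
maxNumCompThreads(nPrev);
tic
BN = A*A;
tMulti = toc;

fprintf('1 thread: %.3f s, %d threads: %.3f s\n', tSingle, nPrev, tMulti);
```

If the second timing is clearly lower, the operation is already spread over your cores, and wrapping it in parfor is unlikely to help.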

When you are not using one of the implicitly multithreaded functions, parfor is basically recommended in two cases: when your loop has a very large number of iterations (e.g., 1e10), or when each iteration takes a very long time (e.g., eig(magic(1e4))). In the second case you might also want to consider spmd (although it is slower than parfor in my experience). parfor is slower than a for loop for short ranges or fast iterations because of the overhead needed to manage all the workers correctly, as opposed to just doing the calculation.

Check this question for information on splitting data between separate workers.
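For the data layout in the question, one option is to repack the struct into a cell array before the loop, so that parfor can slice it and send each worker only the cells it needs instead of broadcasting all of baseA. The following is a minimal sketch under assumptions: it builds a small stand-in for baseA rather than loading prova.mat, Acell is a hypothetical helper name, and randn(size(...)) generalises the randn(100,1) of the question. Whether this beats a plain for loop still depends on how heavy each iteration is.

```matlab
% Small stand-in for baseA, same layout as in the question (assumption).
nW = 100;
baseA = struct();
for w = 1:nW
    baseA.(sprintf('A%d', w)).A = ...
        arrayfun(@(p) randn(100,1), 1:9, 'UniformOutput', false);
end

% Repack once, serially: one cell per loop iteration.
Acell = cell(nW, 1);              % hypothetical helper variable
for w = 1:nW
    Acell{w} = baseA.(sprintf('A%d', w)).A;
end

D = cell(nW, 1);
parfor w = 1:nW
    Aarr = Acell{w};              % sliced: only this cell goes to the worker
    Boot = cell(numel(Aarr), 1);
    for p = 1:numel(Aarr)
        Boot{p} = randn(size(Aarr{p})) .* Aarr{p};
    end
    D{w} = vertcat(Boot{:});      % concatenate once instead of growing Boot
end
```

Because Acell is indexed only by the loop variable w, parfor classifies it as a sliced variable, whereas baseA.(indA) in the original loop forces the whole struct to be broadcast to every worker.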

Benchmarking

Code

Consider the following example to see the behaviour of for as opposed to that of parfor. First open the parallel pool if you've not already done so:

gcp; % Opens a parallel pool using your current settings

Then execute a couple of large loops:

n = 1000; % Number of iterations
EigenValues = cell(n,1); % Prepare to store the data
Time = zeros(n,1);
for ii = 1:n
    tic
    EigenValues{ii,1} = eig(magic(1e3)); % Might want to lower the magic if it takes too long
    Time(ii,1) = toc; % Collect time after each iteration
end

figure; % Create a plot of results
plot(1:n,Time) % Plot the measured time of every iteration
title 'Time per iteration'
ylabel 'Time [s]'
xlabel 'Iteration number [-]';

Then do the same with parfor instead of for. You will notice that the average time per iteration goes up (from 0.27s to 0.39s in my case). Keep in mind, however, that parfor uses all available workers, so the total time (sum(Time)) has to be divided by the number of cores in your computer to estimate the elapsed time. In my case the total time went down from around 270s to 49s, since I have an octacore processor.

So, whilst the time to do each separate iteration goes up using parfor with respect to using for, the total time goes down considerably.
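Since tic/toc inside the loop only measures individual iterations, a simple cross-check is to time the whole loop from the outside as well. The sketch below does this for the parfor version and compares the measured time with sum(Time) divided by the pool size. It assumes the Parallel Computing Toolbox is available, and it uses a smaller n and matrix size than above purely to keep the run short.

```matlab
pool = gcp;                        % open (or reuse) the parallel pool
n = 100;                           % fewer iterations, to keep it quick
EigenValues = cell(n,1);
Time = zeros(n,1);

tTotal = tic;                      % timer for the whole loop
parfor ii = 1:n
    tIter = tic;                   % per-iteration timer on the worker
    EigenValues{ii} = eig(magic(300));
    Time(ii) = toc(tIter);
end
wallTime = toc(tTotal);

% Naive estimate: summed iteration time over the number of workers
estimate = sum(Time) / pool.NumWorkers;
fprintf('Measured: %.2f s, estimated: %.2f s\n', wallTime, estimate);
```

The measured time is usually somewhat higher than the estimate, because it also includes the cost of distributing iterations and collecting results.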

Results

This picture shows the results of the test as I just ran it on my home PC. I used n=1000 and eig(magic(500)); my computer has an i5-750 2.66GHz processor with four cores and runs MATLAB R2012a. As you can see, the average of the parallel test hovers around 0.29s with a lot of spread, whilst the serial code is quite steady around 0.24s. The total time, however, went down from 234s to 72s, which is a speed-up of 3.25 times. The reason this is not exactly 4 is the overhead of running in parallel, which shows up as the extra time each iteration takes. This overhead comes from MATLAB having to check what each core is doing, making sure that each loop iteration is performed only once and that the data is put into the correct storage location.
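The speed-up quoted above can be reproduced from the two totals. The short sketch below also derives the parallel efficiency (speed-up divided by core count), which is not stated in the text but follows directly from the same numbers.

```matlab
tSerial   = 234;   % total serial time in seconds, from the text
tParallel = 72;    % total parallel time in seconds, from the text
nCores    = 4;     % cores of the i5-750

speedup    = tSerial / tParallel;   % 3.25
efficiency = speedup / nCores;      % 0.8125, i.e. about 81 % of ideal
fprintf('Speed-up: %.2f, efficiency: %.0f %%\n', speedup, 100*efficiency);
```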