Here is my attempt to benchmark the performance of the Intel TBB flow graph. Here is the setup:

- One broadcast_node<continue_msg> broadcasts a continue_msg to N successor nodes.
- Each successor node has a body that takes t seconds to execute.
- The total serial computation time is: Tserial = N * t
- The ideal parallel computation time is: Tpar(ideal) = N * t / C, where C is the number of cores.
- Speed-up is measured as: Speed-up = Tserial / Tpar(actual)
Here are the results, showing the speed-up as a function of the processing time of an individual task (i.e. the body):

t = 100 microseconds, speed-up = 14
t = 10 microseconds, speed-up = 7
t = 1 microsecond, speed-up = 1

As can be seen, for lightweight tasks (whose computation takes less than 1 microsecond) the parallel code is actually slower than the serial code.
1) Are these results in line with Intel TBB benchmarks?
2) Is there a better paradigm than a flow graph for the case when there are thousands of tasks, each taking less than 1 microsecond?
The Overhead of Parallelism
Your cost model is wrong.
The ideal parallel computation time is:

Tpar(ideal) = N * t / C + Pstart + Pend

where Pstart is how long it takes to start your parallelism and Pend is the time taken to finish it. It's not unusual for Pstart to be on the order of tens of milliseconds.
I'm not sure if you're familiar with OpenMP (though it's a good thing to know), but for our purposes it's a threading model that divides work between a team of threads. A chart of OpenMP's per-construct overheads (not reproduced here) shows that the cost of each construct grows with the size of the thread team. The takeaway is that getting your parallelism going (the parallel for construct) is potentially expensive and grows with the number of threads. Ending parallelism (the barrier) has similar costs.
In fact, if you look at TBB's tutorial, Section 3.2.2 ("Automatic Chunking") says:
CAUTION: Typically a loop needs to take at least a million clock cycles for parallel_for to improve its performance. For example, a loop that takes at least 500 microseconds on a 2 GHz processor might benefit from parallel_for.
Achieving Better Speed
The best way to speed up code like this is to perform the operations in parallel only if there are a large number of them, and/or to adjust the number of threads doing the work so that each thread has plenty to do. In TBB you can achieve this behaviour like so:
#include <tbb/parallel_for.h>
// . . .
if(work_size>1000)
  tbb::parallel_for( . . . );          // enough work: run in parallel
else
  tbb::serial::parallel_for( . . . );  // too little work: stay serial
// (tbb::serial::parallel_for may require defining TBB_PREVIEW_SERIAL_SUBSET)
// . . .
where you'd want to adjust 1000
to a number high enough that you start to see gains from parallelism.
You could also reduce the number of threads, since this reduces the overhead somewhat:
tbb::task_scheduler_init init(nthread);
TBB also performs dynamic load balancing (see here). This is great if you expect your loop iterations/tasks to have a broad distribution of run-times, but represents unnecessary overhead if the expected run-times are the same. I'm not sure if TBB has static scheduling, but it may be worth looking into.
In case people end up here without a strong commitment to TBB, in OpenMP you'd do something like:
#pragma omp parallel for if(work_size>1000) num_threads(4) schedule(static)