c++reducetbbdeterministic

tbb::parallel_reduce vs tbb::parallel_deterministic_reduce


Threading Building Blocks (TBB) library provides two functions for performing reduction over a range:

Which one of two shall be selected if I want to perform the reduction as fast as possible, but still get exactly the same answer independently on hardware concurrency and the load from other processes or threads? I am basically interested in two scenarios:

  1. Computing the sum of elements in integer-value vector.
  2. Computing the sum of elements in floating-point-value vector.

And a side question. On the page about parallel_deterministic_reduce there is one warning:

Since simple_partitioner does not automatically coarsen ranges, make sure to specify an appropriate grain size

Does it mean that the call to parallel_deterministic_reduce with a range having no explicitly specified grain size will lead to poor performance? How grain size shall be set then?


Solution

  • parallel_reduce does not make any guarantees regarding the summation order. If used with floating point numbers, the result is not deterministic since summation of floating point numbers is not associative. In contrast, parallel_deterministic_reduce guarantees that the summation order is always the same, regardless of the number of threads used. But note that there is still no guarantee of a specific summation order, just that the order is deterministic (for example, the result can differ compared to std::accumulate).

    Thus:

    Regarding the note that simple_partitioner does not automatically coarsen ranges: I am not entirely sure why they mention this specifically in the documentation of parallel_deterministic_reduce. In all cases where you use simple_partitioner, you should think about appropriate grain sizes. Not just in case of parallel_deterministic_reduce. A too small grain size can lead to a large overhead. See for example here. In practice this especially means to measure the performance for typical work loads, and play with the grain size or partitioner such that performance is maximized. I guess they just wanted to highlight the general issue again for parallel_deterministic_reduce because parallel_deterministic_reduce supports only simple_partitioner and static_partitioner, but neither affinity_partitioner nor auto_partitioner (where the latter is usually the default).