Tags: cuda, functor, cub

CUB device scan with custom scan op fails


I am using cub::DeviceScan::InclusiveScan, which takes a custom binary, non-commutative operator. I define my operator as follows:

template <typename T>
struct MultAddFunctor
{
    const T factor;
    MultAddFunctor(T factor) : factor(factor) {}
    
    __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return factor*a + b;
    }
};

Otherwise, my code is nearly identical to the example code in the documentation (except that I free the allocated memory and added extra synchronization to rule those out as problems). When factor is 1.0 this produces the correct results (which is just a prefix sum). When factor is something else (such as 0.8), the results are correct for the first 12 values but diverge considerably after that. For example, if the array being scanned is just a bunch of 1.0s, I get the following results:

        CUDA     Serial
   0     1.000     1.000 ✅
   1     1.800     1.800 ✅
   2     2.440     2.440 ✅
   3     2.952     2.952 ✅
   4     3.362     3.362 ✅
   5     3.689     3.689 ✅
   6     3.951     3.951 ✅
   7     4.161     4.161 ✅
   8     4.329     4.329 ✅
   9     4.463     4.463 ✅
  10     4.571     4.571 ✅
  11     4.656     4.656 ✅
  12     6.313     4.725 ❌
  13     6.050     4.780 ❌
  14     5.840     4.824 ❌
  15     5.672     4.859 ❌
...

At element 12 there is a sudden jump in the values, followed by a decrease, even though the results should keep increasing smoothly (with factor 0.8 and an input of all 1.0s, they converge toward 1/(1 − 0.8) = 5).

At first I thought it was due to the non-commutativity of the operation, but the docs explicitly state that non-commutative operators are fine. I also suspected that the factor field might not be reaching the device correctly, but the results are still wrong even if I hard-code 0.8 into the equation. (Separately, factor probably lives in global memory, so moving it into shared/local memory might be worth doing later.)

What other reason could there be that the scan is computing the incorrect results?


Solution

  • The reason for the failure here is that CUB's parallel scan, like every other parallel scan implementation I am aware of, requires a binary scan operator that is associative.

    This follows directly from the definition of associativity, using addition as an example. The associative property of addition says that

    (a+b)+c = a+(b+c)
    

    Applied here, it means that in a parallel setting the scan may combine (a op b) first, or (b op c) first, then fold in the remaining element, and the result must be the same either way. That is associativity, not commutativity, and the op here is not associative. To demonstrate this, expand what the operator actually computes under each grouping (pretend factor is 1000 to make the discrepancy obvious):

    op(op(a,b), c) = 1000*(1000*a + b) + c = 1000000*a + 1000*b + c
    op(a, op(b,c)) = 1000*a + (1000*b + c) =    1000*a + 1000*b + c
    

    and the equality does not hold: the coefficient on a differs by a factor of 1000.