In CUDA 9, NVIDIA seems to have this new notion of "cooperative groups"; and for some reason not entirely clear to me, __ballot() is now (as of CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias, or have the semantics changed?
... similar question for other builtins which now have _sync() added to their names.
No, the semantics are not the same. The function calls themselves are different, and one is not an alias for the other. New functionality has been exposed, and the implementation behavior now differs between the Volta architecture and previous architectures.
First of all, to lay the groundwork, it's necessary to be aware that Volta introduced independent thread scheduling, by adding a per-thread program counter and other changes. As a result, Volta can execute in a non-warp-synchronous fashion for extended periods of time, including during periods of execution when previous architectures might still be warp-synchronous.
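To make the consequence of that concrete, here is a minimal sketch of code that used to lean on warp-synchronous execution. The kernel name neighbor_exchange and the host helper run_neighbor_exchange are made up for illustration, and __syncwarp() (the warp-level barrier added in CUDA 9) is not otherwise discussed in this answer. The point is only that, with independent thread scheduling, the write and the read need an explicit synchronization between them.

```cuda
#include <cuda_runtime.h>

// Hypothetical demo kernel: each lane publishes a value in shared memory,
// then reads the value published by the next lane of the same warp.
__global__ void neighbor_exchange(int *out)
{
    __shared__ int slot[32];
    const unsigned lane = threadIdx.x % 32;

    slot[lane] = static_cast<int>(lane);

    // Older code often omitted any barrier here and relied on the warp
    // running in lockstep. With independent thread scheduling that is not
    // guaranteed, so the write above is ordered before the read below
    // explicitly.
    __syncwarp();

    out[lane] = slot[(lane + 1) % 32];
}

// Hypothetical host helper: launches a single warp and checks the result.
// Returns true if every lane read its neighbor's value.
bool run_neighbor_exchange()
{
    int *d_out = nullptr;
    int h_out[32] = {0};

    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    neighbor_exchange<<<1, 32>>>(d_out);
    const cudaError_t err =
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int lane = 0; lane < 32; ++lane)
        if (h_out[lane] != (lane + 1) % 32) return false;
    return true;
}
```

Calling run_neighbor_exchange() from main() is enough to try it; it needs a CUDA 9 or later toolkit because of __syncwarp().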
Most of the warp intrinsics deliver the expected results only for threads that are actually participating (i.e. are actually active for the issue of that instruction, in that cycle). The programmer can now be explicit about which threads are expected to participate, via the new mask parameter. However, there are some requirements, in particular on Pascal and previous architectures. From the programming guide:
Note, however, that for Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp's active mask.
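As a minimal sketch of the simplest case that satisfies this requirement on every architecture: all 32 threads of a warp reach the call together and pass a full mask. The kernel name ballot_full_warp and the host helper run_ballot_full_warp are illustrative names of my own, not anything from the CUDA API.

```cuda
#include <cuda_runtime.h>

// Hypothetical demo kernel: every lane of a fully converged warp votes.
__global__ void ballot_full_warp(unsigned *result)
{
    const unsigned lane = threadIdx.x % 32;

    // All 32 threads execute this call together, so the full mask matches
    // the set of threads that actually participate. Bit i of the return
    // value is set if lane i passed a non-zero predicate.
    const unsigned votes = __ballot_sync(0xffffffffu, (lane % 2) == 0);

    if (lane == 0) *result = votes;
}

// Hypothetical host helper: launches one warp and returns the ballot.
// Returns 0 if any CUDA runtime call fails.
unsigned run_ballot_full_warp()
{
    unsigned *d_result = nullptr;
    unsigned h_result = 0;

    if (cudaMalloc(&d_result, sizeof(unsigned)) != cudaSuccess) return 0;
    ballot_full_warp<<<1, 32>>>(d_result);
    const cudaError_t err = cudaMemcpy(&h_result, d_result, sizeof(unsigned),
                                       cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return (err == cudaSuccess) ? h_result : 0;
}
```

With the even lanes voting true, run_ballot_full_warp() should return 0x55555555.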
On Volta, however, the warp execution engine will bring about the necessary synchronization/participation amongst the threads indicated in the mask, in order to make the indicated operation valid (assuming the appropriate _sync version of the intrinsic is used). To be clear, the warp execution engine will reconverge threads that have diverged on Volta in order to match the mask. However, it will not overcome programmer-induced errors, such as preventing a thread named in the mask from participating in a _sync() intrinsic via conditional statements.
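A hedged sketch of what that distinction looks like in code: the mask is computed while the warp is still converged, and the intrinsic inside the branch names only the threads that actually take the branch. The names ballot_in_branch and run_ballot_in_branch are hypothetical; the pattern is one reasonable way to respect the rule, not the only one.

```cuda
#include <cuda_runtime.h>

// Hypothetical demo kernel: only lanes 0..15 take the branch, and the
// ballot inside the branch names exactly those lanes.
__global__ void ballot_in_branch(unsigned *result)
{
    const unsigned lane = threadIdx.x % 32;
    const bool takes_branch = lane < 16;

    // Computed while the whole warp is converged: 0x0000ffff here.
    const unsigned branch_mask = __ballot_sync(0xffffffffu, takes_branch);

    if (takes_branch) {
        // Every thread named in branch_mask reaches this call, and each
        // caller has its own bit set in the mask.
        const unsigned votes = __ballot_sync(branch_mask, (lane % 4) == 0);

        // Passing 0xffffffffu here instead would name lanes 16..31, which
        // never execute this call. That is the programmer-induced error
        // described above, and the result would be undefined.

        if (lane == 0) *result = votes;
    }
}

// Hypothetical host helper: launches one warp and returns the ballot.
// Returns 0 if any CUDA runtime call fails.
unsigned run_ballot_in_branch()
{
    unsigned *d_result = nullptr;
    unsigned h_result = 0;

    if (cudaMalloc(&d_result, sizeof(unsigned)) != cudaSuccess) return 0;
    ballot_in_branch<<<1, 32>>>(d_result);
    const cudaError_t err = cudaMemcpy(&h_result, d_result, sizeof(unsigned),
                                       cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return (err == cudaSuccess) ? h_result : 0;
}
```

Lanes 0, 4, 8 and 12 vote true inside the branch, so run_ballot_in_branch() should return 0x1111.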
This related question discusses the mask parameter. This answer is not intended to address all possible questions that may arise from independent thread scheduling and its impact on warp-level intrinsics. For that, I encourage reading the programming guide.