The AVX512CD instruction families are: VPCONFLICT, VPLZCNT and VPBROADCASTM.
The Wikipedia section about these instruction says:
The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.
What are some examples that show these instruction being useful in vectorizing loops? It would be helpful if answers will include scalar loops and their vectorized counterparts.
Thanks!
One example where the CD instructions might be useful is histogramming. For scalar code histogramming is just a simple loop like this:
load bin index
load bin count at index
increment bin count
store updated bin count at index
Normally you can't vectorize histogramming because you might have the same bin index more than once in a vector - you might naïvely try something like this:
load vector of N bin indices
perform gathered load using N bin indices to get N bin counts
increment N bin counts
store N updated bin counts using scattered store
but if any of the indices within a vector are the same then you get a conflict, and the resulting bin update will be incorrect.
So, CD instructions to the rescue:
load vector of N bin indices
use CD instruction to test for duplicate indices
set mask for all unique indices
while mask not empty
perform masked gathered load using <N bin indices to get <N bin counts
increment <N bin counts
store <N updated bin counts using masked scattered store
remove non-masked indices and update mask
end
In practice this example is quite inefficient and no better than scalar code, but there are other more compute-intensive examples where using the CD instructions seems to be worthwhile. Typically these will be simulations where the data elements are going to be updated in a non-deterministic fashion. One example (from the LAMMPS Molecular Dynamics Simulator) is referred to in the KNL book by Jeffers et al.