x86vectorizationsimdintel-micavx512

How do the Conflict Detection instructions make it easier to vectorize loops?


The AVX512CD instruction families are: VPCONFLICT, VPLZCNT and VPBROADCASTM.

The Wikipedia section about these instruction says:

The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.

What are some examples that show these instruction being useful in vectorizing loops? It would be helpful if answers will include scalar loops and their vectorized counterparts.

Thanks!


Solution

  • One example where the CD instructions might be useful is histogramming. For scalar code histogramming is just a simple loop like this:

    load bin index
    load bin count at index
    increment bin count
    store updated bin count at index
    

    Normally you can't vectorize histogramming because you might have the same bin index more than once in a vector - you might naïvely try something like this:

    load vector of N bin indices
    perform gathered load using N bin indices to get N bin counts
    increment N bin counts
    store N updated bin counts using scattered store
    

    but if any of the indices within a vector are the same then you get a conflict, and the resulting bin update will be incorrect.

    So, CD instructions to the rescue:

    load vector of N bin indices
    use CD instruction to test for duplicate indices
    set mask for all unique indices
    while mask not empty
        perform masked gathered load using <N bin indices to get <N bin counts
        increment <N bin counts
        store <N updated bin counts using masked scattered store
        remove non-masked indices and update mask
    end
    

    In practice this example is quite inefficient and no better than scalar code, but there are other more compute-intensive examples where using the CD instructions seems to be worthwhile. Typically these will be simulations where the data elements are going to be updated in a non-deterministic fashion. One example (from the LAMMPS Molecular Dynamics Simulator) is referred to in the KNL book by Jeffers et al.