c++ assembly intrinsics avx avx2

Comparing Unsigned integers using AVX2 Intrinsics


I want to threshold values greater than 15 using AVX2 intrinsics, but the comparison I am using (_mm256_cmpgt_epi8) treats the bytes as signed numbers.

    __m256i *pIn0, *pIn1,*pOut;
    __m256i a, b, thres = _mm256_set1_epi8(15); //Threshold value is set to 15
    
    for (int i = 0; i < nHeight; i++)
    {
        pIn0 = (__m256i*)(pY1 + i * nStepSize); //buffer 1 having 8 bit unsigned integers
        pIn1 = (__m256i*)(pY2 + i * nStepSize); //buffer 2 having 8 bit unsigned integers
        pOut = (__m256i*)(pdiffAnd + i * nStepSize);

        int wLimit = nWidth / 32;
        for (int j = 0; j < wLimit; j++)
        {
            a = _mm256_lddqu_si256(pIn0++); //32 values of UINT8 type
            b = _mm256_lddqu_si256(pIn1++); //32 values of UINT8 type


            __m256i diff1 = _mm256_or_si256(_mm256_subs_epu8(a, b), _mm256_subs_epu8(b, a)); //taking their absolute difference

            /* _mm256_cmpgt_epi8 treats the bytes as signed 8-bit integers, so values greater than 127 are seen as negative and fail the comparison */
            __m256i diff1Mask = _mm256_cmpgt_epi8(diff1, thres);

            __m256i blend1 = _mm256_blendv_epi8(diff1, diff1Mask, diff1Mask);

            _mm256_store_si256(pOut++, blend1);
        }
    }

I thought of one solution: find all values less than 0 (as signed bytes) and bitwise-OR that mask with diff1Mask. But I also got stuck on how to find the values less than 0.

PS: I'm a newbie


Solution

  • Assuming your inputs are unsigned bytes

    If you want an expression equivalent to

    result = x>=16 ? 255 : x;
    

    this should be the simplest equivalent AVX2 expression:

    const __m256i threshold = _mm256_set1_epi8(16);
    result = _mm256_or_si256(_mm256_cmpeq_epi8(_mm256_min_epu8(x, threshold), threshold), x);
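
As a rough check of the expression above, the sketch below models a single byte lane in scalar code and compares it with x >= 16 ? 255 : x for every possible byte value. The names threshold_lane, threshold_avx2 and check_threshold_all_bytes are hypothetical helpers made up for this illustration. The intrinsic version is only compiled when the compiler has AVX2 enabled (for example with -mavx2), which is an assumption about the build setup.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Scalar model of one byte lane of: or(cmpeq(min_epu8(x, 16), 16), x).
// Hypothetical helper for illustration, not part of any library.
inline std::uint8_t threshold_lane(std::uint8_t x) {
    const std::uint8_t t = 16;
    const std::uint8_t m = std::min(x, t);            // models _mm256_min_epu8
    const std::uint8_t eq = (m == t) ? 0xFF : 0x00;   // models _mm256_cmpeq_epi8
    return static_cast<std::uint8_t>(eq | x);         // models _mm256_or_si256
}

#if defined(__AVX2__)
// The same expression on 32 unsigned bytes at once.
inline __m256i threshold_avx2(__m256i x) {
    const __m256i t = _mm256_set1_epi8(16);
    return _mm256_or_si256(_mm256_cmpeq_epi8(_mm256_min_epu8(x, t), t), x);
}
#endif

// Compares both versions with the reference for all 256 byte values.
inline void check_threshold_all_bytes() {
    for (int v = 0; v < 256; ++v) {
        const std::uint8_t x = static_cast<std::uint8_t>(v);
        const std::uint8_t expected = static_cast<std::uint8_t>(x >= 16 ? 255 : x);
        assert(threshold_lane(x) == expected);
#if defined(__AVX2__)
        alignas(32) std::uint8_t out[32];
        _mm256_store_si256(reinterpret_cast<__m256i*>(out),
                           threshold_avx2(_mm256_set1_epi8(static_cast<char>(x))));
        for (std::uint8_t b : out) assert(b == expected);
#endif
    }
}
```

The trick relies on min(x, 16) being equal to 16 exactly when x >= 16 as an unsigned byte, so no signed comparison is involved. In the question's loop, x would be the absolute difference diff1.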
    

    On most Intel architectures a pblendvb takes 2 uops (3 on Alder Lake-P), and it also needs an additional 0xff constant. If you already have that constant anyway, the blend below costs about the same as the expression above, depending on the actual architecture and surrounding port usage. On AMD Zen 2 or later it can be better, which may also depend on context:

    result = _mm256_blendv_epi8(x, _mm256_set1_epi8(255),
                                _mm256_adds_epu8(x, _mm256_set1_epi8(0x80 - 16)));
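
To see why the saturating add works as a blend mask, here is a minimal sketch under the same assumptions as before (AVX2 enabled at compile time for the intrinsic part). The names blend_lane, blend_avx2 and check_blend_all_bytes are hypothetical. Adding 0x80 - 16 = 112 with unsigned saturation sets the top bit of a byte exactly when x >= 16, and pblendvb selects per byte on that top bit only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Scalar model of one byte lane of: blendv(x, 0xFF, adds_epu8(x, 0x80 - 16)).
// Hypothetical helper for illustration, not part of any library.
inline std::uint8_t blend_lane(std::uint8_t x) {
    const int sum = x + (0x80 - 16);
    // models _mm256_adds_epu8: unsigned add that saturates at 255
    const std::uint8_t sat = static_cast<std::uint8_t>(std::min(sum, 255));
    // models _mm256_blendv_epi8: take 0xFF where the mask's top bit is set
    return static_cast<std::uint8_t>((sat & 0x80) ? 0xFF : x);
}

#if defined(__AVX2__)
// The same expression on 32 unsigned bytes at once.
inline __m256i blend_avx2(__m256i x) {
    const __m256i mask = _mm256_adds_epu8(x, _mm256_set1_epi8(0x80 - 16));
    // -1 has the same all-ones byte pattern as 255
    return _mm256_blendv_epi8(x, _mm256_set1_epi8(-1), mask);
}
#endif

// Compares both versions with x >= 16 ? 255 : x for all 256 byte values.
inline void check_blend_all_bytes() {
    for (int v = 0; v < 256; ++v) {
        const std::uint8_t x = static_cast<std::uint8_t>(v);
        const std::uint8_t expected = static_cast<std::uint8_t>(x >= 16 ? 255 : x);
        assert(blend_lane(x) == expected);
#if defined(__AVX2__)
        alignas(32) std::uint8_t out[32];
        _mm256_store_si256(reinterpret_cast<__m256i*>(out),
                           blend_avx2(_mm256_set1_epi8(static_cast<char>(x))));
        for (std::uint8_t b : out) assert(b == expected);
#endif
    }
}
```

Writing the constant as -1 instead of 255 avoids relying on how the compiler converts 255 to the char parameter of _mm256_set1_epi8. The resulting bytes should be the same.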