[SOLVED] x86-64 SIMD mechanism to "compare" 8-bit unsigned integers, giving a vector of +1 / 0 / -1 results (signum)?

x86-64 SIMD mechanism to "compare" 8-bit unsigned integers, giving a vector of +1 / 0 / -1 results (signum)?

Let's say I have two unsigned integer (8-bit) packed registers a and b. I'd like to compare them and get back +1 for a > b, 0 for a=b, or -1 for a < b. Alternatively, distance also works (i.e. instead of -1/+1, the actual difference is returned).

What's the SIMD (presumably AVX2) approach for accomplishing this efficiently? I can't use AVX512 but would be good to know if this is an AVX512 feature.

Solution

a <=> b for signed bytes would be _mm256_subs_epi8(a, b) (subtract using signed saturation). The result can be normailzed to -1 / 0 / 1 using _mm256_sign_epi8(_mm256_set1_epi8(-1), res)

Unsigned bytes could be "range shifted"^[1] to signed bytes:

const __m256i x80 = _mm256_set1_epi8(-128);
__m256i res = _mm256_subs_epi8(_mm256_xor_si256(a, x80), _mm256_xor_si256(b, x80));

Absolute difference can be found by

_mm256_or_si256(_mm256_subs_epu8(a, b), _mm256_subs_epu8(b, a)) or

_mm256_sub_epi8(_mm256_max_epu8(a, b), _mm256_min_epu8(a, b)).

[1] Range Shifting:

SSE2 (or later) often have instructions for signed integers without any corresponding instruction for unsigned integers (or vice versa).

To convert 8-bit integers between signed and unsigned ranges just add or subtract 128.

Signed 8-bit integers have the range -128..127
Unsigned 8-bit integers have the range 0..255

A micro-optimization for this conversion is to flip the most significant bit of a lane using XOR. 128 is 0b10000000 (0x80) in two's complement. There is no 9th bit so there is no need for a borrow or carry. Logical ops may run on more execution ports than arithmetic ops, on some architectures. Most SSE2 instructions have destructive two-operand forms so XOR allows the compiler to freely reorder x ^= y into y ^= x, unlike subtraction.