cassemblybit-manipulationarm64swar

SIMD-within-a-register version of min/max


Suppose I have two uint16_t[4] arrays, a and b. Each integer in these arrays is in the range [0, 16383], so bits 14 and 15 aren't set. Then I have some code to find the minimum and maximum among a[i] and b[i] for each i:

uint16_t min[4], max[4];
for (int i = 0; i < 4; i++) {
    if (a[i] < b[i]) {
        min[i] = a[i];
        max[i] = b[i];
    } else {
        min[i] = b[i];
        max[i] = a[i];
    }
}

Suppose for some reason I can't/won't use SIMD, but I'd still like to compute this as fast as possible, on a 64-bit platform. Thus, a natural solution is to use the SIMD-within-a-register (SWAR) paradigm on 64-bit registers to compute these 4 values in a single iteration, rather than over 4 iterations with 16-bit arithmetic.

What bit-twiddling hacks could be used to implement either (min or max) or ideally both operations using the SWAR paradigm, so that the resulting code is faster than the loop above? My target architecture is ARMv8, so feel free to use any ARMv8 instructions that help reducing instruction counts.

C, assembly, or C + inline assembly solutions are all welcome.


Solution

  • You could use code like this, though it's really a lot longer than just doing it with SIMD:

    orr     x2, x0, #0x8000800080008000     // x2 = 0x8000 | x0
    sub     x2, x2, x1                      // x2 = (0x8000 | x0) - x1
    and     x2, x2, #0x8000800080008000      // x2 = x0 < x1 ? 0x0000 : 0x8000
    mov     x3, #0x7fff7fff7fff7fff
    add     x2, x3, x2, lsr #15             // x2 = x0 < x1 ? 0x7fff : 0x8000
    eor     x4, x0, x1                      // x4 = x0 ^ x1
    and     x3, x4, x2                      // x3 = x0 < x1 ? x0 ^ x1 : 0x0000
    eor     x4, x1, x3                      // x4 = x0 < x1 ? x0 : x1
    eor     x3, x0, x3                      // x3 = x0 < x1 ? x1 : x0
    

    The critical path of this algorithm has 6 instructions. The instructions

    mov     x3, #0x7fff7fff7fff7fff
    eor     x4, x0, x1                      // x4 = x0 ^ x1
    

    are not on the critical path. If executed in a loop, the constant load can likely be hoisted out. The last two instructions can be evaluated independently, yielding minimum and maximum with the same latency.