Suppose I have two uint16_t[4]
arrays, a
and b
. Each integer in these arrays is in the range [0, 16383], so bits 14 and 15 aren't set. Then I have some code to find the minimum and maximum among a[i]
and b[i]
for each i
:
uint16_t min[4], max[4];
for (int i = 0; i < 4; i++) {
if (a[i] < b[i]) {
min[i] = a[i];
max[i] = b[i];
} else {
min[i] = b[i];
max[i] = a[i];
}
}
Suppose for some reason I can't/won't use SIMD, but I'd still like to compute this as fast as possible, on a 64-bit platform. Thus, a natural solution is to use the SIMD-within-a-register (SWAR) paradigm on 64-bit registers to compute these 4 values in a single iteration, rather than over 4 iterations with 16-bit arithmetic.
What bit-twiddling hacks could be used to implement either (min or max) or ideally both operations using the SWAR paradigm, so that the resulting code is faster than the loop above? My target architecture is ARMv8, so feel free to use any ARMv8 instructions that help reducing instruction counts.
C, assembly, or C + inline assembly solutions are all welcome.
You could use code like this, though it's really a lot longer than just doing it with SIMD:
orr x2, x0, #0x8000800080008000 // x2 = 0x8000 | x0
sub x2, x2, x1 // x2 = (0x8000 | x0) - x1
and x2, x2, #0x8000800080008000 // x2 = x0 < x1 ? 0x0000 : 0x8000
mov x3, #0x7fff7fff7fff7fff
add x2, x3, x2, lsr #15 // x2 = x0 < x1 ? 0x7fff : 0x8000
eor x4, x0, x1 // x4 = x0 ^ x1
and x3, x4, x2 // x3 = x0 < x1 ? x0 ^ x1 : 0x0000
eor x4, x1, x3 // x4 = x0 < x1 ? x0 : x1
eor x3, x0, x3 // x3 = x0 < x1 ? x1 : x0
The critical path of this algorithm has 6 instructions. The instructions
mov x3, #0x7fff7fff7fff7fff
eor x4, x0, x1 // x4 = x0 ^ x1
are not on the critical path. If executed in a loop, the constant load can likely be hoisted out. The last two instructions can be evaluated independently, yielding minimum and maximum with the same latency.