I've spent several hours trying to convert the following code to inline assembly (GCC) but in vain:
int new_low = mid + 1;
int new_high = mid - 1;
if (middle < key) {
low = new_low;
}
if (!(middle < key)) {
high = new_high;
}
I want the compiler to optimize away those ifs and instead use conditional moves. Unfortunately, the compiler keeps generating jumps. I tried writing inline assembly, but I'm not good at it: there's some issue with register clobbering that I can't get right. What am I doing wrong?
__asm__ (
"cmp %[array_middle], %[key];"
"cmovb %[new_low], %[low];"
"cmovae %[new_high], %[high];"
: [low] "=&r"(low), [high] "=&r"(high)
: [new_low] "r"(new_low), [new_high] "r"(new_high), [array_middle] "r"(middle), [key] "r"(key)
: "cc"
);
You usually don't need inline asm for this, but the problem with yours is that [low] "=&r"(low) is a write-only output! You're telling the compiler that the old value of the variable is irrelevant, but that's not true for cmov: it's effectively a 3-input instruction: src, dst, and FLAGS.
Use "+&r"
read/write operands for low/high (the cmov destinations). Turns out this is basically a duplicate of how to force the use of cmov in gcc and VS which shows working inline asm almost identical to yours (using an FP compare instead of integer, but the same cmov instructions.)
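A minimal sketch of your statement with only the constraints changed (untested; it keeps your operand order and condition codes exactly as you wrote them, including the unsigned B/AE conditions, so audit those separately):

__asm__ (
    "cmp %[array_middle], %[key];"
    "cmovb %[new_low], %[low];"
    "cmovae %[new_high], %[high];"
    : [low] "+&r"(low), [high] "+&r"(high)   /* read/write: cmov keeps the old value when the condition is false */
    : [new_low] "r"(new_low), [new_high] "r"(new_high),
      [array_middle] "r"(middle), [key] "r"(key)
    : "cc"
);

The only change from your version is "+&r" instead of "=&r" on the two outputs, which is exactly what lets GCC keep the old values of low and high alive into the asm.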
That should work, but it forces the compiler to use more registers than it would otherwise need. (For example, it could otherwise use LEA after CMP to do the mid+1 and mid-1, overwriting mid. Or even LEA + INC, because the B and AE conditions only read CF, and new enough CPUs don't have partial-flag stalls at all; they just have cmov read both parts of FLAGS separately if necessary. That's why cmovbe is a 2-uop instruction on Skylake, vs. 1 for most other conditions.)
Have you tried ternary operators, like low = (middle < key) ? new_low : low;? Have you tried profile-guided optimization, so GCC can see that the branch is in fact not very predictable? (gcc optimization flag -O3 makes code slower than -O2 shows that if-conversion to branchless code is usually only done at -O3, at least in some cases.)
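For reference, a hedged sketch of what that could look like as a whole loop; the function name, arr, and n are assumptions for illustration, and only the two ternary assignments map directly to your snippet:

/* Lower-bound style binary search: the ternaries express the same logic
   as the two ifs, which GCC can if-convert to cmov (more reliably at -O3
   or with PGO). */
static int lower_bound_cmov(const int *arr, int n, int key)
{
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        int middle = arr[mid];
        int new_low = mid + 1;
        int new_high = mid - 1;
        low  = (middle < key) ? new_low : low;     /* hopefully a cmov */
        high = (middle < key) ? high : new_high;   /* hopefully a cmov */
    }
    return low;  /* first index with arr[low] >= key, or n if none */
}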
Also, keep in mind that going branchless creates a data dependency, which can be worse for a binary search; speculative execution effectively gives you memory parallelism / prefetch instead of serializing loads. (https://agner.org/optimize/)
For a small binary search that hits in L1d cache, branch misses cost more than load-use latency. But once you expect some L2 cache misses, branch recovery is cheaper than load-use latency from L3. See About the branchless binary search.
Software prefetch of the 1/4 and 3/4 elements can help mitigate this, hiding load-use latency. This takes advantage of memory-level parallelism (having multiple loads in flight at once) even for the branchless form where there's a data dependency through the chain of demand loads.
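A sketch of that prefetch idea, continuing the hypothetical loop above; the prefetch targets are just approximations of where the next iteration's mid will land in either half:

while (low <= high) {
    int mid = low + (high - low) / 2;
    /* Start the loads for both possible next probes early, so two lookups
       are in flight at once instead of serializing on arr[mid]. */
    __builtin_prefetch(&arr[low + (mid - low) / 2]);   /* next mid if we go left  */
    __builtin_prefetch(&arr[mid + (high - mid) / 2]);  /* next mid if we go right */
    int middle = arr[mid];
    int new_low = mid + 1;
    int new_high = mid - 1;
    low  = (middle < key) ? new_low : low;
    high = (middle < key) ? high : new_high;
}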
If you can choose a different data structure, What is the most efficient way to implement a BST in such a way the find(value) function is optimized for random values in the tree on x86? shows that a wide implicit tree should be very quickly searchable using SIMD to balance computation against latency, needing only a few steps to reach a hit. It's ordered, so you can traverse it in order, or find the previous or next element after getting a hit.