I'm implementing a BST and believe it’s a good idea to use unique_ptr to represent child node ownership. However, I discovered that smart pointers cause the compiler to prefer branching over branchless code (specifically cmov). I’m using g++-13 to compile on Ubuntu 24.04.1 LTS. I tried using the ternary operator to encourage branchless code, but it hasn’t worked. Here’s a simple prototype
#include <memory>
#include <iostream>
struct node {
int val;
node* left_child;
node* right_child;
};
struct smart_node {
int val;
std::unique_ptr<smart_node> left_child;
std::unique_ptr<smart_node> right_child;
};
node*
__attribute__((noinline))
raw(node* root, int key) {
node* prev = root;
node* curr = prev -> left_child;
while (curr != nullptr) {
prev = curr;
bool comp = key < (curr -> val);
curr = comp ? curr -> left_child : curr -> right_child;
}
return prev;
}
smart_node*
__attribute__((noinline))
smart(std::unique_ptr<smart_node> root, int key) {
smart_node* prev = root.get();
smart_node* curr = prev -> left_child.get();
while (curr != nullptr) {
prev = curr;
bool comp = key < (curr -> val);
curr = comp ? curr -> left_child.get() : curr -> right_child.get();
}
return prev;
}
int main() {
// Use null instead of a non-degenerative tree to shorten the code
// I verified this doesn't cause the assembly of the two functions above to mutate
auto last1 = raw(nullptr, 1);
auto last2 = smart(std::make_unique<smart_node>(), 1);
std::cout << last1 -> val << last2 -> val;
}
Below is the assembly code for the loop in raw
.
1310: 48 89 c2 mov %rax,%rdx
1313: 48 8b 42 10 mov 0x10(%rdx),%rax
1317: 3b 32 cmp (%rdx),%esi
1319: 48 0f 4c 42 08 cmovl 0x8(%rdx),%rax
131e: 48 85 c0 test %rax,%rax
1321: 75 ed jne 1310 <_Z3rawP4nodei+0x20>
Below is the assembly code for the loop in smart
.
1360: 48 8b 50 08 mov 0x8(%rax),%rdx
1364: 48 85 d2 test %rdx,%rdx
1367: 74 10 je 1379 <_Z5smartSt10unique_ptrI10smart_nodeSt14default_deleteIS0_EEi+0x39>
1369: 48 89 d0 mov %rdx,%rax
136c: 3b 30 cmp (%rax),%esi
136e: 7c f0 jl 1360 <_Z5smartSt10unique_ptrI10smart_nodeSt14default_deleteIS0_EEi+0x20>
1370: 48 8b 50 10 mov 0x10(%rax),%rdx
1374: 48 85 d2 test %rdx,%rdx
1377: 75 f0 jne 1369 <_Z5smartSt10unique_ptrI10smart_nodeSt14default_deleteIS0_EEi+0x29>
In benchmark (2e6 call to a tree with 2e6 nodes), the branching code takes about 1.4x time the branchless version. I used RelWithDebInfo
so O2 optimization is on. My question is how I can force compiler to use cmov
? I read this post and it seems that there is no clean way to force this behavior. If it is not possible, I'd like to know why the compiler generates different code for the two cases.
To clarify, I don't want assembly since it is not portable. I used assembly to force the branchless version, but it is not the direction I want to pursue.
g++ 13 indeed produces worse code for the smart pointers. Newer versions of gcc (14.2) seem to produce almost identical code, though.
You can get even better assembly by eliminating even the inlined branch entirely while still holding the branches in unique pointers.
struct sv_node {
int val;
std::array<std::unique_ptr<sv_node>,2> child{};
};
sv_node*
__attribute__((noinline))
svec(sv_node* root, int key) {
auto prev = root;
auto curr = prev->child[0].get();
while (curr != nullptr) {
prev = curr;
bool const comp = key >= (curr->val);
curr = curr->child[comp].get();
}
return prev;
}
This produces the following assembly with gcc 13.3:
svec(sv_node*, int):
mov rdx, QWORD PTR [rdi+8]
test rdx, rdx
je .L19
.L21:
xor eax, eax
cmp DWORD PTR [rdx], esi
mov rdi, rdx
setle al
mov rdx, QWORD PTR [rdx+8+rax*8]
test rdx, rdx
jne .L21
.L19:
mov rax, rdi
ret
You can also, instead of using std::array<std::unique_ptr<node>,2>
, make it an std::vector<node>
with similar assembly but slightly different traversal logic. And I suspect similar performance.
However, very importantly, in order to figure out performance, you have to benchmark. Just looking at assembly will not be sufficient.