c++assemblyg++

Smart pointer produces inefficient compiled code


I'm implementing a BST and believe it’s a good idea to use unique_ptr to represent child node ownership. However, I discovered that smart pointers cause the compiler to prefer branching over branchless code (specifically cmov). I’m using g++-13 to compile on Ubuntu 24.04.1 LTS. I tried using the ternary operator to encourage branchless code, but it hasn’t worked. Here’s a simple prototype

#include <memory>
#include <iostream>
struct node {
    int val;
    node* left_child;
    node* right_child;
};

struct smart_node {
    int val;
    std::unique_ptr<smart_node> left_child;
    std::unique_ptr<smart_node> right_child;
};

node*
__attribute__((noinline))
raw(node* root, int key) {
    node* prev = root;
    node* curr = prev -> left_child;
    while (curr != nullptr) {
        prev = curr;
        bool comp = key < (curr -> val);
        curr = comp ? curr -> left_child : curr -> right_child;
    }
    return prev;
}

smart_node*
__attribute__((noinline))
smart(std::unique_ptr<smart_node> root, int key) {
    smart_node* prev = root.get();
    smart_node* curr = prev -> left_child.get();
    while (curr != nullptr) {
        prev = curr;
        bool comp = key < (curr -> val);
        curr = comp ? curr -> left_child.get() : curr -> right_child.get();
    }
    return prev;
}

int main() {
    // Use null instead of a non-degenerative tree to shorten the code
    // I verified this doesn't cause the assembly of the two functions above to mutate
    auto last1 = raw(nullptr, 1);
    auto last2 = smart(std::make_unique<smart_node>(), 1);
    std::cout << last1 -> val << last2 -> val;
}

Below is the assembly code for the loop in raw.

    1310:   48 89 c2                mov    %rax,%rdx
    1313:   48 8b 42 10             mov    0x10(%rdx),%rax
    1317:   3b 32                   cmp    (%rdx),%esi
    1319:   48 0f 4c 42 08          cmovl  0x8(%rdx),%rax
    131e:   48 85 c0                test   %rax,%rax
    1321:   75 ed                   jne    1310 <_Z3rawP4nodei+0x20>

Below is the assembly code for the loop in smart.

    1360:   48 8b 50 08             mov    0x8(%rax),%rdx
    1364:   48 85 d2                test   %rdx,%rdx
    1367:   74 10                   je     1379 <_Z5smartSt10unique_ptrI10smart_nodeSt14default_deleteIS0_EEi+0x39>
    1369:   48 89 d0                mov    %rdx,%rax
    136c:   3b 30                   cmp    (%rax),%esi
    136e:   7c f0                   jl     1360 <_Z5smartSt10unique_ptrI10smart_nodeSt14default_deleteIS0_EEi+0x20>
    1370:   48 8b 50 10             mov    0x10(%rax),%rdx
    1374:   48 85 d2                test   %rdx,%rdx
    1377:   75 f0                   jne    1369 <_Z5smartSt10unique_ptrI10smart_nodeSt14default_deleteIS0_EEi+0x29>

In benchmark (2e6 call to a tree with 2e6 nodes), the branching code takes about 1.4x time the branchless version. I used RelWithDebInfo so O2 optimization is on. My question is how I can force compiler to use cmov? I read this post and it seems that there is no clean way to force this behavior. If it is not possible, I'd like to know why the compiler generates different code for the two cases.

To clarify, I don't want assembly since it is not portable. I used assembly to force the branchless version, but it is not the direction I want to pursue.


Solution

  • g++ 13 indeed produces worse code for the smart pointers. Newer versions of gcc (14.2) seem to produce almost identical code, though.

    You can get even better assembly by eliminating even the inlined branch entirely while still holding the branches in unique pointers.

    struct sv_node {
        int val;
        std::array<std::unique_ptr<sv_node>,2> child{};
    };
    
    sv_node*
    __attribute__((noinline))
    svec(sv_node* root, int key) {
        auto prev = root;
        auto curr = prev->child[0].get();
        while (curr != nullptr) {
            prev = curr;
            bool const comp = key >= (curr->val);
            curr = curr->child[comp].get();
        }
        return prev;
    }
    

    This produces the following assembly with gcc 13.3:

    svec(sv_node*, int):
            mov     rdx, QWORD PTR [rdi+8]
            test    rdx, rdx
            je      .L19
    .L21:
            xor     eax, eax
            cmp     DWORD PTR [rdx], esi
            mov     rdi, rdx
            setle   al
            mov     rdx, QWORD PTR [rdx+8+rax*8]
            test    rdx, rdx
            jne     .L21
    .L19:
            mov     rax, rdi
            ret
    

    (live demo)

    You can also, instead of using std::array<std::unique_ptr<node>,2>, make it an std::vector<node> with similar assembly but slightly different traversal logic. And I suspect similar performance.

    However, very importantly, in order to figure out performance, you have to benchmark. Just looking at assembly will not be sufficient.