c++ gcc optimization return-value-optimization nrvo

NRVO vs early return for types not benefitting from move semantics (GCC 14 -Wnrvo)

GCC 14 introduced a new -Wnrvo flag:

New -Wnrvo warning, to warn if the named return value optimization is not performed although it is allowed by [class.copy.elision]. See the manual for more info.

I decided to see what warnings would pop up if I added the flag to SFML 3.x, an open-source multimedia library, and a few warnings came up, all similar to the case shown here:

struct JoystickCaps
{
    unsigned int buttonCount{};
    std::array<bool, 8> axes{};
};

// Early return version (original)
JoystickCaps JoystickImpl::getCapabilities() const
{
    if (directInput)
        return getCapabilitiesDInput(); // <- early return

    JoystickCaps caps;

    caps.buttonCount = m_caps.wNumButtons;
    if (caps.buttonCount > Joystick::ButtonCount)
        caps.buttonCount = Joystick::ButtonCount;

    // ...set more members in 'caps'...

    return caps;
}

This code gets flagged by GCC's new warning:

./SFML/src/SFML/Window/Win32/JoystickImpl.cpp: 
    In member function 'JoystickCaps JoystickImpl::getCapabilities() const':

./SFML/src/SFML/Window/Win32/JoystickImpl.cpp:342:12: 
    warning: not eliding copy on return in 
   'JoystickCaps JoystickImpl::getCapabilities() const' [-Wnrvo]
  342 |     return caps;
      |            ^~~~

I decided to change the code in such a way that the warning would be silenced, but the only solution I managed to find was this one:

// NRVO version
JoystickCaps JoystickImpl::getCapabilities() const
{
    JoystickCaps caps;

    if (directInput)
        caps = getCapabilitiesDInput();
    else 
    {
        caps.buttonCount = m_caps.wNumButtons;
        if (caps.buttonCount > Joystick::ButtonCount)
            caps.buttonCount = Joystick::ButtonCount;

        // ...set more members in 'caps'...
    }

    return caps;
}

I was not really satisfied, as this version is less readable IMHO, so I benchmarked both on quick-bench.com to see what the benefit would be.

It turns out that with -O3, the early return version is 3.8x faster than the NRVO version.

Questions:

Is GCC's warning being misleading in this case? Does it make sense to purposefully avoid NRVO in situations where an early return is possible?
Why is the NRVO version so much slower? Is there a mistake in my benchmark? Is there a way I could rewrite the function to both support NRVO and not be less efficient than the original one?

Solution

TL;DR: This is a nightmare of misleading microbenchmarks, confounded by complicated optimizations and by zeroing at initialization. The warning is not very helpful here, and part of the slowdown comes from zeroing values that are overwritten later, which interferes with the optimizer.

I will go off the definition of JoystickCaps you provided on Quick Bench instead of the SFML definition, since that seems more pertinent to the question:

struct JoystickCaps
{
    unsigned int buttonCount{};
    int axes[16]{}; 
};

The loop you're benchmarking is

bool b = true;
for (auto _ : state) 
{
    auto result = /* one of the functions passed with b as condition */;
    benchmark::DoNotOptimize(result);
    b = !b;
}

One important note: the function is always inlined into the benchmark loop, so the differences you see are never caused by an actual function epilogue returning an object.

In the NRVO case, the optimizer gets confused and creates an extra result object even though none is needed. No rule in the standard affects this, because there is no difference in observable behaviour, and there is no ABI concern, because everything is inlined. This is just an optimizer quirk. For example, the following assembly for b == false creates two identical result objects, at rsp and rsp + 0x50:

movaps XMMWORD PTR [rsp+0x10],xmm0        # xmm0 is all zeros
mov    DWORD PTR [rsp],0x20               # result1.buttonCount = 32
movups XMMWORD PTR [rsp+0x4],xmm5         # result1.axes[0:4] = {5, 5, 5, 5}
movdqa xmm6,XMMWORD PTR [rsp]             # xmm6 = {32, 5, 5, 5}
movdqa xmm7,XMMWORD PTR [rsp+0x10]        # xmm7 = {5, 0, 0, 0}
mov    DWORD PTR [rsp+0x40],0x0           
mov    DWORD PTR [rsp+0x90],0x0           
movaps XMMWORD PTR [rsp+0x20],xmm0        
movaps XMMWORD PTR [rsp+0x30],xmm0
movaps XMMWORD PTR [rsp+0x50],xmm6        # result2.buttonCount, result2.axes[0:3] = {32, 5, 5, 5}
movaps XMMWORD PTR [rsp+0x60],xmm7        # result2.axes[3:7] = {5, 0, 0, 0}
movaps XMMWORD PTR [rsp+0x70],xmm0
movaps XMMWORD PTR [rsp+0x80],xmm0

Contrast with the early return version

mov    DWORD PTR [rsp+0x50],0x20          # result.buttonCount = 32
movups XMMWORD PTR [rsp+0x54],xmm1        # result.axes[0:4] = {5, 5, 5, 5}
movups XMMWORD PTR [rsp+0x64],xmm0
movups XMMWORD PTR [rsp+0x74],xmm0
movups XMMWORD PTR [rsp+0x84],xmm0

This is the primary slowdown you observe.

If you benchmark this loop instead

bool b = true;
for (auto _ : state) 
{
    benchmark::DoNotOptimize(/* one of the functions passed with b as condition */);
    b = !b;
}

then the optimizer has less trouble, and both versions of the function compile to identical assembly, which is as fast as the early return version in your benchmark.

Quickbench link.

To observe the actual effect of NRVO, you can force the functions not to be inlined. In that case, you will see that the NRVO version first unconditionally zeroes out the result object, then decides which values to overwrite according to the branch condition:

f1(bool):
        pxor    xmm0, xmm0
        mov     DWORD PTR [rdi+64], 0
        mov     rax, rdi
        movups  XMMWORD PTR [rdi], xmm0
        movups  XMMWORD PTR [rdi+16], xmm0
        movups  XMMWORD PTR [rdi+32], xmm0
        movups  XMMWORD PTR [rdi+48], xmm0
        test    sil, sil
        je      .L7

In contrast, the early return version branches first and directly fills in the correct values. Again, this is due to some optimizer quirk that I don't quite understand; both are legal as far as the C++ standard is concerned.
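A minimal sketch of how one might keep both variants out of line so their generated code can be compared, for example with `g++ -O3 -S`. Only `f1` appears in the listing above; `f2`, the `NOINLINE` macro, the stand-in `getCapabilitiesDInput()` and the constants 8, 32 and 5 are assumptions made for illustration. `__attribute__((noinline))` is the GCC/Clang attribute, and the macro expands to nothing on other compilers.

```cpp
#include <cstddef>

// Keep the functions out of line on GCC/Clang; expands to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

struct JoystickCaps
{
    unsigned int buttonCount{};
    int axes[16]{};
};

// Hypothetical stand-in for the real DirectInput query.
NOINLINE JoystickCaps getCapabilitiesDInput()
{
    JoystickCaps caps;
    caps.buttonCount = 8;
    return caps;
}

// Single named return object: eligible for NRVO.
NOINLINE JoystickCaps f1(bool directInput)
{
    JoystickCaps caps;
    if (directInput)
        caps = getCapabilitiesDInput();
    else
    {
        caps.buttonCount = 32;
        for (std::size_t i = 0; i < 5; ++i)
            caps.axes[i] = 5;
    }
    return caps;
}

// Early return: two different return expressions, so 'caps' is not
// the only thing this function returns.
NOINLINE JoystickCaps f2(bool directInput)
{
    if (directInput)
        return getCapabilitiesDInput();

    JoystickCaps caps;
    caps.buttonCount = 32;
    for (std::size_t i = 0; i < 5; ++i)
        caps.axes[i] = 5;
    return caps;
}
```

Both functions return the same values for the same argument, so any difference in the assembly comes from how the compiler materializes the result, not from the logic.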

Now things get even messier. If a JoystickCaps object is required to be created (such as when it is passed to another function by reference), then the warning is correct: without NRVO there will be an extra copy. However, whether an object is required to be created is a fickle matter. Very small changes can make the compiler decide one way or the other.

Amusingly enough, in this case the warning is wrong: there is no copy in the early return version anyway, and the NRVO version ends up ~1.2x slower.

Same quickbench link as above.
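To make "an extra copy" concrete, here is a rough sketch with an instrumented type that counts its own copies. `Caps`, `g_copies`, `makeDInputCaps`, `earlyReturn` and `singleReturn` are hypothetical names, not SFML code. The user-declared copy operations suppress the implicit move operations, which mimics a type that gains nothing from move semantics.

```cpp
// Counts copy constructions and copy assignments of Caps.
int g_copies = 0;

// Hypothetical instrumented stand-in for JoystickCaps.
struct Caps
{
    unsigned int buttonCount = 0;

    Caps() = default;
    explicit Caps(unsigned int n) : buttonCount(n) {}
    Caps(const Caps& other) : buttonCount(other.buttonCount) { ++g_copies; }
    Caps& operator=(const Caps& other)
    {
        buttonCount = other.buttonCount;
        ++g_copies;
        return *this;
    }
};

// Returns a prvalue, so no copy is involved here.
Caps makeDInputCaps()
{
    return Caps(8);
}

// Early return version: 'caps' is not the only thing returned, so the
// compiler may have to copy it on the final return.
Caps earlyReturn(bool directInput)
{
    if (directInput)
        return makeDInputCaps();

    Caps caps;
    caps.buttonCount = 32;
    return caps;
}

// Single return version: 'caps' is an NRVO candidate, but the
// direct-input branch now pays for a copy assignment instead.
Caps singleReturn(bool directInput)
{
    Caps caps;
    if (directInput)
        caps = makeDInputCaps();
    else
        caps.buttonCount = 32;
    return caps;
}
```

Resetting `g_copies` to 0 and reading it after each call shows where copies happen. Going by the -Wnrvo diagnostic, GCC 14 would be expected to report one copy for `earlyReturn(false)` and none for `singleReturn(false)`, while `singleReturn(true)` always pays for at least the assignment. The exact counts depend on the compiler and flags, and for a trivially copyable type like the real JoystickCaps the optimizer may remove the copy entirely.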

Finally, addressing the zeroing at initialization. If you were to instead define JoystickCaps as

struct JoystickCaps
{
    unsigned int buttonCount;
    int axes[16]; 
};

and manually zero out values only when needed, then you will see that whether the result object is stored in a local variable no longer impedes the optimizer, and the inlined versions compile to equally fast code.
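As a rough sketch of what "zero out values only when needed" could look like: each branch below writes every member exactly once, so nothing is zeroed and then overwritten. The function name `getCapabilities` as a free function, its `bool` parameter and the constants 8, 32 and 5 are assumptions for illustration, not the benchmarked code.

```cpp
#include <algorithm>
#include <iterator>

struct JoystickCaps
{
    unsigned int buttonCount; // no default member initializer
    int axes[16];
};

JoystickCaps getCapabilities(bool directInput)
{
    JoystickCaps caps; // deliberately left uninitialized

    if (directInput)
    {
        caps.buttonCount = 8;
        std::fill(std::begin(caps.axes), std::end(caps.axes), 0);
    }
    else
    {
        caps.buttonCount = 32;
        // First five axes are set to 5; only the remainder is zeroed.
        std::fill(std::begin(caps.axes), std::begin(caps.axes) + 5, 5);
        std::fill(std::begin(caps.axes) + 5, std::end(caps.axes), 0);
    }

    return caps;
}
```

The trade-off is that every path must now initialize every member by hand; forgetting one leaves an indeterminate value in the returned object.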

In the non-inlined case, if an extra copy is performed, then the NRVO version is ~1.7x faster. If no extra copy is performed, then the NRVO version is ~1.1x slower.

Quickbench link.