GCC 14 introduced a new -Wnrvo flag:
"New -Wnrvo warning, to warn if the named return value optimization is not performed although it is allowed by [class.copy.elision]. See the manual for more info."
I decided to see what warnings would pop up if I added the flag to SFML 3.x, an open-source multimedia library, and a few warnings came up, all similar to the case shown here:
    struct JoystickCaps
    {
        unsigned int        buttonCount{};
        std::array<bool, 8> axes{};
    };
    // Early return version (original)
    JoystickCaps JoystickImpl::getCapabilities() const
    {
        if (directInput)
            return getCapabilitiesDInput(); // <- early return

        JoystickCaps caps;
        caps.buttonCount = m_caps.wNumButtons;
        if (caps.buttonCount > Joystick::ButtonCount)
            caps.buttonCount = Joystick::ButtonCount;
        // ...set more members in 'caps'...
        return caps;
    }
This code gets flagged by GCC's new warning:
./SFML/src/SFML/Window/Win32/JoystickImpl.cpp:
In member function 'JoystickCaps JoystickImpl::getCapabilities() const':
./SFML/src/SFML/Window/Win32/JoystickImpl.cpp:342:12:
warning: not eliding copy on return in
'JoystickCaps JoystickImpl::getCapabilities() const' [-Wnrvo]
342 | return caps;
| ^~~~
I decided to change the code in such a way that the warning would be silenced, but the only solution I managed to find was this one:
    // NRVO version
    JoystickCaps JoystickImpl::getCapabilities() const
    {
        JoystickCaps caps;
        if (directInput)
            caps = getCapabilitiesDInput();
        else
        {
            caps.buttonCount = m_caps.wNumButtons;
            if (caps.buttonCount > Joystick::ButtonCount)
                caps.buttonCount = Joystick::ButtonCount;
            // ...set more members in 'caps'...
        }
        return caps;
    }
I wasn't really satisfied, as this version is less readable IMHO, so I decided to benchmark it on quick-bench.com to see what the benefit would be.
It turns out that with -O3, the early return version is 3.8x faster than the NRVO version.
Questions:
Is GCC's warning being misleading in this case? Does it make sense to purposefully avoid NRVO in situations where an early return is possible?
Why is the NRVO version so much slower? Is there a mistake in my benchmark? Is there a way I could rewrite the function to both support NRVO and not be less efficient than the original one?
TL;DR: This is a nightmare of misleading microbenchmarks, confounded by complicated optimizations and by zeroing at initialization. The warning is not very helpful here. Part of the problem is that zeroing values which are overwritten later interferes with the optimizer.
I will go off the definition of JoystickCaps you provided in quickbench instead of the SFML definition, since that seems more pertinent to the question:
    struct JoystickCaps
    {
        unsigned int buttonCount{};
        int          axes[16]{};
    };
The loop you're benchmarking is:
    bool b = true;
    for (auto _ : state)
    {
        auto result = /* one of the functions passed with b as condition */;
        benchmark::DoNotOptimize(result);
        b = !b;
    }
One important note is that the function is always inlined into the benchmark loop and the differences you see are never caused by actual function epilogues dealing with returning an object.
In the NRVO case, the optimizer gets confused and creates an extra result object even though it is not needed. There is no standard rule affecting this because there is no difference in observable behaviour, and there is no ABI concern because everything is inlined. This is just an optimizer quirk. For example, the following is the assembly for b == false, which creates two identical result objects at rsp and rsp + 0x50:
movaps XMMWORD PTR [rsp+0x10],xmm0 # xmm0 is all zeros
mov DWORD PTR [rsp],0x20 # result1.buttonCount = 32
movups XMMWORD PTR [rsp+0x4],xmm5 # result1.axes[0:4] = {5, 5, 5, 5}
movdqa xmm6,XMMWORD PTR [rsp] # xmm6 = {32, 5, 5, 5}
movdqa xmm7,XMMWORD PTR [rsp+0x10] # xmm7 = {5, 0, 0, 0}
mov DWORD PTR [rsp+0x40],0x0
mov DWORD PTR [rsp+0x90],0x0
movaps XMMWORD PTR [rsp+0x20],xmm0
movaps XMMWORD PTR [rsp+0x30],xmm0
movaps XMMWORD PTR [rsp+0x50],xmm6 # result2.buttonCount, result2.axes[0:3] = {32, 5, 5, 5}
movaps XMMWORD PTR [rsp+0x60],xmm7 # result2.axes[3:7] = {5, 0, 0, 0}
movaps XMMWORD PTR [rsp+0x70],xmm0
movaps XMMWORD PTR [rsp+0x80],xmm0
Contrast this with the early return version:
mov DWORD PTR [rsp+0x50],0x20 # result.buttonCount = 32
movups XMMWORD PTR [rsp+0x54],xmm1 # result.axes[0:4] = {5, 5, 5, 5}
movups XMMWORD PTR [rsp+0x64],xmm0
movups XMMWORD PTR [rsp+0x74],xmm0
movups XMMWORD PTR [rsp+0x84],xmm0
This is the primary slowdown you observe.
Suppose you benchmark this loop instead:
    bool b = true;
    for (auto _ : state)
    {
        benchmark::DoNotOptimize(/* one of the functions passed with b as condition */);
        b = !b;
    }
With this loop, the optimizer has less trouble, and the two versions of the function compile to identical assembly, which is as fast as the early return version you saw before.
To observe the actual NRVO effect, you can force the functions not to be inlined. In that case, you will see that the NRVO version first unconditionally zeroes out result, then decides what values to overwrite it with according to the branch condition:
f1(bool):
pxor xmm0, xmm0
mov DWORD PTR [rdi+64], 0
mov rax, rdi
movups XMMWORD PTR [rdi], xmm0
movups XMMWORD PTR [rdi+16], xmm0
movups XMMWORD PTR [rdi+32], xmm0
movups XMMWORD PTR [rdi+48], xmm0
test sil, sil
je .L7
In contrast, the early return version branches first and directly fills in the correct values. Again, this is an optimizer quirk that I don't quite understand; both behaviours are legal as far as the C++ standard is concerned.
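For anyone who wants to inspect the out-of-line code themselves, here is a minimal sketch of the two shapes with inlining disabled. It assumes GCC or Clang, since the `__attribute__((noinline))` spelling is compiler-specific. The name `f1` follows the listing above; `f2`, `fromOtherBackend`, `sameCaps`, `checkShapesAgree` and the fill values (32 and 5, taken from the assembly comments) are stand-ins of mine, not the actual benchmark source.

```cpp
#include <cassert>

struct JoystickCaps
{
    unsigned int buttonCount{};
    int          axes[16]{};
};

// Hypothetical stand-in for getCapabilitiesDInput().
__attribute__((noinline)) JoystickCaps fromOtherBackend()
{
    JoystickCaps caps;
    caps.buttonCount = 8;
    return caps;
}

// NRVO shape: a single named object is returned on every path.
__attribute__((noinline)) JoystickCaps f1(bool useOther)
{
    JoystickCaps caps;
    if (useOther)
        caps = fromOtherBackend();
    else
    {
        caps.buttonCount = 32;
        for (int i = 0; i < 4; ++i)
            caps.axes[i] = 5;
    }
    return caps;
}

// Early return shape: one path returns a prvalue, the other a named object.
__attribute__((noinline)) JoystickCaps f2(bool useOther)
{
    if (useOther)
        return fromOtherBackend();

    JoystickCaps caps;
    caps.buttonCount = 32;
    for (int i = 0; i < 4; ++i)
        caps.axes[i] = 5;
    return caps;
}

// Member-wise comparison, so the two shapes can be checked against each other.
bool sameCaps(const JoystickCaps& a, const JoystickCaps& b)
{
    if (a.buttonCount != b.buttonCount)
        return false;
    for (int i = 0; i < 16; ++i)
        if (a.axes[i] != b.axes[i])
            return false;
    return true;
}

// The two shapes differ only in how the result is built, not in its value.
void checkShapesAgree()
{
    assert(sameCaps(f1(true), f2(true)));
    assert(sameCaps(f1(false), f2(false)));
}
```

Compiling this with -O3 -S (or pasting it into Compiler Explorer) lets you compare the bodies of f1 and f2 directly. The exact instructions will depend on the compiler version and flags.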
Now things get even messier. If a JoystickCaps object is required to be created (such as when it is passed to another function by reference), then the warning is correct: without NRVO there will be an extra copy. However, whether an object is required to be created is a whimsical matter. Very small changes can lead the compiler to decide one way or the other.
Amusingly enough, in this case the warning is wrong and there is no copy in the early return version anyway, resulting in a ~1.2x slowdown for the NRVO version.
Same quickbench link as above.
Finally, addressing the zeroing at initialization. If you were to instead define JoystickCaps as
    struct JoystickCaps
    {
        unsigned int buttonCount;
        int          axes[16];
    };
and manually zero out values only when needed, then you will see that whether the result object is stored as a local variable or not no longer impedes the optimizer, and the inlined versions compile to equally fast code.
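As a rough sketch of what "manually zeroing only when needed" could look like, assuming the same fill values as in the assembly above (32 buttons, first four axes set to 5): the function name `makeCaps` and the helper `checkMakeCaps` are made up for illustration.

```cpp
#include <cassert>

struct JoystickCaps
{
    unsigned int buttonCount; // no default member initializers
    int          axes[16];
};

// Hypothetical function: every member is written exactly once.
JoystickCaps makeCaps()
{
    JoystickCaps caps; // members are indeterminate at this point

    caps.buttonCount = 32;
    for (int i = 0; i < 4; ++i)
        caps.axes[i] = 5;

    // Zero only the elements that were not assigned above.
    for (int i = 4; i < 16; ++i)
        caps.axes[i] = 0;

    return caps;
}

// Sanity check that nothing is left uninitialized by mistake.
void checkMakeCaps()
{
    const JoystickCaps caps = makeCaps();
    assert(caps.buttonCount == 32);
    assert(caps.axes[0] == 5 && caps.axes[3] == 5);
    assert(caps.axes[4] == 0 && caps.axes[15] == 0);
}
```

The trade-off is that forgetting to write a member now leaves it indeterminate instead of zero, so this shape needs more care than the version with default member initializers.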
In the no inline case, if an extra copy is performed, then the NRVO version is faster by ~1.7x. If no extra copy is performed, then the NRVO version is slower by ~1.1x.
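As for rewriting the function, one option worth trying (I have not benchmarked it) is to move the code after the early return into its own helper, so that the outer function only ever returns prvalues. Since C++17 a returned prvalue initializes the caller's object directly, so the outer function has no named local for NRVO to apply to, and the helper has a single named local returned on its only path. The sketch below uses free functions and made-up names (`capsFromDirectInput`, `capsFromLegacyApi`, `maxButtons`) in place of the SFML members.

```cpp
#include <cassert>

struct JoystickCaps
{
    unsigned int buttonCount{};
    int          axes[16]{};
};

// Hypothetical stand-in for getCapabilitiesDInput().
JoystickCaps capsFromDirectInput()
{
    JoystickCaps caps;
    caps.buttonCount = 8;
    return caps;
}

// Hypothetical helper holding the code that used to follow the early return.
JoystickCaps capsFromLegacyApi(unsigned int reportedButtons)
{
    constexpr unsigned int maxButtons = 32; // assumed stand-in for Joystick::ButtonCount

    JoystickCaps caps; // the only named local, returned on the only path
    caps.buttonCount = reportedButtons > maxButtons ? maxButtons : reportedButtons;
    // ...set more members in 'caps'...
    return caps;
}

// Keeps the early return, but every return statement is a function call.
JoystickCaps getCapabilities(bool directInput, unsigned int reportedButtons)
{
    if (directInput)
        return capsFromDirectInput();

    return capsFromLegacyApi(reportedButtons);
}

// Small self-check of both paths.
void checkGetCapabilities()
{
    assert(getCapabilities(true, 0).buttonCount == 8);
    assert(getCapabilities(false, 40).buttonCount == 32);
    assert(getCapabilities(false, 5).buttonCount == 5);
}
```

Whether this ends up faster or slower than the original is subject to the same inlining and zeroing effects described above, so measure it in the real call site before committing to it.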