c++, c, performance, optimization

Is a nested function call faster or not?


I have this silly argument with a friend and need an authoritative word on it.

I have these two snippets and want to know which one is faster? [A or B]

(assuming that the compiler does not optimize anything)

[A]

if ( foo () )

[B]

int t = foo ();
if ( t )

EDIT: Guys, this might look like a silly question to you, but I have a hardware engineer friend who was arguing that even WITHOUT optimization (take any processor/compiler pair), CASE B is always faster, because it DOES NOT fetch the result of the previous instruction from memory but accesses it directly from the Common Data Bus by bypassing/forwarding (remember the classic 5-stage pipeline).

My argument, on the other hand, was that without the compiler informing it how much data to copy or check, it is not possible to do that (you have to go to memory to get the data, WITHOUT the compiler optimizing that away).


Solution

  • For the record, gcc, when compiling with optimization specifically disabled (-O0), produces different code for the two inputs (in my case, the body of foo was return rand(); so that the result would not be determined at compile time).
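
    The exact test file isn't shown in the answer; a minimal sketch matching that description (a foo that returns rand(), called as an if condition, compiled with something like gcc -O0 -S) would be:

        /* Minimal sketch of the kind of test file described above;
           the exact program used for the listings isn't shown. */
        #include <stdlib.h>

        int foo(void)
        {
            return rand();      /* result not known at compile time */
        }

        int main(void)
        {
            if (foo())          /* or: int t = foo(); if (t) */
            {
                /* inside of if block */
            }
            /* rest of main() */
            return 0;
        }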

    Without temporary variable t:

            movl    $0, %eax
            call    foo
            testl   %eax, %eax
            je      .L4
            /* inside of if block */
    .L4:
            /* rest of main() */
    

    Here, the return value of foo is left in the EAX register, the register is tested against itself to see whether it is 0, and if so, execution jumps over the body of the if block.

    With temporary variable t:

            movl    $0, %eax
            call    foo
            movl    %eax, -4(%rbp)
            cmpl    $0, -4(%rbp)
            je      .L4
            /* inside of if block */
    .L4:
            /* rest of main() */
    

    Here, the return value of foo is left in the EAX register and then written to a slot on the stack frame (the local variable t). Then the contents of that stack location are compared against the literal 0, and if they are equal, execution jumps over the body of the if block.

    And so if we assume further that the processor is not doing any "optimizations" of its own when it decodes these instructions, then the version without the temporary should be a few clock cycles faster. It's not going to be substantially faster: even though the version with a temporary involves a store to the stack, that stack value is almost certainly still going to be in the processor's L1 cache when the comparison instruction executes immediately afterwards, so there's not going to be a round trip to RAM.
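
    If you want to see just how small the difference is, here is a rough micro-benchmark sketch. It assumes a POSIX clock_gettime, a hypothetical foo that returns rand() as above, and compilation with -O0 to match the discussion; in practice the cost of the call to rand() dominates, so any gap between the two forms disappears into the noise, which is consistent with the point above.

        /* Rough micro-benchmark sketch (assumes POSIX clock_gettime);
           compile with gcc -O0 so neither form is optimized away. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        int foo(void) { return rand(); }    /* hypothetical foo, as above */

        static double now_sec(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec + ts.tv_nsec / 1e9;
        }

        int main(void)
        {
            enum { N = 50000000 };
            long hits = 0;

            double t0 = now_sec();
            for (long i = 0; i < N; i++) {
                if (foo())                  /* [A]: test the result directly */
                    hits++;
            }
            double t1 = now_sec();
            for (long i = 0; i < N; i++) {
                int t = foo();              /* [B]: go through a temporary */
                if (t)
                    hits++;
            }
            double t2 = now_sec();

            printf("[A] %.3f s   [B] %.3f s   (hits=%ld)\n",
                   t1 - t0, t2 - t1, hits);
            return 0;
        }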

    Of course the code becomes identical as soon as you turn on any optimization level, even -O1, and who compiles anything that is so critical that they care about a handful of clock cycles with all optimizations off?

    Edit: With regard to your further information about your hardware engineer friend, I can't see how accessing a value in the L1 cache would ever be faster than accessing a register directly. I could see it being just about as fast if the value never even leaves the pipeline, but I can't see it being faster, especially since it still has to execute the movl instruction in addition to the comparison. But show him the assembly code above and ask what he thinks; it will be more productive than trying to discuss the problem in terms of C.