Local vs. global variable performance in an example


I am studying the performance of C code on Windows 10. I unexpectedly found that the same code runs more than 2x faster with local variables than with global variables (about 3.5x in the results below), and I can't understand why. I tested with both Visual Studio and GCC and saw the same performance difference. The test results:

fun_b
4254600908
Duration:  5.168000 sec
fun_a
4254600908
Duration:  18.455000 sec

The sample code:

#include <stdio.h>
#include <time.h>
unsigned int result;
unsigned int mask;
unsigned int i;
unsigned int fun_a(unsigned int value)
{
    result = 0;
    mask = 1;
    for (i = 0; i < 32; i++)
    {
        if (value & mask) result++;
        mask <<= 1;
    }
    return result;
}
unsigned int fun_b(unsigned int value)
{
    unsigned int result = 0;
    unsigned int mask = 1;
    for (unsigned int i = 0; i < 32; i++)
    {
        if (value & mask) result++;
        mask <<= 1;
    }
    return result;
}
int main(void)
{
    unsigned int N = 0x12345678;
    clock_t startClock, finishClock;
    double f;
    unsigned int A;
    
    ////////////////////////////////
    printf("fun_b\n");
    startClock = clock();
    A = 0;
    for (unsigned int i = 0; i < N; i++)
    {
        A += fun_b(i);
    }
    finishClock = clock();
    printf("%u\n", A);
    f = (double)(finishClock - startClock) / (double)(CLOCKS_PER_SEC);
    printf("Duration:  %f sec\n", f);

    ////////////////////////////////
    printf("fun_a\n");
    startClock = clock();
    A = 0;
    for (unsigned int i = 0; i < N; i++)
    {
        A += fun_a(i);
    }
    finishClock = clock();
    printf("%u\n", A);
    f = (double)(finishClock - startClock) / (double)(CLOCKS_PER_SEC);
    printf("Duration:  %f sec\n", f);
}

I expected fun_a and fun_b to take the same time to execute, and I can't understand why fun_b runs more than twice as fast.


Solution

  • When you look at the disassembly (gcc, -O3):

    fun_b(unsigned int):
            mov     edx, 32
            mov     eax, 1
            xor     ecx, ecx
    .L15:
            mov     esi, edi
            and     esi, eax
            cmp     esi, 1
            sbb     ecx, -1
            add     eax, eax
            sub     edx, 1
            jne     .L15
            mov     eax, ecx
            ret
    

    fun_b works only with registers and has no memory accesses at all. This is reasonably fast.

    fun_a also has to populate the global variables with the right values, because they are part of the function's observable side effects.

    fun_a(unsigned int):
            mov     eax, 32
            xor     esi, esi
            xor     ecx, ecx
            mov     edx, 1
            mov     DWORD PTR result[rip], 0
    .L3:
            test    edx, edi
            je      .L2
            add     ecx, 1
            mov     esi, 1
    .L2:
            add     edx, edx
            sub     eax, 1
            jne     .L3
            mov     DWORD PTR mask[rip], edx
            mov     DWORD PTR i[rip], 32
            test    sil, sil
            je      .L1
            mov     DWORD PTR result[rip], ecx
            mov     eax, ecx
    .L1:
            ret
    

    That extra work is what makes fun_a that much slower for a small function like this. Note that the compiler only stores the final values: you will not be able to observe i and mask changing during the loop. Even so, it makes that much of a difference.
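Read back into C, the fun_a listing above behaves roughly like the sketch below. This is a hand-written approximation, not compiler output: the name fun_a_as_compiled, its local variable names, and the check_equivalent helper are hypothetical, the globals are redeclared so the snippet stands alone, and a 32-bit unsigned int is assumed.

```c
#include <assert.h>

/* Same globals as in the question, redeclared so this compiles on its own. */
unsigned int result;
unsigned int mask;
unsigned int i;

/* Original version from the question: every variable is a global. */
unsigned int fun_a(unsigned int value)
{
    result = 0;
    mask = 1;
    for (i = 0; i < 32; i++)
    {
        if (value & mask) result++;
        mask <<= 1;
    }
    return result;
}

/* Hypothetical hand-written equivalent of the optimized listing: the loop
   works on locals (registers) and the globals get their final values only. */
unsigned int fun_a_as_compiled(unsigned int value)
{
    unsigned int count = 0;          /* ecx in the listing */
    unsigned int m = 1;              /* edx in the listing */

    result = 0;                      /* store emitted before the loop */
    for (unsigned int n = 32; n != 0; n--)
    {
        if (value & m) count++;
        m <<= 1;                     /* 0 after 32 shifts (32-bit unsigned) */
    }
    mask = m;                        /* final value only */
    i = 32;                          /* final value only */
    if (count != 0) result = count;  /* store skipped when no bit was set */
    return count;
}

/* Hypothetical sanity check: both versions return the same count and leave
   the same final values in the globals. */
void check_equivalent(unsigned int value)
{
    unsigned int expected = fun_a(value);
    unsigned int r = result, m = mask, n = i;
    assert(fun_a_as_compiled(value) == expected);
    assert(result == r && mask == m && i == n);
}
```

Even in this form the function still performs three or four stores to memory per call, which fun_b never needs.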

    Note: it is actually slightly more complicated, because the compiler has inlined the functions and is considering the entire sum at once. Having to compute the values for the global variables has prevented it from optimizing the fun_a loop as well as it otherwise could.

    If you want to benchmark the functions themselves, you would have to prevent inlining, e.g. by calling them through function pointers like this: https://godbolt.org/z/v58bqveTc. Then the above explanation applies, and the difference becomes even greater somehow.
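One way to call the functions through a pointer is sketched below. It is not necessarily what the linked code does: the helper names bit_counter, sum_via_pointer and time_run are hypothetical, and the sketch assumes that the compiler will not inline a call made through a volatile function pointer, which holds in practice for GCC and MSVC but is not something the C standard promises.

```c
#include <time.h>

/* Local-variable version from the question. */
unsigned int fun_b(unsigned int value)
{
    unsigned int result = 0;
    unsigned int mask = 1;
    for (unsigned int i = 0; i < 32; i++)
    {
        if (value & mask) result++;
        mask <<= 1;
    }
    return result;
}

/* Hypothetical alias for the signature shared by fun_a and fun_b. */
typedef unsigned int (*bit_counter)(unsigned int);

/* Hypothetical helper: sums f(0) .. f(n - 1). The pointer is volatile, so
   the compiler has to reload it before every call and cannot assume which
   function it points to; in practice this keeps the call from being inlined. */
unsigned int sum_via_pointer(bit_counter f, unsigned int n)
{
    bit_counter volatile call = f;
    unsigned int sum = 0;
    for (unsigned int k = 0; k < n; k++)
    {
        sum += call(k);
    }
    return sum;
}

/* Hypothetical helper: runs one measurement and returns the CPU time in
   seconds, using clock() as the question does. The sum is handed back so
   the work cannot be discarded as unused. */
double time_run(bit_counter f, unsigned int n, unsigned int *sum_out)
{
    clock_t start = clock();
    *sum_out = sum_via_pointer(f, n);
    clock_t finish = clock();
    return (double)(finish - start) / (double)CLOCKS_PER_SEC;
}
```

Calling time_run(fun_b, 0x12345678, &sum) and then the same with fun_a would repeat the question's measurement with each function called out of line.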