Local variables performance vs global in example

I study performance of C code. OS is Windows 10. Suddenly I found that the same code runs 2x faster if I use local variables against the same code with global variables. I can't understand why. Tested both in Visual Studio and GCC. Same performance difference. The test results:

fun_b
4254600908
Duration:  5.168000 sec
fun_a
4254600908
Duration:  18.455000 sec

The sample code:

#include <stdio.h>
#include <time.h>
unsigned int result;
unsigned int mask;
unsigned int i;
unsigned int fun_a(unsigned int value)
{
    result = 0;
    mask = 1;
    for (i = 0; i < 32; i++)
    {
        if (value & mask) result++;
        mask <<= 1;
    }
    return result;
}
unsigned int fun_b(unsigned int value)
{
    unsigned int result = 0;
    unsigned int mask = 1;
    for (unsigned int i = 0; i < 32; i++)
    {
        if (value & mask) result++;
        mask <<= 1;
    }
    return result;
}
int main(void)
{
    unsigned int N = 0x12345678;
    clock_t startClock, finishClock;
    double f;
    unsigned int A;
    
    ////////////////////////////////
    printf("fun_b\n");
    startClock = clock();
    A = 0;
    for (unsigned int i = 0; i < N; i++)
    {
        A += fun_b(i);
    }
    finishClock = clock();
    printf("%u\n", A);
    f = (double)(finishClock - startClock) / (double)(CLOCKS_PER_SEC);
    printf("Duration:  %f sec\n", f);

    ////////////////////////////////
    printf("fun_a\n");
    startClock = clock();
    A = 0;
    for (unsigned int i = 0; i < N; i++)
    {
        A += fun_a(i);
    }
    finishClock = clock();
    printf("%u\n", A);
    f = (double)(finishClock - startClock) / (double)(CLOCKS_PER_SEC);
    printf("Duration:  %f sec\n", f);
}

I expected fun_a and fun_b should take the same time to execute and I can't understand why fun_b runs twice as fast.

Solution

When you look at the disassembly (gcc, -O3):

fun_b(unsigned int):
        mov     edx, 32
        mov     eax, 1
        xor     ecx, ecx
.L15:
        mov     esi, edi
        and     esi, eax
        cmp     esi, 1
        sbb     ecx, -1
        add     eax, eax
        sub     edx, 1
        jne     .L15
        mov     eax, ecx
        ret

fun_b just does stuff with registers. This is reasonably fast.

fun_a has to also populate the global variables with the right values because they are part of the function's observable side effects.

fun_a(unsigned int):
        mov     eax, 32
        xor     esi, esi
        xor     ecx, ecx
        mov     edx, 1
        mov     DWORD PTR result[rip], 0
.L3:
        test    edx, edi
        je      .L2
        add     ecx, 1
        mov     esi, 1
.L2:
        add     edx, edx
        sub     eax, 1
        jne     .L3
        mov     DWORD PTR mask[rip], edx
        mov     DWORD PTR i[rip], 32
        test    sil, sil
        je      .L1
        mov     DWORD PTR result[rip], ecx
        mov     eax, ecx
.L1:
        ret

which is that much slower for a small function like this. Note: it's only doing the final values, you will not be able to observe i and mask changing during the loop. And it's still that much of a difference.

Note: actually it is slightly more complicated as the compiler has inlined the functions and is considering the entire sum at once. Having to compute the values for the global variables has prevented it from optimizing it as well as it could.

If you want to benchmark the functions themselves, you would have to prevent inlining by e.g. calling them through function pointers like this: https://godbolt.org/z/v58bqveTc -- then the above explanation applies and the difference becomes even greater somehow.