I study performance of C code. OS is Windows 10. Suddenly I found that the same code runs 2x faster if I use local variables against the same code with global variables. I can't understand why. Tested both in Visual Studio and GCC. Same performance difference. The test results:
fun_b
4254600908
Duration: 5.168000 sec
fun_a
4254600908
Duration: 18.455000 sec
The sample code:
#include <stdio.h>
#include <time.h>
unsigned int result;
unsigned int mask;
unsigned int i;
unsigned int fun_a(unsigned int value)
{
result = 0;
mask = 1;
for (i = 0; i < 32; i++)
{
if (value & mask) result++;
mask <<= 1;
}
return result;
}
unsigned int fun_b(unsigned int value)
{
unsigned int result = 0;
unsigned int mask = 1;
for (unsigned int i = 0; i < 32; i++)
{
if (value & mask) result++;
mask <<= 1;
}
return result;
}
int main(void)
{
unsigned int N = 0x12345678;
clock_t startClock, finishClock;
double f;
unsigned int A;
////////////////////////////////
printf("fun_b\n");
startClock = clock();
A = 0;
for (unsigned int i = 0; i < N; i++)
{
A += fun_b(i);
}
finishClock = clock();
printf("%u\n", A);
f = (double)(finishClock - startClock) / (double)(CLOCKS_PER_SEC);
printf("Duration: %f sec\n", f);
////////////////////////////////
printf("fun_a\n");
startClock = clock();
A = 0;
for (unsigned int i = 0; i < N; i++)
{
A += fun_a(i);
}
finishClock = clock();
printf("%u\n", A);
f = (double)(finishClock - startClock) / (double)(CLOCKS_PER_SEC);
printf("Duration: %f sec\n", f);
}
I expected fun_a and fun_b should take the same time to execute and I can't understand why fun_b runs twice as fast.
When you look at the disassembly (gcc, -O3):
fun_b(unsigned int):
mov edx, 32
mov eax, 1
xor ecx, ecx
.L15:
mov esi, edi
and esi, eax
cmp esi, 1
sbb ecx, -1
add eax, eax
sub edx, 1
jne .L15
mov eax, ecx
ret
fun_b
just does stuff with registers. This is reasonably fast.
fun_a
has to also populate the global variables with the right values because they are part of the function's observable side effects.
fun_a(unsigned int):
mov eax, 32
xor esi, esi
xor ecx, ecx
mov edx, 1
mov DWORD PTR result[rip], 0
.L3:
test edx, edi
je .L2
add ecx, 1
mov esi, 1
.L2:
add edx, edx
sub eax, 1
jne .L3
mov DWORD PTR mask[rip], edx
mov DWORD PTR i[rip], 32
test sil, sil
je .L1
mov DWORD PTR result[rip], ecx
mov eax, ecx
.L1:
ret
which is that much slower for a small function like this. Note: it's only doing the final values, you will not be able to observe i
and mask
changing during the loop. And it's still that much of a difference.
Note: actually it is slightly more complicated as the compiler has inlined the functions and is considering the entire sum at once. Having to compute the values for the global variables has prevented it from optimizing it as well as it could.
If you want to benchmark the functions themselves, you would have to prevent inlining by e.g. calling them through function pointers like this: https://godbolt.org/z/v58bqveTc -- then the above explanation applies and the difference becomes even greater somehow.