On my machine, Time A and Time B swap depending on whether A is defined or not (which changes the order in which the two callocs are called).
I initially attributed this to the paging system. Weirdly, when mmap is used instead of calloc, the situation is even more bizarre: both loops take the same amount of time, as expected. As can be seen with strace, the callocs ultimately result in two mmaps, so there is no return-already-allocated-memory magic going on.
I'm running Debian testing on an Intel i7.
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE 500002816

#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE, \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif

int main() {
    clock_t start, finish;
#ifdef A
    int *arr1 = ALLOC(sizeof(int), SIZE);
    int *arr2 = ALLOC(sizeof(int), SIZE);
#else
    int *arr2 = ALLOC(sizeof(int), SIZE);
    int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
    int i;

    start = clock();
    {
        for (i = 0; i < SIZE; i++)
            arr1[i] = (i + 13) * 5;
    }
    finish = clock();
    printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

    start = clock();
    {
        for (i = 0; i < SIZE; i++)
            arr2[i] = (i + 13) * 5;
    }
    finish = clock();
    printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

    return 0;
}
The output I get:
~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.94
Time B: 0.34
~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.34
Time B: 0.90
~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.89
Time B: 0.90
~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.91
Time B: 0.92
Short Answer
The first time calloc is called, it explicitly zeroes the memory. The next time it is called, it assumes that the memory returned from mmap is already zeroed.
Details
Here are some of the things I checked to reach this conclusion, which you can try yourself:
Insert a calloc call before your first ALLOC call. After this, Time A and Time B are the same.
Use the clock() function to check how long each of the ALLOC calls takes. When both use calloc, the first call takes much longer than the second.
Use time to measure the total run time of the calloc version and the USE_MMAP version. When I did this, the run time for USE_MMAP was consistently slightly less.
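The clock() check on the allocation calls could look roughly like the sketch below. The helper names timed_calloc and report_alloc_times are made up for illustration, and it times calloc directly rather than going through the ALLOC macro.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: CPU seconds spent inside a single calloc call.
 * The resulting pointer is stored in *out (NULL if allocation failed). */
static double timed_calloc(size_t nmemb, size_t size, void **out)
{
    clock_t start = clock();
    *out = calloc(nmemb, size);
    clock_t finish = clock();
    return (double)(finish - start) / CLOCKS_PER_SEC;
}

/* Time two back-to-back calloc calls of the same size and print both.
 * Returns 0 on success, -1 if either allocation failed. */
static int report_alloc_times(size_t nmemb)
{
    void *first = NULL, *second = NULL;
    double t1 = timed_calloc(nmemb, sizeof(int), &first);
    double t2 = timed_calloc(nmemb, sizeof(int), &second);
    int status = (first && second) ? 0 : -1;

    /* Printing the addresses keeps the compiler from discarding the calls. */
    printf("first calloc:  %.2f s (%p)\n", t1, first);
    printf("second calloc: %.2f s (%p)\n", t2, second);

    free(first);
    free(second);
    return status;
}
```

Calling report_alloc_times(500002816) from main uses the same element count as SIZE in the question.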
I ran with strace -tt -T, which shows both when each system call was made and how long it took. Here is part of the strace output:
21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014>
21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021>
21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>
You can see that the first mmap call took 0.000014 seconds, but about 1.65 seconds elapsed before the next system call. The second mmap call then took 0.000021 seconds and was followed by the times call roughly a hundred microseconds later.
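If you would rather not do the timestamp subtraction by hand, a small helper can do it. This is just a sketch: strace_ts_to_us and strace_gap_us are hypothetical names, and the parser assumes the HH:MM:SS.uuuuuu form with exactly six fractional digits, as in the output above.

```c
#include <stdio.h>

/* Hypothetical helper: convert a timestamp of the form "HH:MM:SS.uuuuuu"
 * (six fractional digits assumed) into microseconds since midnight.
 * Returns -1 if the string does not match that form. */
static long long strace_ts_to_us(const char *ts)
{
    int h, m, s, us;
    if (sscanf(ts, "%2d:%2d:%2d.%6d", &h, &m, &s, &us) != 4)
        return -1;
    return ((h * 60LL + m) * 60LL + s) * 1000000LL + us;
}

/* Microseconds elapsed between two timestamps from the same day.
 * Both strings are assumed to be well formed; no validation is done here. */
static long long strace_gap_us(const char *earlier, const char *later)
{
    return strace_ts_to_us(later) - strace_ts_to_us(earlier);
}
```

For the two mmap lines above, strace_gap_us("21:29:06.127536", "21:29:07.778442") gives 1650906 microseconds, and the gap between the second mmap and the times call is 121 microseconds.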
I also stepped through part of the execution with gdb and saw that the first call to calloc resulted in numerous calls to memset, while the second call to calloc made no calls to memset. You can see the source code for calloc in the glibc source (look for __libc_calloc) if you are interested. As for why calloc does the memset on the first call but not on subsequent ones, I don't know. But I feel fairly confident that this explains the behavior you asked about.
As for why the array that was zeroed with memset has better performance, my guess is that it is because of values being loaded into the TLB rather than the cache, since it is a very large array. Regardless, the specific reason for the performance difference you asked about is that the two calloc calls behave differently when they are executed.