cperformancematrixcalloc

Allocating matrix performances


I have two scenarios, in both i allocate 78*2 sizeof(int) of memory and initialize it to 0. Are there any differences regards performances?

Scenario A:

int ** v = calloc(2 , sizeof(int*));

    for (i=0; i<2; ++i)
    {
        v[i] = calloc(78, sizeof(int));
    }

Scenario B:

int ** v = calloc(78 , sizeof(int*));

    for (i=0; i<78; ++i)
    {
        v[i] = calloc(2, sizeof(int));
    }

I supposed that in performance terms, it's better to use a calloc if an initialize array is needed, let me know if I'm wrong


Solution

  • First, discussing optimization abstractly has some difficulties because compilers are becoming increasingly better at optimization. (For some reason, compiler developers will not stop improving them.) We do not always know what machine code given source code will produce, especially when we write source code today and expect it to be used for many years to come. Optimization may consolidate multiple steps into one or may omit unnecessary steps (such as clearing memory with calloc instead of malloc immediately before the memory is completely overwritten in a for loop). There is a growing difference between what source code nominally says (“Do these specific steps in this specific order”) and what it technically says in the language abstraction (“Compute the same results as this source code in some optimized fashion”).

    However, we can generally figure that writing source code without unnecessary steps is at least as good as writing source code with unnecessary steps. With that in mind, let’s consider the nominal steps in your scenarios.

    In Scenario A, we tell the computer:

    In Scenario B, we tell the computer:

    We can easily see two things:

    However, both scenarios allude to another issue. Presumably, the so-called array will be used later in the program, likely in a form like v[i][j]. Using this as a value means:

    Let’s consider a different way to define v: int (*v)[78] = malloc(2 * sizeof *v);.

    This says:

    Immediately we see that involves fewer steps than Scenario A or Scenario B. But also look at what it does to the steps for using v[i][j] as a value. Because v is a pointer to an array instead of a pointer to a pointer, the computer can calculate where the appropriate element is instead of having to load an address from memory:

    So this pointer-to-array version is one step fewer than the pointer-to-pointer version.

    Further, the pointer-to-pointer version requires an additional fetch from memory for each use of v[i][j]. Fetches from memory can be expensive relative to in-processor operations like multiplying and adding, so it is a good step to eliminate. Having to fetch a pointer can prevent a processor from predicting where the next load from memory might be based on recent patterns of use. Additionally, the pointer-to-array version puts all the elements of the 2×78 array together in memory, which can benefit the cache performance. Processors are also designed for efficient use of consecutive memory. With the pointer-to-pointer version, the separate allocations typically wind up with at least some separation between the rows and may have a lot of separation, which can break the benefits of consecutive memory use.