Tags: c, caching, optimization, convolution, ppm

Optimization of the algorithm when working with images in C (image convolution)


I need to write a program that reads an image into an array, performs a convolution (sharpening), and saves it back to a PPM file. I wrote a straightforward algorithm:

    unsigned char* imgBefore = malloc(height*(3*width)*sizeof(unsigned char));
    assert(fread(imgBefore, 3*width, height, inputFile) == height);
    unsigned char* imgAfter = malloc(height*(3*width)*sizeof(unsigned char));

    short ker[3][3] = {{0, -1, 0}, {-1, 5, -1}, {0, -1, 0}};

    unsigned char r, g, b;

    for(int y = 0; y < height; y++) {
        for(int x = 0; x < width; x++) {
            if(y == 0 || y == height-1 || x == 0 || x == width-1) {
                r = imgBefore[3*(width*y + x) + 0];
                g = imgBefore[3*(width*y + x) + 1];
                b = imgBefore[3*(width*y + x) + 2];
                imgAfter[3*(width*y + x) + 0]  = r;
                imgAfter[3*(width*y + x) + 1]  = g;
                imgAfter[3*(width*y + x) + 2]  = b;
                continue;
            }

            int rSum = 0, gSum = 0, bSum = 0, val;

            for(int dy = 0; dy < 3; dy++) { // kernel height
                int yy = 3*width*(y+dy-1);
                for(int dx = 0; dx < 3; dx++) { // kernel width
                    int xx = 3*(x+dx-1);
                    val = ker[dy][dx];

                    rSum += val * imgBefore[yy + xx + 0];
                    gSum += val * imgBefore[yy + xx + 1];
                    bSum += val * imgBefore[yy + xx + 2];
                }
            }
            rSum = rSum < 0 ? 0 : (rSum > 255 ? 255 : rSum);
            gSum = gSum < 0 ? 0 : (gSum > 255 ? 255 : gSum);
            bSum = bSum < 0 ? 0 : (bSum > 255 ? 255 : bSum);

            imgAfter[3*(width*y + x) + 0] = rSum;
            imgAfter[3*(width*y + x) + 1] = gSum;
            imgAfter[3*(width*y + x) + 2] = bSum;
        }
    }

    fwrite(imgAfter, 3*width, height, outputFile);

Next, I need to optimize how it interacts with the cache. It seems to me that the problematic part is this piece of code:

for(int dy = 0; dy < 3; dy++) { // kernel height
    int yy = 3*width*(y+dy-1);
    for(int dx = 0; dx < 3; dx++) { // kernel width
        int xx = 3*(x+dx-1);
        val = ker[dy][dx];

        rSum += val * imgBefore[yy + xx + 0];
        gSum += val * imgBefore[yy + xx + 1];
        bSum += val * imgBefore[yy + xx + 2];
    }
}

because it first loads one row of the image, uses only 3 pixels (9 bytes) from it, and then moves on to the next row. This seems very inefficient with respect to the cache.

What can I do to fix this?

I also tried reusing individual pixels or entire rows. All of this only made the result worse (most likely because I implemented structures like a FIFO poorly, or added to and read from them in the wrong places). If the program does need something like that, what should it look like? For evaluation I use:

    valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=1048576,16,64 ./a.out

I would be grateful for any advice


Solution

  • The way you access data is not cache-friendly for big images. This can be improved (for example, using tiling).

    Be aware that the computation time of this program is not bound by the memory hierarchy, but by slow, sequential, scalar computation caused by an inefficient data layout.
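    One way to attack the data-layout issue (a sketch of my own, not part of the original code; the helper name `deinterleave_rgb` is hypothetical) is to split the interleaved R,G,B,R,G,B,... buffer into three planar channel arrays, so each channel can be read with contiguous unit-stride loads that compilers vectorize much more readily than the 3-byte interleaved stride:

```c
/* Hypothetical helper: split an interleaved RGB buffer (R,G,B,R,G,B,...)
 * into three planar channel arrays. A planar layout gives each channel
 * contiguous, unit-stride data, which is friendlier to auto-vectorization
 * than stepping through memory 3 bytes at a time. */
static void deinterleave_rgb(const unsigned char *src,
                             unsigned char *r, unsigned char *g,
                             unsigned char *b, int npixels)
{
    for (int i = 0; i < npixels; i++) {
        r[i] = src[3 * i + 0];
        g[i] = src[3 * i + 1];
        b[i] = src[3 * i + 2];
    }
}
```

    The convolution would then run once per plane, and the result would be re-interleaved before writing the PPM; the conversion itself is a cheap streaming pass.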

    Here is the Valgrind analysis result on 10 stencil function calls using a 4608 x 3456 x 3 image:

    --171871-- warning: L3 cache found, using its data for the LL simulation.
    --171871-- warning: specified LL cache: line_size 64  assoc 12  total_size 9,437,184
    --171871-- warning: simulated LL cache: line_size 64  assoc 18  total_size 9,437,184
    ==171871== 
    ==171871== I   refs:      30,437,416,378
    ==171871== I1  misses:               944
    ==171871== LLi misses:               938
    ==171871== I1  miss rate:           0.00%
    ==171871== LLi miss rate:           0.00%
    ==171871== 
    ==171871== D   refs:       7,526,412,742  (7,000,797,573 rd   + 525,615,169 wr)
    ==171871== D1  misses:        30,599,898  (   22,387,768 rd   +   8,212,130 wr)
    ==171871== LLd misses:        15,679,220  (    7,467,138 rd   +   8,212,082 wr)
    ==171871== D1  miss rate:            0.4% (          0.3%     +         1.6%  )
    ==171871== LLd miss rate:            0.2% (          0.1%     +         1.6%  )
    ==171871== 
    ==171871== LL refs:           30,600,842  (   22,388,712 rd   +   8,212,130 wr)
    ==171871== LL misses:         15,680,158  (    7,468,076 rd   +   8,212,082 wr)
    ==171871== LL miss rate:             0.0% (          0.0%     +         1.6%  )
    

    First of all, we can see that there are a lot of D1 cache references (47 per written pixel, due to scalar accesses).

    The number of D1 misses and the D1 miss rate seem very small, so we might think the cache accesses are fine. But this is not the case: the Valgrind numbers are misleading, because D1 references are counted at 1-byte scalar granularity while D1 misses are counted at 64-byte cache-line granularity.

    If we look at the analysis more carefully, we can see that almost all written data causes D1 cache misses. Moreover, 20% of the data read from the L1 needs to be (re)loaded from the LLC (of the 7 GB read from the D1 cache, 1.43 GB is loaded from the LLC).

    The problem comes from the long lines of the image. Indeed, one image line takes 4608 * 3 = 13824 bytes, and 3 input lines are needed to write one output line. Thus, 41472 bytes are read from the D1 cache to write 13824 bytes. Theoretically, with a big enough D1 cache, 27648 bytes could be reused from the D1 cache. However, the D1 cache is too small to hold all the read/written data (roughly because 41472 + 13824 > 32768).
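    The arithmetic above can be spelled out in a couple of lines (the width 4608 and the 32 KiB D1 size are taken from the question and the analysis; the function name is mine):

```c
/* Working set, in bytes, touched while producing one output line of a
 * width-pixel interleaved RGB image with a 3x3 kernel: three input
 * lines are read and one output line is written. */
static int working_set_bytes(int width)
{
    int line_bytes = 3 * width;   /* one interleaved RGB line        */
    return 3 * line_bytes         /* 3 input lines read              */
         + line_bytes;            /* 1 output line written           */
}
```

    For width = 4608 this gives 55296 bytes, comfortably larger than the 32768-byte simulated D1 cache, so lines are evicted before they can be reused.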

    We can improve the situation using tiling (loop blocking):

    const int blockSize = 1024;
    for(int xBlock = 0; xBlock < width; xBlock+=blockSize) {
        for(int y = 0; y < height; y++) {
            int xMax = xBlock + blockSize;
            if(xMax > width)
                xMax = width;
            for(int x = xBlock; x < xMax; x++) {
                // Unchanged part of the code
            }
        }
    }
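    For completeness, here is a self-contained sketch of the tiled loop with the stencil body filled in (the body is the original inner loop, unchanged; the function name `sharpen_tiled` and the tunable `blockSize` parameter are my own additions):

```c
/* Sharpen an interleaved RGB image with a 3x3 kernel, processing the
 * image in vertical strips blockSize pixels wide so that the working
 * set of each strip fits in the L1 data cache. Border pixels are
 * copied through unchanged. */
static void sharpen_tiled(const unsigned char *in, unsigned char *out,
                          int width, int height, int blockSize)
{
    static const short ker[3][3] = {{0, -1, 0}, {-1, 5, -1}, {0, -1, 0}};

    for (int xBlock = 0; xBlock < width; xBlock += blockSize) {
        int xMax = xBlock + blockSize;
        if (xMax > width)
            xMax = width;
        for (int y = 0; y < height; y++) {
            for (int x = xBlock; x < xMax; x++) {
                int idx = 3 * (width * y + x);
                if (y == 0 || y == height-1 || x == 0 || x == width-1) {
                    out[idx + 0] = in[idx + 0];     /* copy border pixel */
                    out[idx + 1] = in[idx + 1];
                    out[idx + 2] = in[idx + 2];
                    continue;
                }
                int rSum = 0, gSum = 0, bSum = 0;
                for (int dy = 0; dy < 3; dy++) {    /* kernel height */
                    int yy = 3 * width * (y + dy - 1);
                    for (int dx = 0; dx < 3; dx++) { /* kernel width */
                        int xx = 3 * (x + dx - 1);
                        int val = ker[dy][dx];
                        rSum += val * in[yy + xx + 0];
                        gSum += val * in[yy + xx + 1];
                        bSum += val * in[yy + xx + 2];
                    }
                }
                /* clamp to the 0..255 range of unsigned char */
                out[idx + 0] = rSum < 0 ? 0 : (rSum > 255 ? 255 : rSum);
                out[idx + 1] = gSum < 0 ? 0 : (gSum > 255 ? 255 : gSum);
                out[idx + 2] = bSum < 0 ? 0 : (bSum > 255 ? 255 : bSum);
            }
        }
    }
}
```

    A quick sanity check: on a constant image, the kernel weights sum to 1, so the output must equal the input; a single brighter pixel should be amplified by the center weight while its neighbors subtract from it.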
    

    Here is the Valgrind result after applying the tiling method:

    ==183432== D   refs:       7,527,450,132  (7,001,731,093 rd   + 525,719,039 wr)
    ==183432== D1  misses:        16,025,288  (    7,640,368 rd   +   8,384,920 wr)
    ==183432== LLd misses:        16,024,790  (    7,639,918 rd   +   8,384,872 wr)
    

    We can see that there are now half as many D1 cache misses (and about 3 times fewer read misses).