I need to write a program that reads an image into an array, performs a convolution (a sharpening filter) on it, and saves it back to a PPM file. I wrote a standard algorithm:
unsigned char* imgBefore = malloc(height*(3*width)*sizeof(unsigned char));
size_t rowsRead = fread(imgBefore, 3*width, height, inputFile);
assert(rowsRead == (size_t)height); // don't call fread inside assert: the read disappears under NDEBUG
unsigned char* imgAfter = malloc(height*(3*width)*sizeof(unsigned char));
short ker[3][3] = {{0, -1, 0}, {-1, 5, -1}, {0, -1, 0}};
unsigned char r, g, b;
for(int y = 0; y < height; y++) {
for(int x = 0; x < width; x++) {
if(y == 0 || y == height-1 || x == 0 || x == width-1) {
r = imgBefore[3*(width*y + x) + 0];
g = imgBefore[3*(width*y + x) + 1];
b = imgBefore[3*(width*y + x) + 2];
imgAfter[3*(width*y + x) + 0] = r;
imgAfter[3*(width*y + x) + 1] = g;
imgAfter[3*(width*y + x) + 2] = b;
continue;
}
int rSum = 0, gSum = 0, bSum = 0, val;
for(int dy = 0; dy < 3; dy++) { // kernel height
int yy = 3*width*(y+dy-1);
for(int dx = 0; dx < 3; dx++) { // kernel width
int xx = 3*(x+dx-1);
val = ker[dy][dx];
rSum += val * imgBefore[yy + xx + 0];
gSum += val * imgBefore[yy + xx + 1];
bSum += val * imgBefore[yy + xx + 2];
}
}
rSum = rSum < 0 ? 0 : (rSum > 255 ? 255 : rSum);
gSum = gSum < 0 ? 0 : (gSum > 255 ? 255 : gSum);
bSum = bSum < 0 ? 0 : (bSum > 255 ? 255 : bSum);
imgAfter[3*(width*y + x) + 0] = rSum;
imgAfter[3*(width*y + x) + 1] = gSum;
imgAfter[3*(width*y + x) + 2] = bSum;
}
}
fwrite(imgAfter, 3*width, height, outputFile);
Next, I need to optimize how it interacts with the cache. It seems to me that the problematic part is this piece of code:
for(int dy = 0; dy < 3; dy++) { // kernel height
int yy = 3*width*(y+dy-1);
for(int dx = 0; dx < 3; dx++) { // kernel width
int xx = 3*(x+dx-1);
val = ker[dy][dx];
rSum += val * imgBefore[yy + xx + 0];
gSum += val * imgBefore[yy + xx + 1];
bSum += val * imgBefore[yy + xx + 2];
}
}
because it first loads one row of the image, uses only 3 pixels (9 bytes) of it, and then moves on to the next row. This seems completely cache-inefficient.
What can I do to fix this?
I also tried reusing individual pixels and entire rows, but all of it only made the results worse (quite possibly I implemented the FIFO-like structures poorly, or pushed/read them in the wrong places). If the program needs such a structure, what should it look like? For evaluation I use:
valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=1048576,16,64 ./a.out
I would be grateful for any advice
The way you access the data is not cache-friendly for big images. This can be improved (for example with tiling).
Be aware, though, that the computation time of this program is not bounded by the memory hierarchy, but by slow sequential scalar computation caused by an inefficient data layout.
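To illustrate the data-layout point, here is a hedged sketch (my own illustration, not code from the question): de-interleaving the RGB channels into three planar arrays gives each channel unit-stride access, which compilers can auto-vectorize far more easily than 3-byte interleaved accesses.

```c
#include <assert.h>

/* Hypothetical helper (not in the original code): de-interleave an
 * RGBRGB... buffer into three planar arrays. With planar data each
 * channel is read with unit stride, so the compiler can vectorize the
 * per-channel kernel loops instead of doing strided scalar loads. */
static void deinterleave_rgb(const unsigned char *src,
                             unsigned char *r, unsigned char *g,
                             unsigned char *b, int pixels) {
    for (int i = 0; i < pixels; i++) {
        r[i] = src[3*i + 0];
        g[i] = src[3*i + 1];
        b[i] = src[3*i + 2];
    }
}
```

The conversion costs one extra pass over the image, so it only pays off if the stencil (or several filters) then runs over the planar data.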
Here is the Valgrind analysis of 10 stencil function calls on a 4608 x 3456 x 3 image:
--171871-- warning: L3 cache found, using its data for the LL simulation.
--171871-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--171871-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
==171871==
==171871== I refs: 30,437,416,378
==171871== I1 misses: 944
==171871== LLi misses: 938
==171871== I1 miss rate: 0.00%
==171871== LLi miss rate: 0.00%
==171871==
==171871== D refs: 7,526,412,742 (7,000,797,573 rd + 525,615,169 wr)
==171871== D1 misses: 30,599,898 ( 22,387,768 rd + 8,212,130 wr)
==171871== LLd misses: 15,679,220 ( 7,467,138 rd + 8,212,082 wr)
==171871== D1 miss rate: 0.4% ( 0.3% + 1.6% )
==171871== LLd miss rate: 0.2% ( 0.1% + 1.6% )
==171871==
==171871== LL refs: 30,600,842 ( 22,388,712 rd + 8,212,130 wr)
==171871== LL misses: 15,680,158 ( 7,468,076 rd + 8,212,082 wr)
==171871== LL miss rate: 0.0% ( 0.0% + 1.6% )
First of all, we can see that there are a lot of D1 cache references (47 per written pixel, due to scalar accesses). The number of D1 misses and the D1 miss rate seem very small, so we might think the cache accesses are good. But that is not the case: the Valgrind results are misleading here, because D1 cache references are counted at the 1-byte scalar granularity, while D1 misses are counted at the 64-byte cache-line granularity.
If we look at the analysis more carefully, we can see that almost all written data cause D1 cache misses. Moreover, 20% of the data read from L1 has to be (re)loaded from the LLC (of the 7 GB read from the D1 cache, 1.43 GB are loaded from the LLC).
The problem comes from the long lines of the image. Indeed, one image line takes 4608 * 3 = 13824 bytes, and 3 input lines are needed to write one output line. Thus, 41472 bytes are read from the D1 cache to write 13824 bytes. Theoretically, with a big enough D1 cache, 27648 of these bytes could be reused from one output line to the next. However, the D1 cache is too small to keep all the read/written data (roughly because 41472 + 13824 > 32768).
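This working-set arithmetic can be sketched as a quick check (the widths and the simulated 32 KB D1 size are the ones used above):

```c
#include <assert.h>

/* Quick check of the working-set arithmetic: does one output row plus the
 * three input rows needed to produce it fit in a D1 cache of d1Size bytes? */
static int fits_in_d1(int width, int d1Size) {
    int rowBytes = 3 * width;     /* e.g. 4608 * 3 = 13824 bytes per line  */
    int readSet  = 3 * rowBytes;  /* 3 input lines, e.g. 41472 bytes       */
    return readSet + rowBytes <= d1Size;
}
```

Full-width rows thrash the D1, while 1024-pixel-wide blocks fit comfortably, which is exactly why the tiling example uses a 1024-pixel block width.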
We can improve the situation using two methods:
The first method is tiling: truncate the x loop bound to a small fixed size (eg. 1024) so that the loops compute vertical blocks of the image, and then add a new encompassing xBlock loop (enclosing the y and x loops) which iterates over the vertical blocks. This approach has the benefit of improving D1 cache reuse, but may still cause some unnecessary cache misses on modern processors (since the loaded lines are no longer contiguous). You can mitigate this by also tiling the y dimension or by adapting the data layout. Here is an example:
const int blockSize = 1024;
for(int xBlock = 0; xBlock < width; xBlock+=blockSize) {
for(int y = 0; y < height; y++) {
int xMax = xBlock + blockSize;
if(xMax > width)
xMax = width;
for(int x = xBlock; x < xMax; x++) {
// Unchanged part of the code
}
}
}
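Tiling the y dimension as well can be combined with the xBlock loop. Here is a hedged sketch (the block sizes are illustrative, and visit() stands in for the unchanged per-pixel body, i.e. the border copy and the 3x3 kernel):

```c
#include <assert.h>

/* Sketch: tiling both y and x, generalizing the xBlock loop above.
 * Block sizes are illustrative; tune them so that roughly 3 partial input
 * rows plus one partial output row fit in the 32 KB D1 cache. */
static void stencil_tiled(int width, int height,
                          void (*visit)(int x, int y, void *ctx), void *ctx) {
    const int xBlockSize = 1024;
    const int yBlockSize = 64;
    for (int yBlock = 0; yBlock < height; yBlock += yBlockSize) {
        int yMax = yBlock + yBlockSize < height ? yBlock + yBlockSize : height;
        for (int xBlock = 0; xBlock < width; xBlock += xBlockSize) {
            int xMax = xBlock + xBlockSize < width ? xBlock + xBlockSize : width;
            for (int y = yBlock; y < yMax; y++)
                for (int x = xBlock; x < xMax; x++)
                    visit(x, y, ctx); /* unchanged per-pixel work */
        }
    }
}

/* Example visitor: counts processed pixels. */
static void count_visit(int x, int y, void *ctx) {
    (void)x; (void)y;
    ++*(long *)ctx;
}
```

Each pixel is still visited exactly once; only the iteration order changes, so the output image is identical to the untiled version.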
The second method is to use non-temporal (streaming) stores so that the written pixels do not pollute the D1 cache; some compilers let you request them with #pragma directives (doing that manually is a bit tricky in your code). This is good only if the image is too big to fit in the LLC cache or is not reused later. See here for more information. Here is the Valgrind result after applying the tiling method:
==183432== D refs: 7,527,450,132 (7,001,731,093 rd + 525,719,039 wr)
==183432== D1 misses: 16,025,288 ( 7,640,368 rd + 8,384,920 wr)
==183432== LLd misses: 16,024,790 ( 7,639,918 rd + 8,384,872 wr)
We can see that there are now 2 times fewer D1 cache misses (and 3 times fewer read misses).
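Keeping the written image out of the D1 cache can also be done by hand with SSE2 non-temporal stores. Below is a hedged sketch (my own illustration, assuming an x86 target; the function name is hypothetical): compute each output row into a small cache-resident buffer, then flush it to the final image with streaming stores.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stdint.h>
#include <string.h>

/* Sketch: copy one finished output row with non-temporal stores so the
 * written image bypasses the D1 cache instead of evicting useful lines.
 * Intended usage: compute the row into a small hot rowBuf, then call
 * store_row_nontemporal(imgAfter + 3*width*y, rowBuf, 3*width). */
static void store_row_nontemporal(unsigned char *dst,
                                  const unsigned char *src, size_t n) {
    /* Scalar head until dst is 16-byte aligned (required by streaming stores). */
    while (n > 0 && ((uintptr_t)dst & 15) != 0) {
        *dst++ = *src++;
        n--;
    }
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v;
        memcpy(&v, src + i, 16);                   /* unaligned 16-byte load */
        _mm_stream_si128((__m128i *)(dst + i), v); /* cache-bypassing store  */
    }
    for (; i < n; i++)  /* scalar tail */
        dst[i] = src[i];
    _mm_sfence();       /* make the streaming stores globally visible */
}
```

As with any such change, measure with cachegrind before and after: streaming stores only pay off when the output does not fit in the LLC or is not read again soon.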