I have the following working program which produces correct results; however, I am confused by some of the statistics it reports. The setup is as follows:
The code is:
void MatrixMultiply(in1, in2, out, myRank, nRanks)
{
    for (int i = startingRow; i < endingRow; i++) // rows of first matrix, divided on a per-process basis
    {
        for (int j = 0; j < M1Cdim; j++) // no. of cols of first matrix
        {
            int temp = 0;
            for (int k = 0; k < M2Rdim; k++) // no. of rows of second matrix
            {
                temp += in1[i*M1Rdim+k] * in2[k*M2Rdim+j]; // out(i,j) += in1(i,k) * in2(k,j)
            }
            out[i*M1Rdim+j] = temp;
        }
    }
}
int main()
{
    for (int i = 1; i <= 10; i++)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        const double t0 = omp_get_wtime();
        MatrixMultiply(img_in1, img_in2, img_out, myRank, nRanks);
        MPI_Barrier(MPI_COMM_WORLD);
        const double t1 = omp_get_wtime();
        const double ts   = t1 - t0;   // time in seconds
        const double tms  = ts*1.0e3;  // time in milliseconds
        const double gbps = double(sizeof(P)*2*img_in1.height*img_in1.width*img_in2.height)*1e-9/ts; // bandwidth in GB/s
        const double fpps = double(2*img_in1.height*img_in1.width*img_in2.height)*1e-9/ts;           // performance in GFLOP/s
        if (myRank == 0)
        {
            printf("%5d %15.3f %15.3f %15.3f\n", i, tms, gbps, fpps);
        }
    }
}
The statistics I am getting are:
Step    Time, ms       GB/s    GFLOP/s
   1       2.306    116.408    116.408
   2       2.334    115.017    115.017
   3       2.297    116.855    116.855
   4       2.295    116.964    116.964
   5      16.692     16.082     16.082
   6      11.468     23.407     23.407
   7       2.299    116.758    116.758
   8       2.291    117.171    117.171
   9       2.295    116.964    116.964
  10      10.792     24.874     24.874
So my question is:
Why are iterations 5, 6, and 10 showing worse results than the other iterations?
My suspicion is that even though the data is placed in high-bandwidth memory (MCDRAM), the code itself executes from cache and might be taking a hit there. Although the overall program is quite small (about 54 KB), if it is running on a shared server, its instructions might occasionally be evicted from the instruction cache, resulting in this performance penalty on some iterations.
This matrix-multiplication code is very inefficient!
Indeed, the line in2[k*M2Rdim+j]
is likely to cause cache thrashing, and thus high instability in the computation timing, if cache lines often have to be reloaded from MCDRAM. Although MCDRAM has high bandwidth, it also has high latency (similar to that of DDR RAM). The latency is probably a huge issue in this case.
Specifically, striding down one column of a matrix is terrible for spatial locality, and it is even worse when the matrix dimension is a power of 2: all those cache lines alias to the same set in a set-associative cache, so you are likely to get conflict misses. This can lead to cache thrashing even with a small working set.
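For instance, simply interchanging the two innermost loops (i-k-j order) makes every access to in2 and out unit-stride. This is only a sketch based on the variable names from your code (I assume the matrices are square, so the dimension macros are interchangeable, as they appear to be in your indexing):

void MatrixMultiplyIKJ(P *in1, P *in2, P *out, int myRank, int nRanks)
{
    for (int i = startingRow; i < endingRow; i++)
    {
        for (int j = 0; j < M1Cdim; j++)
            out[i*M1Rdim+j] = 0;                         // clear the output row first
        for (int k = 0; k < M2Rdim; k++)
        {
            const P a = in1[i*M1Rdim+k];                 // reused across the whole inner loop
            for (int j = 0; j < M1Cdim; j++)
                out[i*M1Rdim+j] += a * in2[k*M2Rdim+j];  // unit-stride reads of in2, unit-stride writes of out
        }
    }
}

The column-wise walk over in2 disappears, so the power-of-2 aliasing problem is largely avoided as well.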
Thus, please use BLAS functions (from MKL, OpenBLAS, ATLAS, etc.)! They are far more optimized than this. If you cannot, please consider improving this code yourself; you can find a quite good explanation of how to do that here. I think a speed-up of more than 10x is easily achievable.
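For reference, the BLAS route boils down to a single call. Here is a minimal, hypothetical sketch using the CBLAS interface (the names matmul_blas, A, B, C, N and the assumption of double-precision, row-major, square matrices are mine, not from your code):

#include <mkl.h>   // Intel MKL; with OpenBLAS/ATLAS include <cblas.h> instead

// C = A * B for row-major N x N matrices of doubles
void matmul_blas(int N, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0, A, N,
                     B, N,
                0.0, C, N);
}

You could still split the rows among MPI ranks exactly as you do now and call this on each rank's block, passing the local row count as the first dimension instead of N.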
I also advise you to profile your code using tools like perf or VTune, which let you analyze hardware events (such as L1/L2 cache operations), confirm or reject the cache-thrashing hypothesis, and guide further improvements to this code.
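For example, something along these lines with perf (your_program is a placeholder; the exact event names vary by CPU, and perf list shows what is available on your machine):

perf stat -e cache-references,cache-misses ./your_program

A high cache-miss ratio would support the thrashing hypothesis; VTune's memory-access analysis gives a similar picture with more detail.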