I have the following working program which produces correct results; however, I am confused by some of the statistics it reports. The setup is as follows:
The code is:
void MatrixMultiply(in1, in2, out, myRank, nRanks)
{
    for (int i = startingRow; i < endingRow; i++) // rows of first matrix, divided on a per-process basis
    {
        for (int j = 0; j < M1Cdim; j++) // no. of cols of first matrix
        {
            int temp = 0;
            for (int k = 0; k < M2Rdim; k++) // no. of rows of second matrix
            {
                temp += in1[i*M1Rdim+k] * in2[k*M2Rdim+j]; // out(i,j) += in1(i,k) * in2(k,j)
            }
            out[i*M1Rdim+j] = temp;
        }
    }
}
int main()
{
    for (int i = 1; i <= 10; i++)
    {
        MPI_Barrier(MPI_COMM_WORLD);
        const double t0 = omp_get_wtime();
        MatrixMultiply(img_in1, img_in2, img_out, myRank, nRanks);
        MPI_Barrier(MPI_COMM_WORLD);
        const double t1 = omp_get_wtime();
        const double ts   = t1 - t0;   // time in seconds
        const double tms  = ts*1.0e3;  // time in milliseconds
        const double gbps = double(sizeof(P)*2*img_in1.height*img_in1.width*img_in2.height)*1e-9/ts; // bandwidth in GB/s
        const double fpps = double(2*img_in1.height*img_in1.width*img_in2.height)*1e-9/ts;           // performance in GFLOP/s
        if (myRank == 0)
        {
            printf("%5d %15.3f %15.3f %15.3f\n", i, tms, gbps, fpps);
        }
    }
}
The statistics I am getting are:
Step    Time, ms       GB/s    GFLOP/s
   1       2.306    116.408    116.408
   2       2.334    115.017    115.017
   3       2.297    116.855    116.855
   4       2.295    116.964    116.964
   5      16.692     16.082     16.082
   6      11.468     23.407     23.407
   7       2.299    116.758    116.758
   8       2.291    117.171    117.171
   9       2.295    116.964    116.964
  10      10.792     24.874     24.874
So my question is:
Why are iterations 5, 6, and 10 showing worse results than the other iterations?
My suspicion is that even though the data is placed in high-bandwidth memory (MCDRAM), the code itself executes from cache and might be taking a hit there. Although the overall program is quite small (about 54 KB), if it is running on a shared server, its instructions might occasionally be evicted from the instruction cache, resulting in this performance penalty on some iterations.
This matrix-multiplication code is very inefficient!
Indeed, the line in2[k*M2Rdim+j]
is likely to cause cache thrashing, and thus high instability in the computation timing, if cache lines often have to be reloaded from MCDRAM. Although MCDRAM has high bandwidth, it also has high latency (similar to that of DDR RAM). The latency is probably a huge issue in this case.
Specifically, striding down one column of a matrix is terrible for spatial locality, and it is even worse when the matrix dimension is a power of 2: all those cache lines alias to the same set in a set-associative cache, so you are likely to get conflict misses. This can lead to cache thrashing even with a small working set.
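For instance, simply interchanging the two innermost loops (i-k-j order) makes every access to in2 and out unit-stride. This is only a sketch based on the variable names from your code (I assume the matrices are square, so the dimension macros are interchangeable, as they appear to be in your indexing):

void MatrixMultiplyIKJ(P *in1, P *in2, P *out, int myRank, int nRanks)
{
    for (int i = startingRow; i < endingRow; i++)
    {
        for (int j = 0; j < M1Cdim; j++)
            out[i*M1Rdim+j] = 0;                         // clear the output row first
        for (int k = 0; k < M2Rdim; k++)
        {
            const P a = in1[i*M1Rdim+k];                 // reused across the whole inner loop
            for (int j = 0; j < M1Cdim; j++)
                out[i*M1Rdim+j] += a * in2[k*M2Rdim+j];  // unit-stride reads of in2, unit-stride writes of out
        }
    }
}

The column-wise walk over in2 disappears, so the power-of-2 aliasing problem is largely avoided as well.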
Thus, please use BLAS functions (from MKL, OpenBLAS, ATLAS, etc.)! They are far more optimized than this. If you cannot, please consider improving this code yourself; you can find a quite good explanation of how to do that here. I think a speed-up of more than 10x is easily achievable.
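For reference, the BLAS route boils down to a single call. Here is a minimal, hypothetical sketch using the CBLAS interface (the names matmul_blas, A, B, C, N and the assumption of double-precision, row-major, square matrices are mine, not from your code):

#include <mkl.h>   // Intel MKL; with OpenBLAS/ATLAS include <cblas.h> instead

// C = A * B for row-major N x N matrices of doubles
void matmul_blas(int N, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0, A, N,
                     B, N,
                0.0, C, N);
}

You could still split the rows among MPI ranks exactly as you do now and call this on each rank's block, passing the local row count as the first dimension instead of N.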
I also advise you to profile your code using tools like perf or VTune, which let you analyze hardware events (such as L1/L2 cache operations), confirm or reject the cache-thrashing hypothesis, and guide further improvements to this code.
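For example, something along these lines with perf (your_program is a placeholder; the exact event names vary by CPU, and perf list shows what is available on your machine):

perf stat -e cache-references,cache-misses ./your_program

A high cache-miss ratio would support the thrashing hypothesis; VTune's memory-access analysis gives a similar picture with more detail.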