c++ fortran matrix-multiplication blas

How does BLAS get such extreme performance?


Out of curiosity I decided to benchmark my own matrix multiplication function against the BLAS implementation. I was, to say the least, surprised at the result:

Custom Implementation, 10 trials of 1000x1000 matrix multiplication:

Took: 15.76542 seconds.

BLAS Implementation, 10 trials of 1000x1000 matrix multiplication:

Took: 1.32432 seconds.

This is using single precision floating point numbers.

My Implementation:

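#include <cstring>    // std::memset
#include <stdexcept>  // std::runtime_error

// C = A*B for column-major matrices:
// A is ADim1 x ADim2, B is BDim1 x BDim2, C is ADim1 x BDim2.
template<class ValT>
void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
{
    if ( ADim2!=BDim1 )
        throw std::runtime_error("Error sizes off");

    std::memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
    int cc2,cc1,cr1;
    for ( cc2=0 ; cc2<BDim2 ; ++cc2 )
        for ( cc1=0 ; cc1<ADim2 ; ++cc1 )
            for ( cr1=0 ; cr1<ADim1 ; ++cr1 )
                // The column stride of C is ADim1 (its row count), not ADim2;
                // the two only coincide for square A.
                C[cc2*ADim1+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
}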

I have two questions:

  1. A matrix-matrix multiplication, say nxm * mxn, requires n*n*m multiplications, so in the case above 1000^3 or 1e9 operations. How is it possible for BLAS to do 10*1e9 operations in 1.32 seconds on my 2.6 GHz processor? Even if a multiplication were a single operation and nothing else was being done, it should take ~4 seconds.
  2. Why is my implementation so much slower?

Solution

  • A good starting point is the great book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.

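    BLAS is divided into three levels:

      • Level 1: vector-vector operations (for example a dot product or y := alpha*x + y), which do O(n) work on O(n) data.
      • Level 2: matrix-vector operations (for example a matrix-vector product), which do O(n^2) work on O(n^2) data.
      • Level 3: matrix-matrix operations (for example the matrix-matrix product), which do O(n^3) work on only O(n^2) data. This ratio is what makes a cache-optimized implementation pay off.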

    By the way, most (or even all) of the high performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C, with its performance-critical parts in assembly. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface so that they can be linked against LAPACK (LAPACK gains all its performance from BLAS).

    Optimizing compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).

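    IMHO no BLAS implementation uses algorithms like the Coppersmith–Winograd algorithm or the Strassen algorithm. The likely reasons are that these algorithms have weaker numerical stability guarantees than the standard algorithm, that their better asymptotic complexity hides large constant factors and so only pays off for very large matrices, and that they are harder to implement in a cache-friendly way.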

    Edit/Update:

    The new and groundbreaking papers for this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their paper. Actually I implemented several variants of the matrix-matrix product. The simplest variant is written entirely in plain C and has fewer than 450 lines of code. All the other variants merely optimize the loops:

        for (l=0; l<MR*NR; ++l) {
            AB[l] = 0;
        }
        for (l=0; l<kc; ++l) {
            for (j=0; j<NR; ++j) {
                for (i=0; i<MR; ++i) {
                    AB[i+j*MR] += A[i]*B[j];
                }
            }
            A += MR;
            B += NR;
        }
    

    The overall performance of the matrix-matrix product only depends on these loops. About 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:

    ulmBLAS: Tutorial on GEMM (Matrix-Matrix Product)

    Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can gain such performance. And why it does not matter whether you use row or column major storage!
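    To make the last point concrete, here is a minimal sketch of how such a micro-kernel can be driven. It is not the ulmBLAS or BLIS code. The names MR, NR, kc and AB come from the loops above; pack_A, pack_B, micro_kernel and gemm_packed are hypothetical names, MR = NR = 4 is an arbitrary choice, and the sketch assumes that m is a multiple of MR and n a multiple of NR. It also uses the whole k dimension as kc, leaving out the outer cache-blocking loops that a real implementation adds around this.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t MR = 4, NR = 4;   // micro-tile size (assumed values)

// Copy an MR x kc panel of A into a contiguous buffer. The strides incRow and
// incCol describe the storage: (1, ld) is column-major, (ld, 1) is row-major.
void pack_A(std::size_t kc, const float* A, std::size_t incRow,
            std::size_t incCol, float* buf)
{
    for (std::size_t l = 0; l < kc; ++l)
        for (std::size_t i = 0; i < MR; ++i)
            buf[l * MR + i] = A[i * incRow + l * incCol];
}

// Copy a kc x NR panel of B into a contiguous buffer.
void pack_B(std::size_t kc, const float* B, std::size_t incRow,
            std::size_t incCol, float* buf)
{
    for (std::size_t l = 0; l < kc; ++l)
        for (std::size_t j = 0; j < NR; ++j)
            buf[l * NR + j] = B[l * incRow + j * incCol];
}

// The loops from the answer: AB := (packed A panel) * (packed B panel).
void micro_kernel(std::size_t kc, const float* A, const float* B, float* AB)
{
    for (std::size_t l = 0; l < MR * NR; ++l)
        AB[l] = 0;
    for (std::size_t l = 0; l < kc; ++l) {
        for (std::size_t j = 0; j < NR; ++j)
            for (std::size_t i = 0; i < MR; ++i)
                AB[i + j * MR] += A[i] * B[j];
        A += MR;
        B += NR;
    }
}

// C := A*B for an m x k matrix A and a k x n matrix B with arbitrary strides.
// Assumes m % MR == 0 and n % NR == 0 (no edge handling in this sketch).
void gemm_packed(std::size_t m, std::size_t n, std::size_t k,
                 const float* A, std::size_t incRowA, std::size_t incColA,
                 const float* B, std::size_t incRowB, std::size_t incColB,
                 float* C, std::size_t incRowC, std::size_t incColC)
{
    std::vector<float> bufA(MR * k), bufB(NR * k);
    float AB[MR * NR];
    for (std::size_t j = 0; j < n; j += NR) {
        pack_B(k, B + j * incColB, incRowB, incColB, bufB.data());
        for (std::size_t i = 0; i < m; i += MR) {
            pack_A(k, A + i * incRowA, incRowA, incColA, bufA.data());
            micro_kernel(k, bufA.data(), bufB.data(), AB);
            // Write the MR x NR result tile back into C.
            for (std::size_t jj = 0; jj < NR; ++jj)
                for (std::size_t ii = 0; ii < MR; ++ii)
                    C[(i + ii) * incRowC + (j + jj) * incColC] = AB[ii + jj * MR];
        }
    }
}
```

    Only the packing routines and the write-back ever look at the strides. The micro-kernel always reads the same contiguous panel layout, so the hot loops run the same code whether the caller's matrices are row-major or column-major.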

    The final benchmarks are here (we called our project ulmBLAS):

    Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen

    Another Edit/Update:

    I also wrote some tutorials on how BLAS is used for numerical linear algebra problems like solving a system of linear equations:

    High Performance LU Factorization

    (This LU factorization is for example used by Matlab for solving a system of linear equations.)

    I hope to find time to extend the tutorial to describe and demonstrate how to realise a highly scalable parallel implementation of the LU factorization like in PLASMA.

    Ok, here you go: Coding a Cache Optimized Parallel LU Factorization

    P.S.: I also ran some experiments on improving the performance of uBLAS. It is actually pretty simple to boost (yeah, play on words :) ) the performance of uBLAS:

    Experiments on uBLAS.

    Here is a similar project with BLAZE:

    Experiments on BLAZE.