Tags: c++, matrix, lapack, inversion, magma

Why are my matrix inversions so slow with LAPACKE in C++? MAGMA as an alternative, and how to set it up


I am using LAPACK to invert a matrix. I pass the matrices by reference, i.e. I work on their addresses. Below is the function, with an input matrix and an output matrix both passed by reference.

The issue is that I have to convert F_matrix into a 1D array, and I think this wastes performance at runtime. How could I get rid of this extra step, which I suspect becomes costly if I call matrix_inverse_lapack many times?

Here is the function in question:

// Passing matrices by reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {

  // Index for loop and arrays
  int i, j, ip, idx;

  // Size of F_matrix
  int N = F_matrix.size();

  int *IPIV = new int[N];

  // Statement of main array to inverse
  double *arr = new double[N*N];

  // Output diagonal block (allocated here but not used in this function)
  double *diag = new double[N];

  for (i = 0; i<N; i++){
    for (j = 0; j<N; j++){
      idx = i*N + j;
      arr[idx] = F_matrix[i][j];
    }
  }

  // LAPACKE routines
  int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
  int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);

  for (i = 0; i<N; i++){
    for (j = 0; j<N; j++){
      idx = i*N + j;
      F_output[i][j] = arr[idx];
    }
  }

  delete[] IPIV;
  delete[] arr;
  delete[] diag;
}

For example, I call it this way:

vector<vector<double>> CO_CL(lsize*(2*Dim_x+Dim_y), vector<double>(lsize*(2*Dim_x+Dim_y), 0));

... some code

matrix_inverse_lapack(CO_CL, CO_CL);

The inversion performance is not what I expected; I think this is due to the 2D -> 1D conversion that I described in matrix_inverse_lapack.

Update

I was advised to install MAGMA on my macOS Big Sur 11.3 machine, but I am having a lot of difficulty setting it up.

I have an AMD Radeon Pro 5600M graphics card, and the OpenCL framework that ships with Big Sur is already installed (maybe I am wrong about that). Could anyone describe the procedure to follow to install MAGMA? I saw that a piece of software called Magma exists at http://magma.maths.usyd.edu.au/magma/, but it is really expensive and does not correspond to what I want: I just need the SDK (headers and libraries), if possible built for my GPU card. I have also installed the Intel oneAPI SDK on my Mac; maybe I could link it to a MAGMA installation.

I found another link, https://icl.utk.edu/magma/software/index.html, where MAGMA seems to be freely available: it has no connection to the non-free product above, does it?


Solution

  • First of all, let me point out that OP did not provide all the necessary data. The program is almost complete, but it is not a minimal, reproducible example. This matters because (a) it wastes time and (b) it hides potentially relevant information, e.g. about the matrix initialization. Second, OP did not provide any details on the compilation, which, again, may be relevant. Last but not least, OP did not check the status codes returned by the Lapack functions for possible errors, and this could also be important for a correct interpretation of the results.

    Let's start from a minimal reproducible example:

    #include <lapacke.h>
    #include <vector>
    #include <chrono>
    #include <iostream>
    
    using Matrix = std::vector<std::vector<double>>;
    
    std::ostream &operator<<(std::ostream &out, Matrix const &v)
    {
        const auto size = std::min<int>(10, v.size());
        for (int i = 0; i < size; i++)
        {
            for (int j = 0; j < size; j++)
            {
                out << v[i][j] << "\t";
            }
            if (size < std::ssize(v)) out << "...";
            out << "\n";
        }
        return out;
    }
    
    void matrix_inverse_lapack(Matrix const &F_matrix, Matrix &F_output, std::vector<int> &IPIV_buffer,
                               std::vector<double> &matrix_buffer)
    {
        //  std::cout << F_matrix << "\n";
        auto t0 = std::chrono::steady_clock::now();
    
        const int N = F_matrix.size();
    
        for (int i = 0; i < N; i++)
        {
            for (int j = 0; j < N; j++)
            {
                auto idx = i * N + j;
                matrix_buffer[idx] = F_matrix[i][j];
            }
        }
    
        auto t1 = std::chrono::steady_clock::now();
        // LAPACKE routines
        int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, matrix_buffer.data(), N, IPIV_buffer.data());
        int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, matrix_buffer.data(), N, IPIV_buffer.data());
        auto t2 = std::chrono::steady_clock::now();
    
        for (int i = 0; i < N; i++)
        {
            for (int j = 0; j < N; j++)
            {
                auto idx = i * N + j;
                F_output[i][j] = matrix_buffer[idx];
            }
        }
        auto t3 = std::chrono::steady_clock::now();
    
        auto whole_fun_time = std::chrono::duration<double>(t3 - t0).count();
        auto lapack_time = std::chrono::duration<double>(t2 - t1).count();
        //   std::cout << F_output << "\n";
        std::cout << "status: " << info1 << "\t" << info2 << "\t" << (info1 == 0 && info2 == 0 ? "Success" : "Failure")
                  << "\n";
        std::cout << "whole function:            " << whole_fun_time << "\n";
        std::cout << "LAPACKE matrix operations: " << lapack_time << "\n";
        std::cout << "conversion:                " << (whole_fun_time - lapack_time) / whole_fun_time * 100.0 << "%\n";
    }
    
    int main(int argc, const char *argv[])
    {
        const int M = 5;  // number of test repetitions
    
        const int N = (argc > 1) ? std::stoi(argv[1]) : 10;
        std::cout << "Matrix size = " << N << "\n";
    
        std::vector<int> IPIV_buffer(N);
        std::vector<double> matrix_buffer(N * N);
    
        // Test matrix_inverse_lapack M times
        for (int i = 0; i < M; i++)
        {
            Matrix CO_CL(N);
            for (auto &v : CO_CL) v.resize(N);
    
            int idx = 1;
            for (auto &v : CO_CL)
            {
                for (auto &x : v)
                {
                    x = idx + 1.0 / idx;
                    idx++;
                }
            }
            matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);
        }
    }
    

    Here, operator<< is overkill, but it may be useful for anyone wanting to verify half-manually that the code works (by uncommenting the two commented-out std::cout lines in matrix_inverse_lapack), and ensuring that the code is correct is more important than measuring its performance.

    The code can be compiled with

    g++ -std=c++20 -O3  main.cpp -llapacke
    

    The program relies on an external library, lapacke, which needs to be installed (headers + binaries) for the code to compile and run.

    My code differs a bit from OP's: it is closer to "modern C++" in that it refrains from using raw owning pointers; I also pass external buffers into matrix_inverse_lapack to avoid repeatedly invoking the memory allocator and deallocator, a small improvement that measurably reduces the 2D-1D-2D conversion overhead. I also had to initialize the matrix and guess what value of N OP might have in mind. Finally, I added some timer readings for benchmarking. Apart from this, the logic of the code is unchanged.

    Now for a benchmark, carried out on a decent workstation. It lists the percentage of time the conversion takes relative to the total time taken by matrix_inverse_lapack; in other words, it measures the conversion overhead:

     N =   10, 3.5%   
     N =   30, 1.5%   
     N =  100, 1%   
     N =  300, 0.5%   
     N = 1000, 0.35%  
     N = 3000, 0.1%  
    

    The time taken by Lapack scales nicely as N³, as expected (data not shown): the LU factorization costs roughly 2N³/3 floating-point operations, and forming the inverse from it roughly 4N³/3 more. The time to invert a matrix is about 16 seconds for N = 3000 and about 5-6 μs (microseconds) for N = 10.

    I assume that an overhead of even 3% is completely acceptable. I believe OP uses matrices of size larger than 100, in which case an overhead at or below 1% is certainly acceptable.

    So what could OP (or anyone with a similar problem) have done wrong to obtain "unacceptable conversion overhead values"? Here is my short list:

    1. Improper compilation
    2. Improper matrix initialization (for tests)
    3. Improper benchmarking

    1. Improper compilation

    If one forgets to compile in Release mode, one ends up with optimized Lapacke competing against an unoptimized conversion. On my machine this peaks at a 33% overhead for N = 20.
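
    A small, compiler-specific safeguard (my own addition, not part of the benchmark above): GCC and Clang define __OPTIMIZE__ for any optimization level above -O0, so the program can at least warn when it was built without optimizations. A sketch, to be placed at the top of main():

        // Heuristic, GCC/Clang-specific: __OPTIMIZE__ is defined for -O1 and above,
        // so its absence suggests an unoptimized (debug) build.
        #ifndef __OPTIMIZE__
        std::cerr << "Warning: built without optimizations; timings will not be representative\n";
        #endif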

    2. Improper matrix initialization (for tests)

    If one initializes the matrix like this:

            for (auto &v : CO_CL)
            {
                for (auto &x : v)
                {
                    x = idx; // rather than, eg., idx + 1.0/idx
                    idx++;
                }
            }
    

    then the matrix is singular, and Lapack returns quite quickly with a nonzero status. This increases the relative importance of the conversion part. But singular matrices are not what one wants to invert (it is impossible to do).
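
    As a side note, the status codes are worth interpreting rather than just printing. Below is a minimal sketch, reusing info1 and info2 from the example above and following the documented dgetrf/dgetri conventions (0 means success, a positive value i means U(i,i) is exactly zero, a negative value means the corresponding argument was invalid):

        // Interpret the LAPACKE return codes instead of merely printing them.
        if (info1 > 0)
            std::cerr << "dgetrf: matrix is singular, U(" << info1 << "," << info1
                      << ") is exactly zero\n";
        else if (info1 < 0)
            std::cerr << "dgetrf: argument " << -info1 << " had an illegal value\n";
        if (info2 > 0)
            std::cerr << "dgetri: matrix is singular, its inverse could not be computed\n";
        else if (info2 < 0)
            std::cerr << "dgetri: argument " << -info2 << " had an illegal value\n";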

    3. Improper benchmarking

    Here's an example of the program output for N = 10:

     ./a.out 10 
     Matrix size = 10
     status: 0  0   Success
     whole function:            0.000127658
     LAPACKE matrix operations: 0.000126783
     conversion:                0.685425%
     status: 0  0   Success
     whole function:            1.2497e-05
     LAPACKE matrix operations: 1.2095e-05
     conversion:                3.21677%
     status: 0  0   Success
     whole function:            1.0535e-05
     LAPACKE matrix operations: 1.0197e-05
     conversion:                3.20835%
     status: 0  0   Success
     whole function:            9.741e-06
     LAPACKE matrix operations: 9.422e-06
     conversion:                3.27482%
     status: 0  0   Success
     whole function:            9.939e-06
     LAPACKE matrix operations: 9.618e-06
     conversion:                3.2297%
    

    One can see that the first call to the Lapack functions can take about 10 times longer than the subsequent calls. This is quite a stable pattern, as if Lapack needed some time to initialize itself. It can badly affect the measurements for small N.
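
    One simple remedy, sketched below (reusing the names from the main() above; the warm-up matrix itself is my own addition), is to perform one untimed warm-up inversion before the measured repetitions:

        // Warm-up: invert a well-conditioned dummy matrix once, before the measured
        // loop, so the expensive first call does not distort the statistics.
        Matrix warmup(N, std::vector<double>(N, 1.0));
        for (int i = 0; i < N; i++) warmup[i][i] += N;  // strictly diagonally dominant => non-singular
        matrix_inverse_lapack(warmup, warmup, IPIV_buffer, matrix_buffer);

        // ... then run the M measured repetitions exactly as before ...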

    4. What else can be done?

    OP appears to believe that his approach to 2D arrays is good, and that Lapack is strange and old-fashioned in packing a 2D array into a 1D array. No. It is Lapack that is right.

    If one defines a 2D array as vector<vector<double>>, one obtains one advantage: code simplicity. This comes at a price. Each row of such a matrix is allocated separately from all the others, so a 100-by-100 matrix may be stored in 100 completely different memory blocks. This has a bad impact on cache (and prefetcher) utilization. Lapack (and other linear algebra packages) enforces compaction of the data into a single, contiguous array precisely to minimize cache and prefetcher misses. If OP had used such an approach from the very beginning, he would probably have gained more than the 1-3% he now pays for the conversion.

    This compaction can be achieved in at least three ways: use a third-party linear algebra library (e.g. Eigen or Armadillo) that already stores its matrices contiguously; write a thin matrix class that keeps its elements in a single std::vector<double> and exposes 2D indexing (e.g. via operator()); or drop the 2D abstraction altogether and work directly on a std::vector<double> of length N*N, computing the index i*N + j by hand.

    Each of the above solutions forces some changes to the surrounding code, which OP might not want to make. It all depends on the code complexity and the expected performance gains.
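
    As an illustration of the second option, here is a minimal sketch; the class name FlatMatrix and its interface are mine, not taken from any existing library:

    #include <cstddef>
    #include <vector>

    // A thin matrix wrapper that keeps all elements in one contiguous buffer,
    // so the data pointer can be handed to LAPACKE directly, with no 2D <-> 1D copy.
    class FlatMatrix
    {
    public:
        explicit FlatMatrix(int n) : n_(n), data_(static_cast<std::size_t>(n) * n) {}

        double &operator()(int i, int j) { return data_[static_cast<std::size_t>(i) * n_ + j]; }
        double operator()(int i, int j) const { return data_[static_cast<std::size_t>(i) * n_ + j]; }

        double *data() { return data_.data(); }  // pointer accepted by LAPACKE_dgetrf/dgetri
        int size() const { return n_; }

    private:
        int n_;
        std::vector<double> data_;
    };

    // Usage sketch, with A of type FlatMatrix and IPIV a std::vector<int> of size A.size():
    // LAPACKE_dgetrf(LAPACK_ROW_MAJOR, A.size(), A.size(), A.data(), A.size(), IPIV.data());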

    EDIT: Alternative libraries

    An alternative approach is to use a library that is known to be highly optimized. Lapack by itself can be regarded as a standard interface with many implementations, and it may happen that OP uses an unoptimized one (e.g. the reference Netlib implementation rather than an optimized one such as OpenBLAS, Intel MKL, or Apple's Accelerate framework). Which library to choose may depend on the hardware/software platform OP is interested in, and it may vary in time.

    As of now (mid-2021), some decent suggestions are:

    If OP uses matrices of size at least 100, then the GPU-oriented MAGMA might be worth trying.

    An easier way (in terms of installation and running) might be a parallel CPU library, e.g. Plasma. Plasma is Lapack-compliant; it has been developed by a large team of people, including Jack Dongarra, and it should also be rather easy to compile locally, as it is provided with a CMake script.

    An example of how much a parallel, multicore CPU implementation can outperform a single-threaded implementation of the LU decomposition can be found, for example, here: https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Tummala-Spring-2014-CSE633.pdf (short answer: 5 to 15 times faster for matrices of size 1000).