c++, linear-algebra, eigen, rcppeigen

Matrix multiplication of an Eigen Matrix for a subset of columns


What is the fastest method for matrix multiplication of an Eigen::Matrix over a random set of column indices?

Eigen::MatrixXd mat = Eigen::MatrixXd::Random(100, 1000);
// vector of random indices (linspaced here for brevity)
Eigen::VectorXi idx = Eigen::VectorXi::LinSpaced(8, 0, 999);

I'm using RcppEigen and R, which is still on a 3.x version of Eigen (no support for () with index arrays), and regardless, my understanding is that the () operator still performs a deep copy.
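
For reference, this is roughly what that slicing API looks like on Eigen 3.4 and later; a sketch only, since that version is not what RcppEigen provides here:

// Sketch (requires Eigen >= 3.4): select columns by an index array, then multiply.
Eigen::MatrixXd sub = mat(Eigen::all, idx);   // still materializes a copy on assignment
Eigen::MatrixXd a = sub * sub.transpose();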

Right now I'm doing a deep copy and generating a new matrix with data only for columns in idx:

template <typename T>
inline Eigen::Matrix<T, -1, -1> subset_cols(const Eigen::Matrix<T, -1, -1>& x, const Eigen::VectorXi& cols) {
    Eigen::Matrix<T, -1, -1> y(x.rows(), cols.size());
    for (Eigen::Index i = 0; i < cols.size(); ++i)
        y.col(i) = x.col(cols(i));
    return y;
}

and then doing matrix multiplication:

Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
Eigen::MatrixXd a = sub_mat * sub_mat.transpose();

a is what I want.

There must be some way to avoid a deep copy and instead use Eigen::Map?

Edit (5/9/22), in reply to @Markus, who proposed an approach using raw data access and Eigen::Map: the proposed solution is a bit slower than matrix multiplication on a deep copy. Benchmarking is done with Rcpp and R:

//[[Rcpp::depends(RcppClock)]]
//[[Rcpp::depends(RcppEigen)]]
#include <RcppClock.h>
#include <RcppEigen.h>

//[[Rcpp::export]]
void bench(Eigen::MatrixXd mat, Eigen::VectorXi idx){
  Rcpp::Clock clock;
  size_t reps = 100;
  while(reps-- > 0){
    clock.tick("copy");
    Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
    Eigen::MatrixXd a = sub_mat * sub_mat.transpose();
    clock.tock("copy");

    clock.tick("map");
    double *b_raw = new double[mat.rows() * mat.rows()];
    Eigen::Map<Eigen::MatrixXd> b(b_raw, mat.rows(), mat.rows());
    subset_AAt(b_raw, mat, idx); // @Markus's proposed routine (not shown), writing into b_raw
    clock.tock("map");
    delete[] b_raw; // free the buffer so the benchmark loop does not leak
  }
  clock.stop("clock");
}

Here are three runs of a 100,000-column matrix with 100 rows. We are doing matrix multiplication on (1) a subset of 10 columns, (2) a subset of 1000 columns, and (3) a subset of 10000 columns.

R:

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 10) - 1)

# Unit: microseconds 
# ticker   mean     sd   min    max neval
#    copy  31.65  4.376 30.15  69.46   100
#     map 113.46 21.355 68.54 166.29   100

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 1000) - 1)

#  Unit: milliseconds 
#  ticker  mean     sd   min   max neval
#    copy 2.361 0.5789 1.972  4.86   100
#     map 9.495 2.4201 7.962 19.90   100

bench(
  matrix(runif(100000 * 100), 100, 100000), 
  sample(100000, 10000) - 1)

#  Unit: milliseconds 
#  ticker   mean     sd    min   max neval
#    copy  23.04  2.774  20.95  42.4   100
#     map 378.14 19.424 351.56 492.0   100

I benchmarked on a few machines with similar results. The results above are from a good HPC node.

Edit (5/10/22): Here is a code snippet that performs matrix multiplication for a subset of columns about as quickly as any code that does not call a BLAS directly:

template <typename T>
Eigen::Matrix<T, -1, -1> subset_AAt(const Eigen::Matrix<T, -1, -1>& A, const Eigen::VectorXi& cols) {
  const size_t n = A.rows();
  // accumulate into zero-initialized storage; only one triangle is computed in the loop
  Eigen::Matrix<T, -1, -1> AAt = Eigen::Matrix<T, -1, -1>::Zero(n, n);
  T* AAt_data = AAt.data();
  for (Eigen::Index k = 0; k < cols.size(); ++k) {
    const T* A_data = A.data() + cols(k) * n;  // start of column cols(k) (column-major storage)
    for (size_t i = 0; i < n; ++i) {
      const T tmp_i = A_data[i];
      for (size_t j = 0; j <= i; ++j)
        AAt_data[i * n + j] += tmp_i * A_data[j];  // linear index i*n + j = (row j, col i)
    }
  }
  // mirror the computed triangle so the full symmetric matrix is returned
  AAt.template triangularView<Eigen::StrictlyLower>() = AAt.transpose();
  return AAt;
}
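
A quick sanity check for this routine (a sketch, reusing subset_cols from above) is to compare it against the copy-and-multiply reference:

// Sketch: subset_AAt should match the copy-and-multiply reference to floating-point accuracy.
Eigen::MatrixXd mat = Eigen::MatrixXd::Random(100, 1000);
Eigen::VectorXi idx = Eigen::VectorXi::LinSpaced(8, 0, 999);
Eigen::MatrixXd sub = subset_cols(mat, idx);
Eigen::MatrixXd ref = sub * sub.transpose();
bool ok = ref.isApprox(subset_AAt(mat, idx));  // expect ok == true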

Solution

  • Exploiting symmetry

    You can exploit that the resulting matrix will be symmetric like so:

    Eigen::MatrixXd sub_mat = subset_cols(mat, idx); // From your original post
    Eigen::MatrixXd a = Eigen::MatrixXd::Zero(mat.rows(), mat.rows());
    a.selfadjointView<Eigen::Lower>().rankUpdate(sub_mat); // (1)
    a.triangularView<Eigen::Upper>() = a.transpose(); // (2)
    

    Line (1) will compute a += sub_mat * sub_mat.transpose() for the lower part only. Line (2) will then write the lower part to the upper part. Also see the Eigen documentation for selfadjointView()/rankUpdate() and triangularView(). Of course, if you can live with only the lower part, step (2) can be omitted.
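
    If you do skip step (2), downstream products can still use the half-filled matrix through selfadjointView, which only reads the stored triangle; a minimal sketch (the vector x is just an example operand):

    // Sketch: with only the lower triangle of a filled, later products can go
    // through selfadjointView, which reads just that triangle.
    Eigen::VectorXd x = Eigen::VectorXd::Random(mat.rows());
    Eigen::VectorXd y = a.selfadjointView<Eigen::Lower>() * x;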

    For a 100x100000 matrix mat, I get a clear speed up over the plain product, both on Windows using MSVC and on Linux using clang with full optimizations and AVX; the exact factor depends on the number of subset columns.

  • Enabling parallelization

    Another way to speed up the computation is to enable parallelization by compiling with OpenMP. Eigen takes care of the rest. The code above that exploits the symmetry does not benefit from it, however. But the original code

    Eigen::MatrixXd sub_mat = subset_cols(mat, idx);
    Eigen::MatrixXd a = sub_mat * sub_mat.transpose();
    

    does.
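
    If you want to check or cap the number of threads from code rather than relying on the OpenMP environment, Eigen exposes a small runtime API for this; a minimal sketch (it only has an effect when compiled with OpenMP, e.g. -fopenmp):

    // Sketch: query/limit the threads Eigen's parallelized products may use.
    // Without OpenMP these calls are harmless and nbThreads() reports 1.
    Eigen::setNbThreads(4);                  // use at most 4 threads
    int threads = Eigen::nbThreads();        // how many threads Eigen will actually use
    Eigen::MatrixXd a = sub_mat * sub_mat.transpose();  // this product can now run in parallel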

    For a 100x100000 matrix mat, using clang on Linux, running with 4 threads (on 4 real cores) and comparing to a single thread, I again get a clear speed up. In other words, 4 cores or more outperform the symmetric method shown above except for a very small number of columns. Using only 2 cores was always slower. Note that using SMT hurt the performance in my tests, sometimes notably.

  • Other notes

    I already wrote this in the comments, but for the sake of completeness: Eigen::Map will not work because the strides between the selected columns are non-equidistant. Using slicing gives me ~10% better performance than your copying method on Linux with clang and gcc, but somewhat worse on MSVC. Also, as you noted, it is not available on the 3.3 branch of Eigen. There is a custom way to mimic it, but it always performed worse in my tests. Also, in my tests, it did not save any memory compared to the copying method.
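
    To illustrate the stride point: a single Eigen::Map can only describe columns that are equally spaced in memory, so it would help only in the special case of a regularly strided selection; a sketch (the spacing s, offset c0, and count k below are made-up values):

    // Sketch: a strided Map works only if the selected columns are every s-th column
    // starting at column c0; an arbitrary index set has no single outer stride.
    const Eigen::Index s = 7, c0 = 3, k = 10;
    Eigen::Map<const Eigen::MatrixXd, 0, Eigen::OuterStride<>>
        sub(mat.data() + c0 * mat.rows(), mat.rows(), k,
            Eigen::OuterStride<>(s * mat.rows()));
    Eigen::MatrixXd a = sub * sub.transpose();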

    I think it is hard to beat the copying method itself regarding performance, because Eigen matrices are column-major by default, meaning that copying a few columns is rather cheap. Moreover, without really knowing the details, I suspect that Eigen can then throw the full might of its optimizations at the copied matrix to compute the product and transpose, without having to deal with views or anything like that. This might give Eigen more chances for vectorization or cache locality.

    Apart from this, not only should optimizations be turned on, but the highest possible instruction set should also be used. Turning on AVX in my tests improved the performance by ~1.5x. Unfortunately, I cannot test AVX512.
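
    To verify which instruction sets your build of Eigen actually uses (e.g. after compiling with -march=native on gcc/clang), a small sketch:

    #include <iostream>
    #include <Eigen/Core>

    int main() {
      // Prints the SIMD instruction sets Eigen was compiled to use, e.g. "SSE, SSE2, ..., AVX".
      std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
      return 0;
    }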