linear-algebrasparse-matrixblasintel-mkl

What is wrong with my sparse matrix-multiple vectors (SpMM) product function for CSR?


I have the following code for the sparse matrix-vector (SpMV) product in C assuming a CSR storage format:

void dcsrmv(SparseMatrixCSR *A, double *x, double *y) {
  for (int i=0; i<A->m; i++) {
    double sum = 0.;
    for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
      sum += A->val[jj] * x[A->col_idx[jj]];
    }
    y[i] = sum;
  }
}

when I run this function 100 times with the array ecology1 from the SuiteSparse Matrix Collection, the average runtime is 5.9 ms, whereas if I run MKL's mkl_sparse_d_mv function with the same matrix, I get an average runtime of 8.4 ms.

Now, I would like to have a sparse matrix-multiple vectors (SpMM) product function for the same storage format, which I coded following 3 approaches:

void dcsrmm0(SparseMatrixCSR *A, double *X, double *Y, int nvec) {
  for (int vec=0; vec<nvec; vec++) {
    double *x = X + vec * A->n;
    double *y = Y + vec * A->m;
    for (int i=0; i<A->m; i++) {
      double sum = 0.;
      for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
        sum += A->val[jj] * x[A->col_idx[jj]];
      }
      y[i] = sum;
    }
  }
}
void dcsrmm1(SparseMatrixCSR *A, double *X, double *Y, int nvec) {
  for (int i=0; i<A->m; i++) {
    for (int vec=0; vec<nvec; vec++) {
      double sum = 0.;
      double *x = X + vec * A->n;
      double *y = Y + vec * A->m;
      for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
        sum += A->val[jj] * x[A->col_idx[jj]];
      }
      y[i] = sum;
    }
  }
}

and

void dcsrmm2(SparseMatrixCSR *A, double *X, double *Y, int nvec) {
  double *x[nvec], *y[nvec];
  for (int vec=0; vec<nvec; vec++) x[vec] = X + vec * A->n;
  for (int vec=0; vec<nvec; vec++) y[vec] = Y + vec * A->m;
  double sum[nvec];
  for (int i=0; i<A->m; i++) {
    for (int vec=0; vec<nvec; vec++) sum[vec] = 0.;
    for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
      for (int vec=0; vec<nvec; vec++) {
        sum[vec] += A->val[jj] * x[vec][A->col_idx[jj]];
      }
    }
    for (int vec=0; vec<nvec; vec++) y[vec][i] = sum[vec];
  }
}

where X and Y are stored by column.

Then I benchmark these functions with nvec=10. In which case dcsrmm0 takes 49.8 ms on average, dcsrmm1 takes 38.3 ms on average, and dcsrmm2 takes 34.8 ms on average, while mkl_sparse_d_mm takes only 26.7 ms.

So my question is, why is it that my implementation of dcsrmv does better than mkl_sparse_d_mv, but my dcsrmm0-2 functions lag so much behind mkl_sparse_d_mm? What can I do to improve these functions?


Solution

  • I found that the following alternative gives me the best results:

    void dcsrmm2_improved(SparseMatrixCSR *A, double *X, double *Y, int nvec) {
      double *x[nvec];
      for (int vec=0; vec<nvec; vec++) x[vec] = X + vec * A->n;
      for (int i=0; i<A->m; i++) {
        double *y = Y + i;
        for (int vec=0; vec<nvec; vec++) y[vec * A->m] = 0.;
        for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
          double val = A->val[jj];
          int j = A->col_idx[jj];
          for (int vec=0; vec<nvec; vec++) {
            y[vec * A->m] += val * x[vec][j];
          }
        }
      }
    }
    

    when I run this function 100 times with the array ecology1 from the SuiteSparse Matrix Collection, I get an average runtime of 27.6 ms, which is reasonably close to the 26.7 ms obtained with mkl_sparse_d_mm using the CSR format.

    If anyone knows how to get further improvements, let me know.