I have the following code for the sparse matrix-vector (SpMV) product in C assuming a CSR storage format:
void dcsrmv(SparseMatrixCSR *A, double *x, double *y) {
for (int i=0; i<A->m; i++) {
double sum = 0.;
for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
sum += A->val[jj] * x[A->col_idx[jj]];
}
y[i] = sum;
}
}
when I run this function 100 times with the array ecology1
from the SuiteSparse Matrix Collection, the average runtime is 5.9 ms, whereas if I run MKL's mkl_sparse_d_mv
function with the same matrix, I get an average runtime of 8.4 ms.
Now, I would like to have a sparse matrix-multiple vectors (SpMM) product function for the same storage format, which I coded following 3 approaches:
void dcsrmm0(SparseMatrixCSR *A, double *X, double *Y, int nvec) {
for (int vec=0; vec<nvec; vec++) {
double *x = X + vec * A->n;
double *y = Y + vec * A->m;
for (int i=0; i<A->m; i++) {
double sum = 0.;
for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
sum += A->val[jj] * x[A->col_idx[jj]];
}
y[i] = sum;
}
}
}
void dcsrmm1(SparseMatrixCSR *A, double *X, double *Y, int nvec) {
for (int i=0; i<A->m; i++) {
for (int vec=0; vec<nvec; vec++) {
double sum = 0.;
double *x = X + vec * A->n;
double *y = Y + vec * A->m;
for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
sum += A->val[jj] * x[A->col_idx[jj]];
}
y[i] = sum;
}
}
}
and
void dcsrmm2(SparseMatrixCSR *A, double *X, double *Y, int nvec) {
double *x[nvec], *y[nvec];
for (int vec=0; vec<nvec; vec++) x[vec] = X + vec * A->n;
for (int vec=0; vec<nvec; vec++) y[vec] = Y + vec * A->m;
double sum[nvec];
for (int i=0; i<A->m; i++) {
for (int vec=0; vec<nvec; vec++) sum[vec] = 0.;
for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
for (int vec=0; vec<nvec; vec++) {
sum[vec] += A->val[jj] * x[vec][A->col_idx[jj]];
}
}
for (int vec=0; vec<nvec; vec++) y[vec][i] = sum[vec];
}
}
where X
and Y
are stored by column.
Then I benchmark these functions with nvec=10
. In which case dcsrmm0
takes 49.8 ms on average, dcsrmm1
takes 38.3 ms on average, and dcsrmm2
takes 34.8 ms on average, while mkl_sparse_d_mm
takes only 26.7 ms.
So my question is, why is it that my implementation of dcsrmv
does better than mkl_sparse_d_mv
, but my dcsrmm0-2
functions lag so much behind mkl_sparse_d_mm
? What can I do to improve these functions?
I found that the following alternative gives me the best results:
void dcsrmm2_improved(SparseMatrixCSR *A, double *X, double *Y, int nvec) {
double *x[nvec];
for (int vec=0; vec<nvec; vec++) x[vec] = X + vec * A->n;
for (int i=0; i<A->m; i++) {
double *y = Y + i;
for (int vec=0; vec<nvec; vec++) y[vec * A->m] = 0.;
for (int jj=A->row_start[i]; jj<A->row_start[i+1]; jj++) {
double val = A->val[jj];
int j = A->col_idx[jj];
for (int vec=0; vec<nvec; vec++) {
y[vec * A->m] += val * x[vec][j];
}
}
}
}
when I run this function 100 times with the array ecology1
from the SuiteSparse Matrix Collection, I get an average runtime of 27.6 ms, which is reasonably close to the 26.7 ms obtained with mkl_sparse_d_mm
using the CSR format.
If anyone knows how to get further improvements, let me know.