I recently coded a multithreaded version of the sparse matrix-vector (SpMV) product using a COO storage format:
void dcoomv(SparseMatrixCOO *A, double *x, double *y) {
    for (int i=0; i<A->m; i++) y[i] = 0.;
    #pragma omp parallel for reduction(+:y[:A->m])
    for (int i=0; i<A->nnz; i++) {
        y[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
    }
}
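For completeness, SparseMatrixCOO is a plain COO container; a minimal definition consistent with the fields used here (the exact types in my code may differ) is:

typedef struct {
    int m, n;        /* number of rows and columns              */
    int nnz;         /* number of stored nonzeros               */
    int *row_idx;    /* row index of each nonzero, length nnz   */
    int *col_idx;    /* column index of each nonzero, length nnz */
    double *val;     /* value of each nonzero, length nnz       */
} SparseMatrixCOO;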
Given the potential for race conditions in the COO SpMV, this is about as good as it gets, and it performs reasonably well. It actually does better than MKL's mkl_sparse_d_mv with COO, which I believe is not parallelized at all.
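The obvious alternative is to protect each update with an atomic instead of the array reduction, along the lines of the sketch below (dcoomv_atomic is just an illustrative name); it avoids the per-thread copies of y that the reduction allocates, but every update then goes through an atomic read-modify-write:

/* Illustrative alternative: serialize conflicting updates with atomics
 * instead of giving each thread its own copy of y. */
void dcoomv_atomic(SparseMatrixCOO *A, double *x, double *y) {
    for (int i=0; i<A->m; i++) y[i] = 0.;
    #pragma omp parallel for
    for (int i=0; i<A->nnz; i++) {
        #pragma omp atomic
        y[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
    }
}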
Now, I would like to parallelize the sparse matrix-matrix (SpMM) product. I made two attempts. First, I wrote
void dcoomm_pll1(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    for (int i=0; i<(A->m * nvec); i++) Y[i] = 0.;
    #pragma omp parallel for reduction(+:Y[:A->m*nvec])
    for (int i=0; i<A->nnz; i++) {
        for (int vec=0; vec<nvec; vec++) {
            Y[A->row_idx[i] + vec * A->m] += A->val[i] * X[A->col_idx[i] + vec * A->n];
        }
    }
}
This, however, does not speed up the code but rather slows it down when run in parallel.
Second, I did
void dcoomm_pll2(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    for (int i=0; i<(A->m * nvec); i++) Y[i] = 0.;
    for (int i=0; i<A->nnz; i++) {
        #pragma omp parallel for
        for (int vec=0; vec<nvec; vec++) {
            Y[A->row_idx[i] + vec * A->m] += A->val[i] * X[A->col_idx[i] + vec * A->n];
        }
    }
}
which, again, leads to some slowdown.
As a reminder, the sequential implementation of dcoomm is given by
void dcoomm(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    for (int i=0; i<(A->m * nvec); i++) Y[i] = 0.;
    for (int i=0; i<A->nnz; i++) {
        for (int vec=0; vec<nvec; vec++) {
            Y[A->row_idx[i] + vec * A->m] += A->val[i] * X[A->col_idx[i] + vec * A->n];
        }
    }
}
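For clarity on the data layout: X and Y are dense blocks stored column by column, so column vec of X starts at X + vec * A->n and column vec of Y starts at Y + vec * A->m. A minimal driver under that assumption (run_spmm is just an illustrative name, error checking omitted) would be:

#include <stdlib.h>

/* Hypothetical driver showing the expected column-major layout of X and Y. */
void run_spmm(SparseMatrixCOO *A, int nvec) {
    double *X = malloc((size_t)A->n * nvec * sizeof *X); /* n x nvec, column-major */
    double *Y = malloc((size_t)A->m * nvec * sizeof *Y); /* m x nvec, column-major */
    for (int i = 0; i < A->n * nvec; i++) X[i] = 1.0;    /* arbitrary input        */
    dcoomm(A, X, Y, nvec);
    free(X);
    free(Y);
}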
My sequential version is about 30% faster than MKL's mkl_sparse_d_mm with COO. However, both my multithreaded versions are much slower than mkl_sparse_d_mm.
Since neither dcoomm_pll1 nor dcoomm_pll2 offers satisfactory performance, I wonder how to parallelize the COO SpMM so as to improve parallel performance.
The memory access pattern in your matrix-matrix multiplication code is not contiguous, which leads to more cache misses. Cache misses are already bad for sequential code, and even worse for multithreaded code. Why not call a sequential version of dcoomv nvec times and parallelize that loop instead? You don't even need a reduction.
void dcoomm_pll3(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    #pragma omp parallel for
    for (int vec=0; vec<nvec; vec++) {
        dcoomv_seq(A, X + vec * A->n, Y + vec * A->m);
    }
}
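Here dcoomv_seq is simply your dcoomv with the OpenMP pragma removed, since each call already runs inside its own thread, something like:

/* Sequential COO SpMV: same as dcoomv, but without the reduction pragma. */
void dcoomv_seq(SparseMatrixCOO *A, double *x, double *y) {
    for (int i=0; i<A->m; i++) y[i] = 0.;
    for (int i=0; i<A->nnz; i++) {
        y[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
    }
}

Each thread then writes only to its own column of Y, so there are no conflicting updates and no need for a reduction or atomics.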