I recently coded a multithreaded version of the sparse matrix-vector (SpMV) product using a COO storage format:
void dcoomv(SparseMatrixCOO *A, double *x, double *y) {
    for (int i=0; i<A->m; i++) y[i] = 0.;
    #pragma omp parallel for reduction(+:y[:A->m])
    for (int i=0; i<A->nnz; i++) {
        y[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
    }
}
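For completeness, SparseMatrixCOO is a plain COO container; a minimal definition consistent with the fields used here (the exact types in my code may differ) is:

typedef struct {
    int m, n;        /* number of rows and columns              */
    int nnz;         /* number of stored nonzeros               */
    int *row_idx;    /* row index of each nonzero, length nnz   */
    int *col_idx;    /* column index of each nonzero, length nnz */
    double *val;     /* value of each nonzero, length nnz       */
} SparseMatrixCOO;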
Given the potential for race conditions in the COO SpMV, this is about as good as it gets, and it performs reasonably well. It actually does better than MKL's mkl_sparse_d_mv with COO, which I believe is not parallelized at all.
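The obvious alternative is to protect each update with an atomic instead of the array reduction, along the lines of the sketch below (dcoomv_atomic is just an illustrative name); it avoids the per-thread copies of y that the reduction allocates, but every update then goes through an atomic read-modify-write:

/* Illustrative alternative: serialize conflicting updates with atomics
 * instead of giving each thread its own copy of y. */
void dcoomv_atomic(SparseMatrixCOO *A, double *x, double *y) {
    for (int i=0; i<A->m; i++) y[i] = 0.;
    #pragma omp parallel for
    for (int i=0; i<A->nnz; i++) {
        #pragma omp atomic
        y[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
    }
}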
Now, I would like to parallelize the sparse matrix-matrix (SpMM) product. I made two attempts. First, I wrote
void dcoomm_pll1(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    for (int i=0; i<(A->m * nvec); i++) Y[i] = 0.;
    #pragma omp parallel for reduction(+:Y[:A->m*nvec])
    for (int i=0; i<A->nnz; i++) {
        for (int vec=0; vec<nvec; vec++) {
            Y[A->row_idx[i] + vec * A->m] += A->val[i] * X[A->col_idx[i] + vec * A->n];
        }
    }
}
This, however, does not speed up the code but rather slows it down when run in parallel.
Second, I did
void dcoomm_pll2(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    for (int i=0; i<(A->m * nvec); i++) Y[i] = 0.;
    for (int i=0; i<A->nnz; i++) {
        #pragma omp parallel for
        for (int vec=0; vec<nvec; vec++) {
            Y[A->row_idx[i] + vec * A->m] += A->val[i] * X[A->col_idx[i] + vec * A->n];
        }
    }
}
which, again, leads to some slowdown.
As a reminder, the sequential implementation of dcoomm is given by
void dcoomm(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    for (int i=0; i<(A->m * nvec); i++) Y[i] = 0.;
    for (int i=0; i<A->nnz; i++) {
        for (int vec=0; vec<nvec; vec++) {
            Y[A->row_idx[i] + vec * A->m] += A->val[i] * X[A->col_idx[i] + vec * A->n];
        }
    }
}
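For clarity on the data layout: X and Y are dense blocks stored column by column, so column vec of X starts at X + vec * A->n and column vec of Y starts at Y + vec * A->m. A minimal driver under that assumption (run_spmm is just an illustrative name, error checking omitted) would be:

#include <stdlib.h>

/* Hypothetical driver showing the expected column-major layout of X and Y. */
void run_spmm(SparseMatrixCOO *A, int nvec) {
    double *X = malloc((size_t)A->n * nvec * sizeof *X); /* n x nvec, column-major */
    double *Y = malloc((size_t)A->m * nvec * sizeof *Y); /* m x nvec, column-major */
    for (int i = 0; i < A->n * nvec; i++) X[i] = 1.0;    /* arbitrary input        */
    dcoomm(A, X, Y, nvec);
    free(X);
    free(Y);
}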
My sequential version is about 30% faster than MKL's mkl_sparse_d_mm with COO. However, both my multithreaded versions are much slower than mkl_sparse_d_mm.
Since neither dcoomm_pll1 nor dcoomm_pll2 offers satisfactory performance, I wonder how to parallelize the COO SpMM so as to improve parallel performance.
The memory access pattern in your matrix-matrix multiplication code is not contiguous, which leads to more cache misses. Cache misses are already bad for sequential code, and even worse for multithreaded code. Why not call a sequential version of dcoomv nvec times and parallelize that loop instead? You don't even need a reduction.
void dcoomm_pll3(SparseMatrixCOO *A, double *X, double *Y, int nvec) {
    #pragma omp parallel for
    for (int vec=0; vec<nvec; vec++) {
        dcoomv_seq(A, X + vec * A->n, Y + vec * A->m);
    }
}
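Here dcoomv_seq is simply your dcoomv with the OpenMP pragma removed, since each call already runs inside its own thread, something like:

/* Sequential COO SpMV: same as dcoomv, but without the reduction pragma. */
void dcoomv_seq(SparseMatrixCOO *A, double *x, double *y) {
    for (int i=0; i<A->m; i++) y[i] = 0.;
    for (int i=0; i<A->nnz; i++) {
        y[A->row_idx[i]] += A->val[i] * x[A->col_idx[i]];
    }
}

Each thread then writes only to its own column of Y, so there are no conflicting updates and no need for a reduction or atomics.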