matrixcudacublas

cublasSgemm row-major multiplication


I'm trying to use cublasSgemm to multiply two non-square matrices that are stored in row-major order. I know that this function has one parameter where you can specify that if you want to transpose the matrices (CUBLAS_OP_T) but the result is stored in column-major order and I need it in row-major too.

Also,the code I have is not capable of multiplying non-square matrices with the parameter CUBLAS_OP_T. Only square or non-square with CUBLAS_OP_N.

Besides, I know that there are the option to declare the matrices in column-order with

#define IDX2C(i,j,ld) (((j)*(ld))+(i)) 

but this isn't an option because the matrices that I have to use will be set in another program.

I suppose that there is a lot of information on the internet, but I'm not able to find any answer to my question.

my code is the following:

int main(int argc, char *argv[]) {
  int m = 2;
  int k = 3;
  int n = 4;
  int print = 1;
  cudaError_t cudaStat;   // cudaMalloc status
  cublasStatus_t stat;    // CUBLAS functions status
  cublasHandle_t handle;  // CUBLAS context

  int i, j;

  float *a, *b, *c;

  // malloc for a,b,c...

  // define a mxk matrix a row by row
  int ind = 11;
  for (j = 0; j < m * k; j++) {
    a[j] = (float)ind++;
  }

  // define a kxn matrix b column by column
  ind = 11;
  for (j = 0; j < k * n; j++) {
    b[j] = (float)ind++;
  }

  // DEVICE
  float *d_a, *d_b, *d_c;

  // cudaMalloc for d_a, d_b, d_c...

  stat = cublasCreate(&handle);  // initialize CUBLAS context

  stat = cublasSetMatrix(m, k, sizeof(*a), a, m, d_a, m);
  stat = cublasSetMatrix(k, n, sizeof(*b), b, k, d_b, k);

  float al = 1.0f;
  float bet = 0.0f;
  stat = cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_T, m, n, k, &al, d_a, m,
                     d_b, k, &bet, d_c, m);

  stat = cublasGetMatrix(m, n, sizeof(*c), d_c, m, c, m);  // cp d_c - >c

  if (print == 1) {
    printf("\nc after Sgemm :\n");
    for (i = 0; i < m * n; i++) {
      printf("%f ", c[i]);  // print c after Sgemm
    }
  }

  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);
  cublasDestroy(handle);  // destroy CUBLAS context
  free(a);
  free(b);
  free(c);

  return EXIT_SUCCESS;
}

The output is the transpose of multiplying A * B, that is: (A * B)T.

But what I want is C = A * B in row-major order.

I hope someone can help me.


Solution

  • As you said, cuBLAS interprets matrices as column-major ordered, so when you execute

    cublasSgemm(handle,CUBLAS_OP_T,CUBLAS_OP_T,m,n,k,&al,d_a,m,d_b,k,&bet,d_c,m)
    

    you are correctly transposing each input (which was created in row-major form) in preparation for the column-major interpretation. The problem is that cuBLAS also dumps the result in column-major order.

    We will trick cuBLAS into computing , which will be outputted in column major order and will thus look like when we slyly interpret it in row-major order. So instead of computing AB = C, we do = . Luckily, and we already obtained by the very action of creating A and B in row-major order, so we can simply bypass the transposition with CUBLAS_OP_N. So change the line to

    // Note the order of supplying A and B are reversed to compute (BA)^T
    cublasSgemm(handle,CUBLAS_OP_N,CUBLAS_OP_N,n,m,k,&al,
                d_b,n,d_a,k,&bet,d_c,n)
    

    The example code you supplied should calculate

    and after running with the updated call to cublasSgemm, we get:

    c after Sgemm :
    548.000000 584.000000 620.000000 656.000000 683.000000 728.000000 773.000000 818.000000