I would like to multiply two matrices A and B using CUDA's cublasSgemm
function.
However, A is in row-major format, B is in column-major format, and I want the output to be in row-major format.
Is that possible, and if so, how do I have to set these parameters:
cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const float *alpha,
const float *A, int lda,
const float *B, int ldb,
const float *beta,
float *C, int ldc
if, e.g., A is 1x4 and B is 4x16, such that C is 1x16?
The example that you give doesn't make much sense because it's a matrix-vector product, so I'll pick my own example to make things clearer:
C = A * B
cuBLAS uses column-major ordering, so talking about row-major order is distracting; it's better to say that those matrices are transposed. Take A as 8x4 (row-major) and B as 4x6 (column-major), so that C = A * B is 8x6 (row-major). In cuBLAS's column-major view, the buffers then hold A = [4, 8], B = [4, 6], and C = [6, 8]. That makes it easy to see that you want C = transpose(B) * A (all shapes as cuBLAS sees them).
In CuBLAS, that would be
cublasSgemm(handle,
            CUBLAS_OP_T /* transpose the left side (B) */,
            CUBLAS_OP_N /* no transposition on the right side (A) */,
            6 /* m: rows of C */, 8 /* n: columns of C */, 4 /* k: inner dimension */,
            &alpha /* 1.f */, B /* left side */, ld_B,
            A /* right side */, ld_A, &beta /* 0.f */,
            C, ld_C);
ld_A, ld_B, and ld_C are the leading dimensions, also called the outer strides: the number of elements between the start of one column and the start of the next (since the matrices are column-major). Due to padding, or when working on a slice of a larger matrix, this can be larger than the "logical" number of rows.
In other words, if we have float* A and A[0] is the first row, first column, then A[ld_A] is the first row, second column, and A[ld_A + 1] is the second row, second column.
If you use cudaMalloc3D to allocate suitably padded matrices, the leading dimension is pitch / sizeof(float), where pitch is a member of the cudaPitchedPtr it fills in.