I have read the following post
Accessing submatrices using LAPACK
I would like to do something similar when calling cuBLAS routines from Fortran.
Basically, I have a large matrix partitioned into 3 x 3 blocks, with the partitioning changing in each step of a loop. At the moment, I allocate/free device pointers for each individual sub-block and copy the relevant parts of the matrix to and from the device at each step. That creates a lot of overhead, which I am hoping to eliminate. Is that feasible?
You can do device pointer arithmetic in host code in just the same way as you would with host pointers. For example, if you had an M x N column-major matrix stored on the GPU:
float *A_d;
/* allocate the full M x N matrix once on the device */
cudaMalloc((void **)&A_d, (size_t)M * N * sizeof(float));
and you wanted to operate on a submatrix whose top-left corner is at (x1, y1), then you would pass A_d + x1 + M*y1
to any cuBLAS function that expects a matrix argument, keeping the leading dimension equal to M.
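
To make this concrete, here is a minimal sketch using the cuBLAS v2 API; the sizes M and N, the offsets x1 and y1, and the sub-block dimensions p, k, q are illustrative, and error checking is omitted. It runs an SGEMM on sub-blocks of matrices allocated once on the device, with no extra copies: the pointers are simply offset and the leading dimension stays M.

#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int M = 1024, N = 1024;        /* full column-major matrix size (illustrative) */
    const int x1 = 128, y1 = 256;        /* top-left corner of the sub-block             */
    const int p = 64, k = 64, q = 64;    /* sub-block dimensions                         */
    const float alpha = 1.0f, beta = 0.0f;

    float *A_d, *B_d, *C_d;
    cudaMalloc((void **)&A_d, (size_t)M * N * sizeof(float));
    cudaMalloc((void **)&B_d, (size_t)M * N * sizeof(float));
    cudaMalloc((void **)&C_d, (size_t)M * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Offset the device pointers to the sub-blocks; the leading dimension
       remains M because the sub-blocks still live inside the full matrices. */
    const float *subA = A_d + x1 + (size_t)M * y1;
    const float *subB = B_d + x1 + (size_t)M * y1;
    float       *subC = C_d + x1 + (size_t)M * y1;

    /* C_sub = alpha * A_sub * B_sub + beta * C_sub on a p x q block */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                p, q, k,
                &alpha, subA, M,
                        subB, M,
                &beta,  subC, M);

    cublasDestroy(handle);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
    return 0;
}

When the partitioning changes between loop iterations, only the offsets and block dimensions change; the device allocations and the transfer of the full matrix can stay in place, which removes the per-step allocate/copy overhead you describe.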