[SOLVED] What is the most efficient way to compute the inverse of a general matrix using cuSolver?

What is the most efficient way to compute the inverse of a general matrix using cuSolver?

I plan to use getrf and getrs from the cuSolver package to solve AX = B with B = I, so that X is the inverse of A.

Is this the best way to solve this problem?
If so, what is the best way to create the column-major identity matrix B in device memory? It can be done trivially with a for loop on the host, but this would 1. take up a lot of memory and 2. be quite slow. Is there a faster way?

Note that cuSolver unfortunately does not provide getri, so I have to use getrs.

Solution

Until cuSolver provides the LAPACK routine getri, I think getrf followed by getrs is the best choice for inverting a large matrix.
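A minimal sketch of that getrf + getrs sequence in double precision, assuming a square n x n column-major matrix with leading dimension n. The helper name invert_with_getrf_getrs and its return convention are made up for illustration; the cuSolver calls themselves are the standard dense ones (cusolverDnDgetrf_bufferSize, cusolverDnDgetrf, cusolverDnDgetrs). It assumes d_B already holds the identity on entry and that the handle uses the default stream.

```cuda
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Hypothetical helper (not part of cuSolver).
// On entry:  d_A holds the n x n matrix A, d_B holds the n x n identity,
//            both column-major with leading dimension n.
// On exit:   d_A holds the LU factors of A, d_B holds the inverse of A.
// Returns 0 on success, -1 on any failure (including a singular A).
int invert_with_getrf_getrs(cusolverDnHandle_t handle,
                            double *d_A, double *d_B, int n) {
    // Ask cuSolver how much workspace the LU factorization needs.
    int lwork = 0;
    if (cusolverDnDgetrf_bufferSize(handle, n, n, d_A, n, &lwork) !=
        CUSOLVER_STATUS_SUCCESS)
        return -1;

    double *d_work = nullptr;  // getrf workspace
    int *d_ipiv = nullptr;     // pivot indices
    int *d_info = nullptr;     // status flag written by the device routines
    int h_info = -1;
    int rc = -1;

    if (cudaMalloc(&d_work, sizeof(double) * lwork) == cudaSuccess &&
        cudaMalloc(&d_ipiv, sizeof(int) * n) == cudaSuccess &&
        cudaMalloc(&d_info, sizeof(int)) == cudaSuccess) {
        // Step 1: P * A = L * U, stored in place in d_A.
        cusolverStatus_t st =
            cusolverDnDgetrf(handle, n, n, d_A, n, d_work, d_ipiv, d_info);
        bool ok = st == CUSOLVER_STATUS_SUCCESS &&
                  cudaMemcpy(&h_info, d_info, sizeof(int),
                             cudaMemcpyDeviceToHost) == cudaSuccess &&
                  h_info == 0;  // h_info > 0 means U is exactly singular

        if (ok) {
            // Step 2: solve A * X = B for all n columns of B = I at once.
            // getrs overwrites d_B with X, which is the inverse of A.
            st = cusolverDnDgetrs(handle, CUBLAS_OP_N, n, n,
                                  d_A, n, d_ipiv, d_B, n, d_info);
            ok = st == CUSOLVER_STATUS_SUCCESS &&
                 cudaMemcpy(&h_info, d_info, sizeof(int),
                            cudaMemcpyDeviceToHost) == cudaSuccess &&
                 h_info == 0;
        }
        if (ok) rc = 0;
    }

    // cudaFree(nullptr) is a no-op, so this is safe on partial allocation.
    cudaFree(d_work);
    cudaFree(d_ipiv);
    cudaFree(d_info);
    return rc;
}
```

This would typically be compiled with nvcc and linked with -lcusolver. Because getrf overwrites A and getrs overwrites B, keep a copy of A if it is needed afterwards.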

The matrix B has the same size as A, and getrs overwrites it with the solution, so B is also the output buffer. I don't think allocating B makes this task consume much more memory than its input/output data already requires.

The complexity of getrf is O(n^3), and getrs is O(n^2) per right-hand side (so O(n^3) for the n columns of B = I), while setting B = I is O(n^2) for zeroing the matrix plus O(n) for writing the diagonal. I don't think building B should be a bottleneck of the whole procedure. If you share your implementation, we can check where the problem might be.
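As a rough illustration of the O(n^2) + O(n) cost of setting B = I, here is one way to build the identity directly in device memory without a host-side loop or a host copy. The names set_diagonal and set_identity are hypothetical; the sketch assumes double precision and a leading dimension equal to n. Since the identity is symmetric, column-major versus row-major storage makes no difference here.

```cuda
#include <cuda_runtime.h>

// One thread per diagonal element: writes 1.0 at (i, i) of a column-major
// matrix with leading dimension ldb. This is the O(n) part.
__global__ void set_diagonal(double *B, int n, int ldb) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) B[(size_t)i * ldb + i] = 1.0;
}

// Hypothetical helper: fills the n x n device matrix d_B with the identity.
cudaError_t set_identity(double *d_B, int n, cudaStream_t stream = 0) {
    // O(n^2) part: all-zero bytes represent +0.0 in IEEE-754 doubles,
    // so a plain memset zeroes the whole matrix.
    cudaError_t err =
        cudaMemsetAsync(d_B, 0, sizeof(double) * (size_t)n * n, stream);
    if (err != cudaSuccess) return err;
    if (n == 0) return cudaSuccess;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    set_diagonal<<<blocks, threads, 0, stream>>>(d_B, n, n);
    return cudaGetLastError();  // reports launch-configuration errors
}
```

Both operations are issued on the same stream, so the diagonal kernel runs after the memset completes, and nothing is transferred from the host.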