I have in mind to use getrf and getrs from the cuSolver package to solve AX=B with B=I, so that X is the inverse of A.
Is this the best way to solve this problem?
If so, what is the best way to create the column-major identity matrix B in device memory? It can be done trivially using a for loop, but this would (1) take up a lot of memory and (2) be quite slow. Is there a faster way?
Note that cuSolver unfortunately does not provide getri, so I have to use getrs.
Until CUDA provides the LAPACK API getri, I think getrf followed by getrs is the best choice for large matrix inversion.
The matrix B is of the same size as A, so I don't think allocating B makes this task consume much more memory than its input/output data does.
The complexity of getrf is O(n^3) and that of getrs is O(n^2) per right-hand side (so O(n^3) for the n columns of B=I), while setting B=I costs O(n^2) + O(n): one pass to zero the matrix and one to write the diagonal. I don't think it should be the bottleneck of the whole procedure. You may share your implementation so we can check where the problem is.
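For reference, here is a minimal sketch of the approach described above, assuming double precision and a dense column-major matrix with leading dimension n. The kernel name set_diagonal, the error-checking macros, and the 2x2 test matrix are hypothetical choices made for illustration. B is built directly in device memory with cudaMemset (the O(n^2) part) plus a one-thread-per-row kernel that writes the diagonal (the O(n) part), so no host-side loop or host-to-device copy of B is needed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Hypothetical helper macros: abort on the first CUDA or cuSolver error.
#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); exit(1); } } while (0)
#define CUSOLVER_CHECK(call) do { cusolverStatus_t s_ = (call); \
    if (s_ != CUSOLVER_STATUS_SUCCESS) { \
    fprintf(stderr, "cuSolver error: %d\n", (int)s_); exit(1); } } while (0)

// Write 1.0 on the diagonal of a column-major n x n matrix (leading dimension n).
// Element (i, i) lives at offset i * n + i.
__global__ void set_diagonal(double* B, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) B[(size_t)i * n + i] = 1.0;
}

int main() {
    const int n = 2;
    // Illustrative input, column-major storage of [[4, 7], [2, 6]].
    const double h_A[n * n] = {4.0, 2.0, 7.0, 6.0};
    double h_X[n * n];
    const size_t bytes = sizeof(double) * n * n;

    double *d_A = nullptr, *d_B = nullptr, *d_work = nullptr;
    int *d_ipiv = nullptr, *d_info = nullptr;
    int lwork = 0, h_info = 0;

    CUDA_CHECK(cudaMalloc(&d_A, bytes));
    CUDA_CHECK(cudaMalloc(&d_B, bytes));
    CUDA_CHECK(cudaMalloc(&d_ipiv, sizeof(int) * n));
    CUDA_CHECK(cudaMalloc(&d_info, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));

    // B = I on the device: all-zero bytes are 0.0 in IEEE 754, then set the diagonal.
    CUDA_CHECK(cudaMemset(d_B, 0, bytes));
    set_diagonal<<<(n + 255) / 256, 256>>>(d_B, n);
    CUDA_CHECK(cudaGetLastError());

    cusolverDnHandle_t handle;
    CUSOLVER_CHECK(cusolverDnCreate(&handle));

    // LU factorization with partial pivoting; A is overwritten by its L and U factors.
    CUSOLVER_CHECK(cusolverDnDgetrf_bufferSize(handle, n, n, d_A, n, &lwork));
    CUDA_CHECK(cudaMalloc(&d_work, sizeof(double) * lwork));
    CUSOLVER_CHECK(cusolverDnDgetrf(handle, n, n, d_A, n, d_work, d_ipiv, d_info));
    CUDA_CHECK(cudaMemcpy(&h_info, d_info, sizeof(int), cudaMemcpyDeviceToHost));
    if (h_info != 0) { fprintf(stderr, "getrf failed, info = %d\n", h_info); return 1; }

    // Solve A X = I with n right-hand sides; B is overwritten by X = inverse of A.
    CUSOLVER_CHECK(cusolverDnDgetrs(handle, CUBLAS_OP_N, n, n, d_A, n, d_ipiv,
                                    d_B, n, d_info));
    CUDA_CHECK(cudaMemcpy(&h_info, d_info, sizeof(int), cudaMemcpyDeviceToHost));
    if (h_info != 0) { fprintf(stderr, "getrs failed, info = %d\n", h_info); return 1; }

    CUDA_CHECK(cudaMemcpy(h_X, d_B, bytes, cudaMemcpyDeviceToHost));

    // Print the result row by row (h_X is column-major).
    // The exact inverse of the input is [[0.6, -0.7], [-0.2, 0.4]].
    for (int r = 0; r < n; ++r) {
        for (int c = 0; c < n; ++c) printf("% .4f ", h_X[c * n + r]);
        printf("\n");
    }

    CUSOLVER_CHECK(cusolverDnDestroy(handle));
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_work); cudaFree(d_ipiv); cudaFree(d_info);
    return 0;
}
```

This should build with something like nvcc inverse.cu -lcusolver (the file name is arbitrary). Since getrs overwrites B in place, the only memory needed beyond A and its inverse is the getrf workspace and the pivot array.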