An iterative algorithm calls LAPACKE_sgelsd on each iteration with a single column of B. Subsequent calls often use the same A matrix. I believe caching or otherwise reusing intermediate results from the previous iteration, whenever A has not changed, would give a substantial performance improvement. The gain should be similar to what is possible when passing multiple columns of B in one call. Is that correct? How difficult would it be to implement, and how could it be done? The code uses OpenBLAS. Thank you.
Instead of caching intermediate results, you can compute the pseudo-inverse of A once and cache it for as long as A does not change. It can be computed using this approach, summarized as:
The result is then the pseudo-inverse multiplied by B.