I'm now working on two multi-dimensional arrays, arr and cost. arr is dense with size (width, height, m, n), and cost is sparse with size (width, height, width, height).

Values: width and height are around 5000, m is less than 100, and n is less than 10. The sparse cost is strictly block-sparse: in each cost[i,j,:,:], only a single k*k block is nonzero, where k is less than 50.
Now I want this:
result = np.tensordot(cost, arr, axes=[[2,3],[0,1]])
The size of result is then (width, height, m, n), the same as arr. However, cost is far too large to store as a dense array.

So, the question is: how can I do a fast tensor-dot (ideally on the GPU) with a sparse array? Possible solutions or ideas in any programming language are welcome, including but not limited to Python, C++, Julia, etc.
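To make "too large" concrete, a quick back-of-the-envelope estimate at the stated scale:

# Size of a dense float32 cost at width = height = 5000.
width = height = 5000
dense_cost_bytes = (width * height) ** 2 * 4
print(dense_cost_bytes / 1e15, "PB")   # ~2.5 petabytes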
I've tried:

CPU versions (multithreaded): C++ (simply with std::vector), numpy/scipy, Julia
GPU versions: cupy, CUDA.jl

The C++ version works well with some simple for-loops over std::vectors, but it is not fast enough. The cupy and CUDA.jl versions are very fast on small tests where width and height are set to small values and both cost and arr are defined as dense arrays, but I don't know how to adapt them to a sparse cost.
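For reference, such a small dense GPU test looks roughly like this (illustrative sizes, a sketch rather than the actual benchmark code):

import cupy as cp

width, height, m, n = 64, 64, 8, 4     # small enough for a dense cost to fit on the GPU
cost = cp.random.random((width, height, width, height)).astype(cp.float32)
arr = cp.random.random((width, height, m, n)).astype(cp.float32)

# Same contraction as the NumPy expression in the question, but on the GPU.
result = cp.tensordot(cost, arr, axes=[[2, 3], [0, 1]])
assert result.shape == (width, height, m, n)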
I finally solved this via cupy with some reshaping. I had already implemented the Julia tensor-dot with reshaping, but I hadn't realized that this reshaping is the key to turning the problem into a 2-D sparse-matrix product. Thanks to @hpaulj, who reminded me of this.
The minimal example is:
import cupy as cp
import numpy as np
from cupyx.scipy import sparse as sparse_gpu
from scipy import sparse as sparse_cpu

# Assumed defined elsewhere: DTYPE_NP / DTYPE_CP are the chosen float dtypes
# (e.g. np.float32 / cp.float32); width, height, nvar, nvalue and _lambda are
# problem parameters. sparse_cost2d_cpu is the cost tensor flattened to a
# scipy CSR matrix of shape (width*height, width*height):
# sparse_cost2d_cpu = get_cost_cpu(...)

# Copy the CSR buffers to the GPU and rebuild the sparse matrix there.
data = cp.asarray(sparse_cost2d_cpu.data, dtype=DTYPE_CP)
indices = cp.asarray(sparse_cost2d_cpu.indices)
indptr = cp.asarray(sparse_cost2d_cpu.indptr)
sparse_cost2d_gpu = sparse_gpu.csr_matrix((data, indices, indptr),
                                          shape=(width*height, width*height),
                                          dtype=DTYPE_CP)

# Flatten arr from (width, height, nvar, nvalue) to (width*height, nvar*nvalue)
# so the tensor-dot becomes an ordinary sparse-dense matrix product.
arr4d_cpu = np.random.random((width, height, nvar, nvalue)).astype(DTYPE_NP)
arr2d_cpu = arr4d_cpu.reshape((width*height, nvar*nvalue))
arr2d_gpu = cp.asarray(arr2d_cpu)

for i in range(100):
    new_arr2d_gpu = sparse_cost2d_gpu.dot(arr2d_gpu)
    arr2d_gpu = new_arr2d_gpu*_lambda + arr2d_gpu*(1 - _lambda)
    arr4d_gpu = arr2d_gpu.reshape((width, height, nvar, nvalue))
    # Some other computation with arr4d_gpu
    arr2d_gpu = arr4d_gpu.reshape((width*height, nvar*nvalue))
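As a sanity check on sizes small enough that a dense cost fits in memory, the reshaped sparse product can be compared with the original np.tensordot. This is an illustrative sketch (made-up sizes and a single fixed block position), not part of the solution code:

import numpy as np
from scipy import sparse

width, height, nvar, nvalue = 16, 16, 4, 2
cost4d = np.zeros((width, height, width, height), dtype=np.float32)
cost4d[:, :, :3, :3] = np.random.random((width, height, 3, 3)).astype(np.float32)  # one 3x3 block per (i, j)
arr4d = np.random.random((width, height, nvar, nvalue)).astype(np.float32)

# Reference: the dense tensordot from the question.
ref = np.tensordot(cost4d, arr4d, axes=[[2, 3], [0, 1]])

# Reshaped sparse version: cost as (W*H, W*H) CSR, arr as (W*H, nvar*nvalue).
cost2d = sparse.csr_matrix(cost4d.reshape(width * height, width * height))
out2d = cost2d.dot(arr4d.reshape(width * height, nvar * nvalue))
out4d = out2d.reshape(width, height, nvar, nvalue)

assert np.allclose(ref, out4d, rtol=1e-4)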