I'm writing an optimized Floyd-Warshall algorithm on the GPU using Numba. I need it to run in a few seconds on 10k×10k matrices; right now the processing takes around 60s. Here is my code:
    import math
    from numba import cuda

    def calculate_distance_to_all_gpu(matrix):
        # One thread per matrix cell, in 32x32 blocks.
        threadsperblock = (32, 32)
        blockspergrid_x = math.ceil(matrix.shape[0] / threadsperblock[0])
        blockspergrid_y = math.ceil(matrix.shape[1] / threadsperblock[1])
        blockspergrid = (blockspergrid_x, blockspergrid_y)
        calculate_distance_to_all_cuda[blockspergrid, threadsperblock](matrix)

    @cuda.jit
    def calculate_distance_to_all_cuda(matrix):
        i, j = cuda.grid(2)
        n = matrix.shape[0]
        if i < matrix.shape[0] and j < matrix.shape[1]:
            # prange is not supported in CUDA kernels, so each thread
            # loops over all intermediate vertices k sequentially.
            for k in range(n):
                if matrix[i, k] + matrix[k, j] < matrix[i, j]:
                    matrix[i, j] = matrix[i, k] + matrix[k, j]
To be honest, I'm pretty new to writing GPU code, so do you have any ideas how to make this faster? I also noticed that while processing, my GPU utilization only briefly peaks at 100% and then it stops being busy, so maybe the bottleneck is sending data from the CPU to the GPU? If so, is there any way to optimize that? Or should I use a different algorithm for this task?
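On the data-transfer side: passing a NumPy array straight into a kernel launch makes Numba copy it host-to-device before the launch and device-to-host afterwards, on every call. Making the copies explicit avoids repeating them and makes their cost visible. A minimal sketch reusing the kernel above; the size and dtype are placeholders, not from my real workload:

    import math
    import numpy as np
    from numba import cuda

    n = 1000
    # Placeholder input; float32 halves the transfer volume compared to float64.
    matrix = np.random.rand(n, n).astype(np.float32)

    threadsperblock = (32, 32)
    blockspergrid = (math.ceil(n / threadsperblock[0]),
                     math.ceil(n / threadsperblock[1]))

    d_matrix = cuda.to_device(matrix)  # one explicit host-to-device copy
    calculate_distance_to_all_cuda[blockspergrid, threadsperblock](d_matrix)
    cuda.synchronize()                 # kernel launches are asynchronous
    result = d_matrix.copy_to_host()   # one explicit device-to-host copy

For a single launch this saves little, but it matters as soon as the same data is touched by more than one launch.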
It turned out that my approach was wrong from the beginning, because you can't parallelize this algorithm in such a straightforward way: the loop over the intermediate vertex k has to be synchronized across all threads, and my kernel ran it independently inside each thread. Here is a nice article on how to do it properly, with code:
https://moorejs.github.io/APSP-in-parallel/#References
In a few days I'll rewrite this approach in Python/Numba and post it in a comment ;).
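In the meantime, here is a minimal sketch of the core idea from the article (the simple version, not the blocked/tiled one): keep the loop over k sequential on the host and parallelize only the relaxation of all (i, j) pairs for a fixed k, so that each kernel launch acts as a global synchronization barrier between k iterations. The names are mine, not from the article:

    import math
    from numba import cuda

    @cuda.jit
    def fw_step(matrix, k):
        # Relax every (i, j) pair through the single intermediate vertex k.
        i, j = cuda.grid(2)
        n = matrix.shape[0]
        if i < n and j < n:
            d = matrix[i, k] + matrix[k, j]
            if d < matrix[i, j]:
                matrix[i, j] = d

    def floyd_warshall_gpu(matrix):
        n = matrix.shape[0]
        threadsperblock = (32, 32)
        blockspergrid = (math.ceil(n / threadsperblock[0]),
                         math.ceil(n / threadsperblock[1]))
        d_matrix = cuda.to_device(matrix)  # copy to the GPU once
        for k in range(n):                 # sequential dependency stays on the host
            fw_step[blockspergrid, threadsperblock](d_matrix, k)
        return d_matrix.copy_to_host()     # copy the result back once

This is safe because, assuming the diagonal is 0, the entries in row k and column k cannot change during step k, so threads only read stable values. The blocked version in the article speeds this up further by exploiting shared memory.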