python, optimization, gpu, numba, floyd-warshall

Floyd-Warshall algorithm on GPU using numba


I'm writing an optimised Floyd-Warshall algorithm on the GPU using Numba. I need it to run in a few seconds on 10k × 10k matrices. Right now the processing takes around 60 s. Here is my code:

import math

from numba import cuda


def calculate_distance_to_all_gpu(matrix):
    threadsperblock = (32, 32)
    blockspergrid_x = math.ceil(matrix.shape[0] / threadsperblock[0])
    blockspergrid_y = math.ceil(matrix.shape[1] / threadsperblock[1])
    blockspergrid = (blockspergrid_x, blockspergrid_y)
    calculate_distance_to_all_cuda[blockspergrid, threadsperblock](matrix)


@cuda.jit
def calculate_distance_to_all_cuda(matrix):
    i, j = cuda.grid(2)

    N = len(matrix)
    # prange has no meaning inside a CUDA kernel, so a plain range is used
    # here; every thread walks the full k loop at the same time as the others.
    for k in range(N):
        if i < matrix.shape[0] and j < matrix.shape[1]:
            if matrix[i, k] + matrix[k, j] < matrix[i, j]:
                matrix[i, j] = matrix[i, k] + matrix[k, j]

To be honest I'm pretty new to writing scripts for the GPU, so do you have any ideas how to make this code faster? I also noticed that while processing, my GPU utilization only shows a short peak to 100% and then it stops being busy, so maybe the bottleneck is in sending data from the CPU to the GPU? If so, is there any way to optimize that transfer? Or should I use a different algorithm for this task?
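
One idea I had (untested, so the snippet below is just my reading of the Numba docs) is to copy the matrix to the device once with cuda.to_device and only copy it back at the end, so the kernel call doesn't silently transfer the whole array on every launch:

import numpy as np
from numba import cuda

matrix = np.full((10_000, 10_000), np.inf, dtype=np.float32)  # example input

d_matrix = cuda.to_device(matrix)        # one host -> device copy
calculate_distance_to_all_gpu(d_matrix)  # kernel works on the device array
matrix = d_matrix.copy_to_host()         # one device -> host copy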


Solution

  • It turned out that my approach was wrong from the beginning, because you can't parallelize this algorithm in a straightforward way. Here is a nice article on how to do it, with code:

    https://moorejs.github.io/APSP-in-parallel/#References

    In a few days I'll rewrite this approach in Python/Numba and post it in a comment ;).
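
    In the meantime, here is a minimal sketch of how I understand the fix (the names floyd_warshall_step and floyd_warshall_gpu are my own, and this is the simple per-k variant, not the blocked version from the article): the k loop has to stay sequential on the host, and each kernel launch only parallelizes over (i, j):

    import math

    from numba import cuda


    @cuda.jit
    def floyd_warshall_step(matrix, k):
        # Parallel over (i, j) only; k is fixed for this launch. With
        # non-negative weights, row k and column k don't change during
        # step k, so the threads don't race with each other.
        i, j = cuda.grid(2)
        if i < matrix.shape[0] and j < matrix.shape[1]:
            d = matrix[i, k] + matrix[k, j]
            if d < matrix[i, j]:
                matrix[i, j] = d


    def floyd_warshall_gpu(matrix):
        threadsperblock = (32, 32)
        blockspergrid = (math.ceil(matrix.shape[0] / threadsperblock[0]),
                         math.ceil(matrix.shape[1] / threadsperblock[1]))
        d_matrix = cuda.to_device(matrix)   # copy to the GPU once
        for k in range(matrix.shape[0]):    # the k loop must run sequentially
            floyd_warshall_step[blockspergrid, threadsperblock](d_matrix, k)
        return d_matrix.copy_to_host()      # copy back once at the end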