python-3.x, cuda, numba, pycuda, numba-pro

Why is Numba CUDA running slowly after calling it several times?


I am experimenting with how to use CUDA inside Numba. However, I have encountered something different from my expectations. Here is my code:

import numba
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    # This local array is declared but never used below.
    d = cuda.local.array((3, 3), dtype=numba.float64)
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

This is the matrix multiplication kernel I defined myself for testing numba.cuda. Before running the tests, I also loaded the arrays with the following code:

import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = np.empty((2000, 2000))

# Copy the arrays to the device once, up front.
a1 = cuda.to_device(a)
b1 = cuda.to_device(b)
c1 = cuda.to_device(c)
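
For reference, the launch configuration I use below, (256, 256) blocks of (16, 16) threads, more than covers the 2000x2000 output; the bounds check in the kernel guards the excess threads. A tighter configuration could be derived from the output shape like this (the variable names here are just illustrative):

import math

threadsperblock = (16, 16)
# Enough blocks to cover every element of the 2000x2000 output;
# ceil(2000 / 16) = 125, so (125, 125) blocks would already suffice.
blockspergrid = (math.ceil(c.shape[0] / threadsperblock[0]),
                 math.ceil(c.shape[1] / threadsperblock[1]))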

Then I used the following code for the experiment:

from time import time

count = 0
start = time()
for i in range(2000):
    matmul[(256, 256), (16, 16)](a1, b1, c1)
    count += 1
    print(count)
print("elapsed:", time() - start)  # total wall-clock time for the whole loop

The for loop ran very fast for the first 1028 iterations; however, it became very slow after the 1028th. What exactly caused this, and how do I fix it? I am running on Windows 10, by the way.

Here is my CUDA device information, queried through numba.cuda:

from numba import cuda
gpu = cuda.get_current_device()
print("name = %s" % gpu.name)
print("maxThreadsPerBlock = %s" % str(gpu.MAX_THREADS_PER_BLOCK))
print("maxBlockDimX = %s" % str(gpu.MAX_BLOCK_DIM_X))
print("maxBlockDimY = %s" % str(gpu.MAX_BLOCK_DIM_Y))
print("maxBlockDimZ = %s" % str(gpu.MAX_BLOCK_DIM_Z))
print("maxGridDimX = %s" % str(gpu.MAX_GRID_DIM_X))
print("maxGridDimY = %s" % str(gpu.MAX_GRID_DIM_Y))
print("maxGridDimZ = %s" % str(gpu.MAX_GRID_DIM_Z))
print("maxSharedMemoryPerBlock = %s" % 
str(gpu.MAX_SHARED_MEMORY_PER_BLOCK))
print("asyncEngineCount = %s" % str(gpu.ASYNC_ENGINE_COUNT))
print("canMapHostMemory = %s" % str(gpu.CAN_MAP_HOST_MEMORY))
print("multiProcessorCount = %s" % str(gpu.MULTIPROCESSOR_COUNT))
print("warpSize = %s" % str(gpu.WARP_SIZE))
print("unifiedAddressing = %s" % str(gpu.UNIFIED_ADDRESSING))
print("pciBusID = %s" % str(gpu.PCI_BUS_ID))
print("pciDeviceID = %s" % str(gpu.PCI_DEVICE_ID))

and the output is:

name = b'GeForce GTX 1050 Ti'
maxThreadsPerBlock = 1024
maxBlockDimX = 1024
maxBlockDimY = 1024
maxBlockDimZ = 64
maxGridDimX = 2147483647
maxGridDimY = 65535
maxGridDimZ = 65535
maxSharedMemoryPerBlock = 49152
asyncEngineCount = 2
canMapHostMemory = 1
multiProcessorCount = 6
warpSize = 32
unifiedAddressing = 1
pciBusID = 3
pciDeviceID = 0


Solution

  • This is caused by the asynchronous launch queue associated with GPU kernel launches.

    When you tell numba to submit a GPU kernel:

    matmul[(256,256),(16,16)](a1,b1,c1)
    

    This request goes into a queue, and the CPU thread (i.e. python) that issued that kernel call can continue, even though the GPU kernel has not completed or even started yet.

    The CUDA runtime queues up these requests and issues them as the GPU is ready for more work.

    What you are witnessing initially during the very fast incrementing of your for-loop is the queue filling up with work requests. That is not representative of the actual time that the GPU requires to perform the work.

    Eventually the queue fills, and the CUDA runtime halts the CPU thread (i.e. python) at the point of kernel launch, until a queue slot opens up. At that point, the for-loop is allowed to proceed for one more iteration. It is at this point (perhaps around 1028 iterations) that you start to see a "slow down". Thereafter, the for-loop proceeds at approximately the rate at which GPU kernels are executed and removed from the processing queue.
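
    You can make this visible by timing each launch individually. Here is a minimal sketch (reusing matmul, a1, b1, and c1 from the question; the 1 ms threshold is an arbitrary cutoff to separate "queueing" from "waiting"):

    from time import time
    for i in range(2000):
        t0 = time()
        matmul[(256,256),(16,16)](a1,b1,c1)
        dt = time() - t0
        # While the queue has room, a launch returns almost immediately.
        # Once the queue is full, each launch blocks until a slot opens.
        if dt > 0.001:
            print("iteration", i, "blocked for", dt, "seconds")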

    There is nothing to fix here; this is expected behavior.

    If you want the for-loop to proceed only at the rate at which GPU kernels are actually executed, then you should insert a synchronizing function in your for-loop.

    For example, numba provides numba.cuda.synchronize(). So if you modify your for-loop as follows:

    for i in range(2000):
      matmul[(256,256),(16,16)](a1,b1,c1)
      cuda.synchronize()
      count +=1
      print(count)
    

    You will see the for-loop proceed at the actual rate of GPU work completion, instead of the "queue-filling" rate.
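
    If your goal is to measure GPU execution time accurately, rather than just to pace the loop, CUDA events are a more precise alternative. Here is a minimal sketch using numba's event API (cuda.event, Event.record, Event.synchronize, and cuda.event_elapsed_time), reusing the kernel and device arrays from the question:

    start_evt = cuda.event(timing=True)
    end_evt = cuda.event(timing=True)

    start_evt.record()
    matmul[(256,256),(16,16)](a1,b1,c1)
    end_evt.record()
    end_evt.synchronize()  # block until the kernel and the end event complete

    # Elapsed GPU time, in milliseconds, between the two recorded events.
    print(cuda.event_elapsed_time(start_evt, end_evt), "ms")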