I am experimenting with how to use CUDA from Numba, but I have run into something that differs from my expectation. Here is my code:
import numba
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B."""
    d = cuda.local.array((3, 3), dtype=numba.float64)
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
This is the matrix multiplication kernel I defined myself for testing numba.cuda. Before running the tests, I also transferred the arrays to the device with the following code:
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = np.empty((2000, 2000))
a1 = cuda.to_device(a)
b1 = cuda.to_device(b)
c1 = cuda.to_device(c)
Then I used the following code for the experiment:
from time import time

count = 0
start = time()
for i in range(2000):
    matmul[(256, 256), (16, 16)](a1, b1, c1)
    count += 1
    print(count)
The for-loop ran very fast for the first 1028 iterations, but it became very slow after the 1028th. What exactly caused this, and how do I fix it? I am running on Windows 10, by the way.
Here is my CUDA device information, queried through numba.cuda:
from numba import cuda
gpu = cuda.get_current_device()
print("name = %s" % gpu.name)
print("maxThreadsPerBlock = %s" % str(gpu.MAX_THREADS_PER_BLOCK))
print("maxBlockDimX = %s" % str(gpu.MAX_BLOCK_DIM_X))
print("maxBlockDimY = %s" % str(gpu.MAX_BLOCK_DIM_Y))
print("maxBlockDimZ = %s" % str(gpu.MAX_BLOCK_DIM_Z))
print("maxGridDimX = %s" % str(gpu.MAX_GRID_DIM_X))
print("maxGridDimY = %s" % str(gpu.MAX_GRID_DIM_Y))
print("maxGridDimZ = %s" % str(gpu.MAX_GRID_DIM_Z))
print("maxSharedMemoryPerBlock = %s" %
str(gpu.MAX_SHARED_MEMORY_PER_BLOCK))
print("asyncEngineCount = %s" % str(gpu.ASYNC_ENGINE_COUNT))
print("canMapHostMemory = %s" % str(gpu.CAN_MAP_HOST_MEMORY))
print("multiProcessorCount = %s" % str(gpu.MULTIPROCESSOR_COUNT))
print("warpSize = %s" % str(gpu.WARP_SIZE))
print("unifiedAddressing = %s" % str(gpu.UNIFIED_ADDRESSING))
print("pciBusID = %s" % str(gpu.PCI_BUS_ID))
print("pciDeviceID = %s" % str(gpu.PCI_DEVICE_ID))
and the output is:
name = b'GeForce GTX 1050 Ti'
maxThreadsPerBlock = 1024
maxBlockDimX = 1024
maxBlockDimY = 1024
maxBlockDimZ = 64
maxGridDimX = 2147483647
maxGridDimY = 65535
maxGridDimZ = 65535
maxSharedMemoryPerBlock = 49152
asyncEngineCount = 2
canMapHostMemory = 1
multiProcessorCount = 6
warpSize = 32
unifiedAddressing = 1
pciBusID = 3
pciDeviceID = 0
This is caused by the asynchronous launch queue associated with GPU kernel launches.
When you tell numba to submit a GPU kernel:
matmul[(256,256),(16,16)](a1,b1,c1)
This request goes into a queue, and the CPU thread (i.e. python) that issued that kernel call can continue, even though the GPU kernel has not completed or even started yet.
The CUDA runtime queues up these requests and issues them as the GPU is ready for more work.
What you are witnessing initially during the very fast incrementing of your for-loop is the queue filling up with work requests. That is not representative of the actual time that the GPU requires to perform the work.
Eventually the queue fills, and the CUDA runtime halts the CPU thread (i.e. python) at the point of kernel launch, until a queue slot opens up. At that point, the for-loop is allowed to proceed for one more iteration. It is at this point (perhaps around 1028 iterations) that you start to see a "slow down". Thereafter, the for-loop proceeds at approximately the rate at which GPU kernels are executed and removed from the processing queue.
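The same producer/consumer dynamic can be illustrated with an ordinary bounded queue in pure Python (this is only an analogy for the launch queue, not CUDA itself; the queue size and sleep duration are made up for illustration): submissions return almost instantly until the queue is full, after which each submission blocks at the consumer's pace.

```python
import queue
import threading
import time

# A bounded queue standing in for the GPU's launch queue (analogy only).
q = queue.Queue(maxsize=8)

def consumer():
    # Simulates the GPU draining work at a fixed rate.
    while True:
        item = q.get()
        if item is None:
            break
        time.sleep(0.02)  # stand-in for kernel execution time
        q.task_done()

threading.Thread(target=consumer, daemon=True).start()

submit_times = []
t0 = time.perf_counter()
for i in range(24):
    q.put(i)  # blocks only once the queue is full
    submit_times.append(time.perf_counter() - t0)
q.put(None)  # tell the consumer to stop

# The first few submissions return almost immediately; later ones
# proceed at roughly the consumer's (i.e. "GPU's") rate.
fast = submit_times[7]           # time to queue the first 8 items
slow = submit_times[-1] - fast   # time for the remaining 16
print(f"first 8 queued in {fast:.3f}s, next 16 took {slow:.3f}s")
```

The transition from fast to slow submission in this toy model is exactly the "slow down" seen around iteration 1028 in the question.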
There is nothing to fix here; this is expected behavior.
If you want the for-loop to proceed only at the rate at which GPU kernels are actually executed, then you should insert a synchronizing function in your for-loop.
For example, numba provides numba.cuda.synchronize(). So if you modify your for-loop as follows:
for i in range(2000):
    matmul[(256, 256), (16, 16)](a1, b1, c1)
    cuda.synchronize()
    count += 1
    print(count)
You will see the for-loop proceed at the actual rate of GPU work completion, instead of the "queue-filling" rate.
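As a side note, if the goal is to measure how long the GPU work takes overall (rather than to pace the loop), a common pattern is to issue all launches and synchronize once at the end. This is a sketch only; it assumes a CUDA-capable GPU and the matmul, a1, b1, c1 definitions from the question:

```python
from time import time

from numba import cuda

start = time()
for i in range(2000):
    matmul[(256, 256), (16, 16)](a1, b1, c1)
cuda.synchronize()  # wait once, after all launches are queued
elapsed = time() - start
print(f"2000 launches completed in {elapsed:.2f}s")
```

Synchronizing once at the end keeps the launch pipeline full (launches overlap with execution), whereas calling cuda.synchronize() inside the loop serializes every kernel against the CPU thread.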