python, cuda, numba, numba-pro

A CUDA error when a large array is used as input data


I have code that does some calculations on the GPU using Python 3.5 with Numba and CUDA 8.0. When an array of size (50, 27) is used as input, it runs successfully and gives the right result. When I change the input data to size (200, 340), I get an error.

I use shared memory in my code. Is there not enough shared memory? Or are the grid and block sizes not good? I don't know how to identify the problem, or how to choose appropriate grid and block sizes.
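
For reference, here is a rough count of the static shared memory that the kernel below declares per block, assuming 4-byte float32 elements (the tuples are the same hsize, tsize and tpb used in the code):

# Rough count of static shared memory declared per block by the kernel below,
# assuming 4-byte float32 elements (hsize, tsize and tpb as in the code).
hbuf_elems = 1 * 4      # cuda.shared.array(hsize, float32)
sbuf_elems = 3 * 4      # cuda.shared.array(tsize, float32)
sl_elems   = 4 * 32     # cuda.shared.array(tpb, float32)

total_bytes = (hbuf_elems + sbuf_elems + sl_elems) * 4
print(total_bytes)      # 576 bytes, well under the 49152-byte per-block limit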

Even with a small grid size and block size, the error is the same.

What should I do to solve this problem? Thanks for any advice.

I have simplified my code below, and it produces the same error. The size of the input data can be set conveniently here: df = np.random.random_sample((300, 200)) + 10.

The code:

import os,sys,time,math
import pandas as pd
import numpy as np

from numba import cuda, float32

os.environ['NUMBAPRO_NVVM']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE']=r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

bpg = 8
tpb = (4,32) 

tsize = (3,4) 
hsize = (1,4)

@cuda.jit
def calcu_T(D, T):

    gw = cuda.gridDim.x
    bx = cuda.blockIdx.x
    tx = cuda.threadIdx.x
    bw = cuda.blockDim.x
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y

    c_num = D.shape[1]
    c_index = bx

    while c_index<c_num*c_num:
        c_x = int(c_index/c_num)
        c_y = c_index%c_num

        if c_x==c_y:
            T[c_x,c_y] = 0.0
        else:
            X = D[:,c_x]
            Y = D[:,c_y]

            hbuf = cuda.shared.array(hsize, float32)

            h = tx

            Xi = X[h:]
            Xi1 = X[:-h]
            Yih = Y[:-h]

            sbuf = cuda.shared.array(tsize, float32)

            L = len(Xi)

            #mean
            if ty==0:
                Xi_m = 0.0
                Xi1_m = 0.0
                Yih_m = 0.0
                for i in range(L):
                    Xi_m += Xi[i]
                    Xi1_m += Xi1[i]
                    Yih_m += Yih[i]
                Xi_m = Xi_m/L
                Xi1_m = Xi1_m/L
                Yih_m = Yih_m/L
                sbuf[0,tx] = Xi_m
                sbuf[1,tx] = Xi1_m
                sbuf[2,tx] = Yih_m

            cuda.syncthreads()

            sl = cuda.shared.array(tpb, float32)

            r_index = ty
            s_l = 0.0
            while r_index<L:
                s1 = 0.0
                for i in range(L):
                    s1 += (Xi[r_index]+Xi1[i])/sbuf[0,tx]

                s_l += s1
                r_index +=bh
            sl[tx,ty] = s_l
            cuda.syncthreads()

            #
            if ty==0:
                ht = 0.0
                for i in range(bh):
                    ht += sl[tx,i]
                hbuf[0,tx] = ht/L
            cuda.syncthreads()

            #max
            if tx==0 and ty==0:
                m_t = 0.0
                for index,ele in enumerate(hbuf[0]):
                    if index==0:
                        m_t = ele
                    elif ele>m_t:
                        m_t = ele

                T[c_x,c_y] = m_t

        c_index +=gw



df = np.random.random_sample((300, 200)) + 10
D = np.array(df, dtype=np.float32)
r,c = D.shape

T = np.empty([c,c])

dD = cuda.to_device(D)
dT = cuda.device_array_like(T)

calcu_T[bpg, tpb](dD,dT)
dT.copy_to_host(T)

The error:

Traceback (most recent call last):
      File "G:\myworkspace\python3.5\forte\forte170327\test10fortest8.py", line 118, in <module>
        dT.copy_to_host(T)
      File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\devicearray.py", line 198, in copy_to_host
        _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)
      File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\driver.py", line 1481, in device_to_host
        fn(host_pointer(dst), device_pointer(src), size, *varargs)
      File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\driver.py", line 259, in safe_cuda_api_call
        self._check_error(fname, retcode)
      File "D:\python3.5.3\lib\site-packages\numba\cuda\cudadrv\driver.py", line 296, in _check_error
        raise CudaAPIError(retcode, msg)
    numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
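
Note that Numba launches CUDA kernels asynchronously, so an error raised inside the kernel is normally only reported at the next synchronizing operation, which here is the device-to-host copy. A minimal sketch to make the failure surface right after the launch (reusing dD, dT, bpg and tpb from the code above):

from numba import cuda

calcu_T[bpg, tpb](dD, dT)
cuda.synchronize()      # block until the kernel finishes; a kernel-side error
                        # is reported here instead of at copy_to_host()
dT.copy_to_host(T)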

My device information:

Device 0: 
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147483648 bytes)
  ( 5) Multiprocessors, (128) CUDA Cores/MP:     640 CUDA Cores
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes

Solution

  • There is nothing wrong with your code. If I run your code on my GTX970, I get this:

    In [11]: main??
    Signature: main()
    Source:   
    def main():
    
        df = np.random.random_sample((300, 200)) + 10
        D = np.array(df, dtype=np.float32)
        r,c = D.shape
    
        T = np.empty([c,c])
    
        dD = cuda.to_device(D)
        dT = cuda.device_array_like(T)
    
        calcu_T[bpg, tpb](dD,dT)
        dT.copy_to_host(T)
    File:      ~/SO/crash.py
    Type:      function
    
    In [12]: %timeit -n 3 -r 3 main()
    3 loops, best of 3: 6.61 s per loop
    

    i.e. no runtime errors, but the Python code, including the kernel, takes 6.6 seconds to run. If I profile the code with the CUDA profiler:

    $ nvprof python crash.py
    
    ==13828== NVPROF is profiling process 13828, command: python crash.py
    All finished
    ==13828== Profiling application: python crash.py
    ==13828== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
    100.00%  6.59109s         1  6.59109s  6.59109s  6.59109s  cudapy::__main__::calcu_T$241(Array<float, int=2, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>)
      0.00%  26.271us         1  26.271us  26.271us  26.271us  [CUDA memcpy DtoH]
      0.00%  21.279us         1  21.279us  21.279us  21.279us  [CUDA memcpy HtoD]
    
    ==13828== API calls:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     98.51%  6.59118s         1  6.59118s  6.59118s  6.59118s  cuMemcpyDtoH
      1.42%  94.890ms         1  94.890ms  94.890ms  94.890ms  cuDevicePrimaryCtxRetain
      0.05%  3.4116ms         1  3.4116ms  3.4116ms  3.4116ms  cuModuleLoadDataEx
      0.01%  417.96us         1  417.96us  417.96us  417.96us  cuLinkCreate
      0.00%  227.57us         1  227.57us  227.57us  227.57us  cuLinkAddData
      0.00%  195.72us         2  97.859us  95.710us  100.01us  cuMemAlloc
      0.00%  190.10us         1  190.10us  190.10us  190.10us  cuLinkComplete
      0.00%  139.04us         1  139.04us  139.04us  139.04us  cuMemGetInfo
      0.00%  53.193us         1  53.193us  53.193us  53.193us  cuMemcpyHtoD
      0.00%  29.538us         1  29.538us  29.538us  29.538us  cuDeviceGetName
      0.00%  17.895us         1  17.895us  17.895us  17.895us  cuLaunchKernel
      0.00%  2.0250us         1  2.0250us  2.0250us  2.0250us  cuCtxPushCurrent
      0.00%  2.0150us         5     403ns     255ns     752ns  cuFuncGetAttribute
      0.00%  1.6260us         2     813ns     547ns  1.0790us  cuDeviceGetCount
      0.00%  1.1430us         1  1.1430us  1.1430us  1.1430us  cuModuleGetFunction
      0.00%     951ns         2     475ns     372ns     579ns  cuDeviceGet
      0.00%     796ns         1     796ns     796ns     796ns  cuLinkDestroy
      0.00%     787ns         1     787ns     787ns     787ns  cuDeviceComputeCapability
    

    you can see that the kernel you have posted takes 6.5 seconds to run.

    You have provided no details, but I will guess that you are running on Windows (the paths in your traceback suggest as much), that your GPU is a display GPU, and that your code runs slowly enough that it is hitting the WDDM display manager watchdog timeout limit. This is extremely well documented and has been asked about literally hundreds of times before -- for example here.
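
    One way to check that guess is to time the kernel with an explicit synchronization. A minimal sketch (the timings in the comments are assumptions based on the profile above and the default TDR delay of roughly 2 seconds):

    import time
    from numba import cuda

    t0 = time.perf_counter()
    calcu_T[bpg, tpb](dD, dT)
    cuda.synchronize()                  # wait for the kernel itself to finish
    elapsed = time.perf_counter() - t0
    # On a non-display GPU this reports roughly the 6.5 s seen in the profile;
    # on a WDDM display GPU the synchronize raises the error after the watchdog
    # (default delay of about 2 seconds) has reset the device.
    print(elapsed)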

    Your search engine of choice and the CUDA Windows getting started guide will give you information on your alternatives for improving the situation from an operating system and hardware point of view. The most obvious fix, however, is simply to improve your code so that it runs faster.
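
    As one hedged example of what "improve your code" can look like without changing the kernel body at all: the kernel strides c_index by gridDim.x, but the posted launch uses only bpg = 8 blocks of 4x32 threads, so very little of the GPU is occupied. Launching roughly one block per column pair (a hypothetical configuration, not taken from the original post) lets many more pairs be processed concurrently and should shorten each kernel run considerably:

    # Hypothetical launch configuration (not from the original post): one block
    # per column pair instead of 8 blocks striding over all 200*200 pairs.
    c = D.shape[1]
    bpg = c * c              # 40000 blocks, well within the grid x-dimension limit
    tpb = (4, 32)            # unchanged

    calcu_T[bpg, tpb](dD, dT)
    dT.copy_to_host(T)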