Tags: cuda, cupy

Access a CUDA array in CuPy using a pointer from C++


I'm trying to allocate a CUDA array (as in, texture memory) in C++ and pass the pointer up to CuPy. From there, I would like to treat it as an ndarray.

Many examples show how to cudaMalloc() linear memory in C++ and share it using cupy.cuda.UnownedMemory(), and I have gotten that to work (a sketch follows below). However, when I try to send across a pointer to a CUDA array allocated with cudaMallocArray(), I get a cudaErrorIllegalAddress.
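For reference, this is the linear-memory pattern that does work for me. A minimal sketch, where get_linear_ptr() is a hypothetical export that returns the integer value of a cudaMalloc()'d device pointer to 1024*1024 float32 values:

import cupy as cp

# get_linear_ptr() is hypothetical: any C++ export returning the integer
# value of a device pointer obtained from cudaMalloc().
ptr = get_linear_ptr()

mem = cp.cuda.UnownedMemory(ptr, 1024 * 1024 * 4, owner=None, device_id=0)
memptr = cp.cuda.MemoryPointer(mem, offset=0)
arr = cp.ndarray((1024, 1024), dtype=cp.float32, memptr=memptr)  # works fine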

The cupy docs on external memory do not mention any limitations, and a CUDA array is "CUDA memory that was allocated in other applications", right? However, the cupy ndarray docs do say "pointer to the head of the memory", which suspiciously sounds like opaque memory structures won't work.

In this question, someone clarifies that a cudaArray object isn't a direct pointer in the sense of "a starting location in memory", but rather a handle to the GPU's internal tracking of the memory. Again, I'm inclined to think I will have to cudaMemcpy2DFromArray() into linear memory in C++ before transferring it up to CuPy.

However, this comment in the cupy source confuses me.

cdef class CUDAarray:
    ...
    # TODO(leofang): perhaps this wrapper is not needed when cupy.ndarray
    # can be backed by texture memory/CUDA arrays?
    ...

Is the TODO saying the wrapper can be deleted because this already works, or is it a note to change the wrapper if cupy.ndarray ever gains support for CUDA arrays?

As for other methods, I can instantiate a cupy.cuda.texture.CUDAarray(), but I can't replace its ptr attribute with my own: it's read-only (and that would be sloppy anyway; I'd have to free the old allocation, and I don't want CuPy calling free before the external code is done with the memory). A sketch of that attempt is below.
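For illustration, this is roughly what that attempt looks like; CUDAarray allocates its own backing array, and assigning to ptr fails:

import cupy as cp
from cupy.cuda import runtime, texture

# 32-bit float channel, matching the allocation in the C++ code below
desc = texture.ChannelFormatDescriptor(
    32, 0, 0, 0, runtime.cudaChannelFormatKindFloat)
cuda_arr = texture.CUDAarray(desc, 1024, 1024)  # allocates its own cudaArray

cuda_arr.ptr = 12345  # AttributeError: ptr is read-only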

I also looked for ways to pass a whole texture object to CuPy, but that path just wraps the same CUDAarray class inside a ResourceDescriptor, so it has the same problem.

What's the best way to do this? Is cudaMemcpy2DFromArray() my only option? It works, but is there a performance penalty?

Here is the example C++ function I'm using to test:

#include <cstdint>
#include <cstdlib>
#include <cuda_runtime.h>

// Yes, this is global for the DLL. It stays alive until the DLL exit is
// called, when the Python process exits.
cudaArray_t cuArray = NULL;

extern "C"
uintptr_t __declspec(dllexport) __stdcall getCudaArray()
{
    if (cuArray != NULL) {
        return reinterpret_cast<uintptr_t>(cuArray);
    }

    const int height = 1024;
    const int width = 1024;

    // Allocate and set some host data
    float* h_data = (float*)std::malloc(sizeof(float) * width * height);
    for (int i = 0; i < height * width; ++i)
        h_data[i] = i + 0.1f;

    // Allocate CUDA array in device memory
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaMallocArray(&cuArray, &channelDesc, width, height);

    // Pitch of the source, i.e. the width of a row in bytes (no padding)
    const size_t srcpitch = width * sizeof(float);

    // Copy data located at address h_data in host memory to device memory
    cudaMemcpy2DToArray(cuArray, 0, 0, h_data, srcpitch,
                        width * sizeof(float), height, cudaMemcpyHostToDevice);
    std::free(h_data);

    // Return the cudaArray_t handle as an integer
    return reinterpret_cast<uintptr_t>(cuArray);
}

extern "C"
uint64_t __declspec(dllexport) __stdcall clearCudaArray()
{
    cudaFreeArray(cuArray);
    cuArray = NULL;
    return 0;
}
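(A minimal sketch of what the ShTexDll wrapper might look like with ctypes; the DLL path is an assumption. Note that restype must be set explicitly, because ctypes defaults to a 32-bit int and would truncate the returned pointer.)

import ctypes

# Hypothetical loader for the DLL above; adjust the path to your build output.
_dll = ctypes.WinDLL("ShTexDll.dll")  # WinDLL matches the __stdcall exports

_dll.getCudaArray.argtypes = []
_dll.getCudaArray.restype = ctypes.c_uint64  # avoid truncating the 64-bit value
_dll.clearCudaArray.argtypes = []
_dll.clearCudaArray.restype = ctypes.c_uint64

def getCudaArray():
    return _dll.getCudaArray()

def clearCudaArray():
    return _dll.clearCudaArray()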

And this is an example Python script; ShTexDll is the library I am writing:

import ShTexDll
import cupy as cp

ptr = ShTexDll.getCudaArray()

mem = cp.cuda.UnownedMemory(ptr, 4*1024*1024, owner=None, device_id=0)
memptr = cp.cuda.MemoryPointer(mem, offset=0)

# Row-major 1024x1024 float32: strides are (row bytes, element bytes)
arr = cp.ndarray((1024, 1024), dtype=cp.float32, memptr=memptr, strides=(1024*4, 4))

print(arr)

ShTexDll.clearCudaArray()

The error I get when trying to access arr:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cupy\_core\core.pyx", line 1741, in cupy._core.core._ndarray_base.__str__
  File "cupy\_core\core.pyx", line 1854, in cupy._core.core._ndarray_base.get
  File "cupy\cuda\memory.pyx", line 586, in cupy.cuda.memory.MemoryPointer.copy_to_host_async
  File "cupy_backends\cuda\api\runtime.pyx", line 606, in cupy_backends.cuda.api.runtime.memcpyAsync
  File "cupy_backends\cuda\api\runtime.pyx", line 146, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorIllegalAddress: an illegal memory access was encountered

Solution

  • As answered on GitHub by leofang:

    there's no way to address a CUDA Array in a linear fashion, i.e. pointer arithmetics does not work. So I don't think it is reasonable to use it to back an ndarray, which requires the underlying memory to be addressable via pointers (with proper offsets).

    I was mistaken and hopeful, it seems. If you have a CUDA array and want to pass it up to Python to use as an ndarray, the solution is to first cudaMemcpy it into a linear structure on the C++ side.

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <iostream>
    #include <cuda_runtime.h>

    // Standard CUDA error-checking helper
    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, const char* file, int line)
    {
        if (code != cudaSuccess)
            fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    }

    float* d_BL = NULL;
    cudaArray_t cuArrayL = NULL;

    extern "C"
    uintptr_t __declspec(dllexport) __stdcall getCudaArrayAsLinear()
    {
        // Return the linear pointer (not the array handle) on repeat calls
        if (d_BL != NULL) {
            return reinterpret_cast<uintptr_t>(d_BL);
        }

        const int height = 1024;
        const int width = 1024;

        // Allocate and set some host data
        float* h_data = (float*)std::malloc(sizeof(float) * width * height);
        for (int i = 0; i < height * width; ++i)
            h_data[i] = i + 0.1f;

        // Allocate CUDA array in device memory
        cudaChannelFormatDesc channelDesc =
            cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
        cudaMallocArray(&cuArrayL, &channelDesc, width, height);

        // Allocate linear device memory as the copy destination
        cudaMalloc((void**)&d_BL, sizeof(float) * width * height);

        std::cout << h_data[1234] << std::endl;

        // Pitch of the source (the width in memory in bytes of the 2D array
        // pointed to by src, including padding); we don't have any padding
        const size_t spitch = width * sizeof(float);
        // Copy data located at address h_data in host memory to device memory
        cudaMemcpy2DToArray(cuArrayL, 0, 0, h_data, spitch, width * sizeof(float),
            height, cudaMemcpyHostToDevice);
        std::free(h_data);

        // Copy from the CUDA array into the linear device buffer
        gpuErrchk(cudaMemcpy2DFromArray(d_BL, sizeof(float) * width, cuArrayL,
            0, 0, sizeof(float) * width, height, cudaMemcpyDeviceToDevice));

        return reinterpret_cast<uintptr_t>(d_BL);
    }

    extern "C"
    uint64_t __declspec(dllexport) __stdcall clearCudaArrayAsLinear()
    {
        cudaFreeArray(cuArrayL);
        cudaFree(d_BL);
        d_BL = NULL;
        cuArrayL = NULL;
        return 0;
    }
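
    On the Python side, the original script then works against the linear pointer. A sketch, assuming ShTexDll exports the two functions above:

    import ShTexDll
    import cupy as cp

    ptr = ShTexDll.getCudaArrayAsLinear()

    # 1024*1024 float32 values = 4 MiB of linear device memory
    mem = cp.cuda.UnownedMemory(ptr, 4 * 1024 * 1024, owner=None, device_id=0)
    memptr = cp.cuda.MemoryPointer(mem, offset=0)
    arr = cp.ndarray((1024, 1024), dtype=cp.float32, memptr=memptr)

    print(arr)  # no illegal access this time
    ShTexDll.clearCudaArrayAsLinear()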