Tags: numpy, memory-management, cuda, chainer, cupy

How to use CUDA pinned "zero-copy" memory for a memory mapped file?


Objective/Problem

In Python, I am looking for a fast way to read/write data from a memory mapped file to a GPU.

In a previous SO post [ Cupy OutOfMemoryError when trying to cupy.load larger dimension .npy files in memory map mode, but np.load works fine ], it was mentioned that this is possible using CUDA pinned "zero-copy" memory. Furthermore, it seems that this method was developed in this post [ cuda - Zero-copy memory, memory-mapped file ], though that person was working in C++.

My previous attempts have been with Cupy, but I am open to any cuda methods.

What I have tried so far

As mentioned, I tried to use Cupy, which allows you to open numpy files in memory-mapped mode.

import os
import numpy as np
import cupy

#Create .npy files. 
for i in range(4):
    numpyMemmap = np.memmap( 'reg.memmap'+str(i), dtype='float32', mode='w+', shape=( 2200000 , 512))
    np.save( 'reg.memmap'+str(i) , numpyMemmap )
    del numpyMemmap
    os.remove( 'reg.memmap'+str(i) )

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    NPYmemmap.append( np.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )
del NPYmemmap

# Eventually results in memory error. 
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )

Result of what I have tried

My attempt resulted in an OutOfMemoryError.

It was mentioned that

it appears that cupy.load will require that the entire file fit first in host memory, then in device memory.

And it was also mentioned that

CuPy can't handle mmap memory. So, CuPy uses GPU memory directly by default. https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.MemoryPool.html#cupy.cuda.MemoryPool.malloc You can change the default memory allocator if you want to use Unified Memory.

I tried using

cupy.cuda.set_allocator(cupy.cuda.MemoryPool(cupy.cuda.memory.malloc_managed).malloc)

But this didn't seem to make a difference. At the time of the error, my CPU RAM usage was at ~16 GB, but my GPU RAM usage was at 0.32 GB. I am using Google Colab, where I have 25 GB of CPU RAM and 12 GB of GPU RAM. So it looks like after the entire file was loaded into host memory, cupy checked whether it could also fit in device memory, and when it saw that it only had 12 of the required ~16 GB, it threw an error (my best guess).
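A quick back-of-the-envelope check supports that guess: the four arrays together total roughly 16.8 GiB, which fits in 25 GB of host RAM but not in 12 GiB of device memory.

```python
# Each file: 2,200,000 rows x 512 columns of 4-byte float32 values.
bytes_per_file = 2200000 * 512 * 4      # 4,505,600,000 bytes (~4.5 GB)
total_bytes = 4 * bytes_per_file        # four such files

print(round(total_bytes / 2**30, 1))  # 16.8 (GiB) -- more than the 12 GiB of GPU RAM
```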

So, now I am trying to figure out a way to use pinned 'zero-copy' memory to handle a memory mapped file which would feed data to the GPU.

If it matters, the type of data I am trying to transfer is floating-point arrays. Normally, for read-only data, binary files are loaded into GPU memory, but I am working with data I am trying to both read and write at every step.
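For context, this is not the GPU path, but the read/write-at-every-step pattern on the host side with numpy memory maps looks like the sketch below (using a hypothetical small file `demo.npy` purely for illustration):

```python
import os
import numpy as np

# Hypothetical small file just to illustrate the read/write pattern.
np.save('demo.npy', np.zeros((4, 4), dtype='float32'))

# mmap_mode='r+' gives a writable array view backed by the file on disk.
m = np.load('demo.npy', mmap_mode='r+')
m[0, 0] = 1.0
m.flush()   # push the modification back to the file
del m

# Reloading shows the write persisted.
value = float(np.load('demo.npy')[0, 0])
os.remove('demo.npy')
print(value)  # 1.0
```

This is the behavior I would like to keep while also feeding the data to the GPU.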


Solution

  • It appears to me that currently, cupy doesn't offer a pinned allocator that can be used in place of the usual device memory allocator, i.e. one that could be used as the backing for cupy.ndarray. If this is important to you, you might consider filing a cupy issue.

    However, it seems like it may be possible to create one. This should be considered experimental code. And there are some issues associated with its use.

    The basic idea is that we will replace cupy's default device memory allocator with our own, using cupy.cuda.set_allocator as was already suggested to you. We will need to provide our own replacement for the BaseMemory class that is used as the repository for cupy.cuda.memory.MemoryPointer. The key difference here is that we will use a pinned memory allocator instead of a device allocator. This is the gist of the PMemory class below.

    A few other caveats apply; they are discussed after the code.

    Here's an example:

    import os
    import numpy as np
    import cupy

    # A BaseMemory subclass backed by pinned (page-locked) host memory
    # instead of device memory.
    class PMemory(cupy.cuda.memory.BaseMemory):
        def __init__(self, size):
            self.size = size
            self.device_id = cupy.cuda.device.get_device_id()
            self.ptr = 0
            if size > 0:
                self.ptr = cupy.cuda.runtime.hostAlloc(size, 0)
        def __del__(self):
            if self.ptr:
                cupy.cuda.runtime.freeHost(self.ptr)

    def my_pinned_allocator(bsize):
        return cupy.cuda.memory.MemoryPointer(PMemory(bsize), 0)

    cupy.cuda.set_allocator(my_pinned_allocator)
    
    # Create 4 .npy files, ~4GB each.
    for i in range(4):
        print(i)
        numpyMemmap = np.memmap( 'reg.memmap'+str(i), dtype='float32', mode='w+', shape=( 10000000 , 100))
        np.save( 'reg.memmap'+str(i) , numpyMemmap )
        del numpyMemmap
        os.remove( 'reg.memmap'+str(i) )
    
    # Check if they load correctly with np.load.
    NPYmemmap = []
    for i in range(4):
        print(i)
        NPYmemmap.append( np.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )
    del NPYmemmap
    
    # allocate pinned memory storage
    CPYmemmap = []
    for i in range(4):
        print(i)
        CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )
    cupy.cuda.set_allocator(None)  # stop using the pinned allocator for subsequent allocations
    

    I haven't tested this in a setup with 25GB of host memory with these file sizes. But I have tested it with other file sizes that exceed the device memory of my GPU, and it seems to work.

    Again, this is experimental code, not thoroughly tested; your mileage may vary, and it would be better to attain this functionality by filing cupy github issues. And, as I've mentioned previously, this sort of "device memory" will generally be much slower to access from device code than ordinary cupy device memory.

    Finally, this is not really a "memory mapped file" as all the file contents will be loaded into host memory, and furthermore, this methodology "uses up" host memory. If you have 20GB of files to access, you will need more than 20GB of host memory. As long as you have those files "loaded", 20GB of host memory will be in use.

    UPDATE: cupy provides support for pinned allocators now, see here. This answer should only be used for historical reference.