I am trying to understand how CuPy handles memory, specifically the difference between used_bytes and total_bytes as shown here.
I have a simple script that either allocates an array directly on the device, or first allocates it on the host and then moves it to the device. In both cases, used_bytes is far less than total_bytes, and, more importantly, the application terminates when total_bytes reaches the memory pool limit.
Source code
(venv) [aditya@node01 cupy]$ cat example1.py
import cupy as cp
import numpy as np
import time
import argparse

mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()

# Manual setting of pool limit
with cp.cuda.Device(0):
    mempool.set_limit(size=39 * 1024**3)

parser = argparse.ArgumentParser(description='Array size')
parser.add_argument('-x', type=int, help='Size of x dimension')
parser.add_argument('-y', type=int, help='Size of y dimension')
parser.add_argument('-gpufirst', default=False, type=lambda x: (str(x).lower() == 'true'), help='Direct GPU allocation first')
args = parser.parse_args()

B_GB = 1024**3

def direct_gpu(x, y):
    # Allocate the array directly on the device
    print("check direct device allocation")
    direct_gpu = cp.arange(x*y).reshape(x, y).astype(cp.float32)
    print("used:", mempool.used_bytes()/B_GB, "total:", mempool.total_bytes()/B_GB, "limit:", mempool.get_limit()/B_GB, "nfreeblocks:", pinned_mempool.n_free_blocks())
    direct_gpu = None

def direct_cpu(x, y):
    # Allocate the array in host memory, then move it to the device
    print("check moving array from host to device")
    direct_cpu = np.arange(x*y).reshape(x, y).astype(np.float32)
    copied_gpu = cp.asarray(direct_cpu)
    print("used:", mempool.used_bytes()/B_GB, "total:", mempool.total_bytes()/B_GB, "limit:", mempool.get_limit()/B_GB, "nfreeblocks:", pinned_mempool.n_free_blocks())
    direct_cpu = None
    copied_gpu = None

print("gpufirst:", args.gpufirst)
if args.gpufirst:
    direct_gpu(args.x, args.y)
    direct_cpu(args.x, args.y)
else:
    direct_cpu(args.x, args.y)
    direct_gpu(args.x, args.y)
Usage
(venv) [aditya@node01 cupy]$ python example1.py -x 150000 -y 20000 -gpufirst true
gpufirst: True
check direct device allocation
used: 11.175870895385742 total: 33.52761268615723 limit: 39.0 nfreeblocks: 0
check moving array from host to device
used: 11.175870895385742 total: 33.52761268615723 limit: 39.0 nfreeblocks: 0
Since used memory in the above example is 11.17 GB and my memory pool limit is 39 GB, I should be able to increase the array size in x from 150,000 to 200,000, i.e. a 33% increase in used_bytes. However, if total_bytes is what is checked against the threshold, then I would exceed the 39 GB limit, which is what happens. Why does total_bytes matter here? Shouldn't used_bytes be the critical parameter?
(venv) [aditya@node01 cupy]$ python example1.py -x 200000 -y 20000 -gpufirst true
gpufirst: True
check direct device allocation
Traceback (most recent call last):
File "/home/aditya/Downloads/Tickets/cupy/example1.py", line 39, in <module>
direct_gpu(args.x,args.y)
File "/home/aditya/Downloads/Tickets/cupy/example1.py", line 24, in direct_gpu
direct_gpu = cp.arange(x*y).reshape(x,y).astype(cp.float32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/_core/core.pyx", line 565, in cupy._core.core._ndarray_base.astype
File "cupy/_core/core.pyx", line 623, in cupy._core.core._ndarray_base.astype
File "cupy/_core/core.pyx", line 151, in cupy._core.core.ndarray.__new__
File "cupy/_core/core.pyx", line 239, in cupy._core.core._ndarray_base._init
File "cupy/cuda/memory.pyx", line 738, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1424, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1445, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1116, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1137, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 1344, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
File "cupy/cuda/memory.pyx", line 1356, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 16,000,000,000 bytes (allocated so far: 32,000,000,000 bytes, limit set to: 41,875,931,136 bytes).
Intermediate allocations need to be taken into consideration. In this line:
direct_gpu = cp.arange(x*y).reshape(x,y).astype(cp.float32)
1. cp.arange(x*y) creates an array, which requires x * y * sizeof(int64) bytes of GPU memory.
2. .astype(cp.float32) creates another array, which requires a further x * y * sizeof(float32) bytes.
3. The array created in 1. is then destroyed.
So at the point of 2., you need x * y * 12 bytes of GPU memory at the same time. With x = 200,000 and y = 20,000 that is 32 GB for the still-alive int64 array plus another 16 GB for the float32 copy, which exceeds your 39 GiB limit, exactly as the traceback reports. After 3., the int64 buffer is not returned to the device but kept cached in the memory pool, which is why total_bytes stays far above used_bytes in your successful run. Use cp.arange(x*y, dtype=cp.float32) instead to avoid the int64 intermediate entirely.
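As a rough sketch (the 39 GiB limit and the 200,000 x 20,000 shape are just taken from your script), this is what that suggestion looks like, together with free_all_blocks() to return the cached blocks to the device so that total_bytes drops back towards used_bytes:

import cupy as cp

mempool = cp.get_default_memory_pool()
mempool.set_limit(size=39 * 1024**3)  # same 39 GiB limit as in your script

x, y = 200_000, 20_000

# Allocate float32 directly: peak usage is x*y*4 bytes (~16 GB),
# instead of x*y*12 bytes when going through the int64 intermediate.
a = cp.arange(x * y, dtype=cp.float32).reshape(x, y)
print("used :", mempool.used_bytes() / 1024**3)
print("total:", mempool.total_bytes() / 1024**3)

# Dropping the array only marks its block as free inside the pool;
# total_bytes still counts it until the cached blocks are released.
a = None
mempool.free_all_blocks()
print("total after free_all_blocks:", mempool.total_bytes() / 1024**3)

With only the float32 array ever allocated, used_bytes and total_bytes should stay close to each other (roughly 14.9 GiB for this shape), well under the limit.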