I am trying to understand how CuPy handles memory, specifically the difference between used_bytes and total_bytes as shown here.
I have a simple script that either allocates an array directly on the device, or first allocates it on the host and then moves it to the device. In both cases, used_bytes is far less than total_bytes, and, more importantly, the application terminates when total_bytes reaches the memory pool limit.
Source code
(venv) [aditya@node01 cupy]$ cat example1.py
import cupy as cp
import numpy as np
import time
import argparse

mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()

# Manual setting of pool limit
with cp.cuda.Device(0):
    mempool.set_limit(size=39 * 1024**3)

parser = argparse.ArgumentParser(description='Array size')
parser.add_argument('-x', type=int, help='Size of x dimension')
parser.add_argument('-y', type=int, help='Size of y dimension')
parser.add_argument('-gpufirst', default=False, type=lambda x: (str(x).lower() == 'true'), help='Direct GPU allocation first')
args = parser.parse_args()

B_GB = 1024**3

def direct_gpu(x, y):
    # Allocate the array directly on the device
    print("check direct device allocation")
    direct_gpu = cp.arange(x*y).reshape(x, y).astype(cp.float32)
    print("used:", mempool.used_bytes()/B_GB, "total:", mempool.total_bytes()/B_GB, "limit:", mempool.get_limit()/B_GB, "nfreeblocks:", pinned_mempool.n_free_blocks())
    direct_gpu = None

def direct_cpu(x, y):
    # Allocate the array in host memory, then move it to the device
    print("check moving array from host to device")
    direct_cpu = np.arange(x*y).reshape(x, y).astype(np.float32)
    copied_gpu = cp.asarray(direct_cpu)
    print("used:", mempool.used_bytes()/B_GB, "total:", mempool.total_bytes()/B_GB, "limit:", mempool.get_limit()/B_GB, "nfreeblocks:", pinned_mempool.n_free_blocks())
    direct_cpu = None
    copied_gpu = None

print("gpufirst:", args.gpufirst)
if args.gpufirst:
    direct_gpu(args.x, args.y)
    direct_cpu(args.x, args.y)
else:
    direct_cpu(args.x, args.y)
    direct_gpu(args.x, args.y)
Usage
(venv) [aditya@node01 cupy]$ python example1.py -x 150000 -y 20000 -gpufirst true
gpufirst: True
check direct device allocation
used: 11.175870895385742 total: 33.52761268615723 limit: 39.0 nfreeblocks: 0
check moving array from host to device
used: 11.175870895385742 total: 33.52761268615723 limit: 39.0 nfreeblocks: 0
Since used memory in the above example is 11.17 GB and my memory pool limit is 39 GB, I should be able to increase the array size in x from 150,000 to 200,000, i.e. a 33% increase in used_bytes. However, if total_bytes is what is checked against the threshold, then I would exceed the 39 GB limit, which is what happens. Why does total_bytes matter here? Shouldn't used_bytes be the critical parameter?
(venv) [aditya@node01 cupy]$ python example1.py -x 200000 -y 20000 -gpufirst true
gpufirst: True
check direct device allocation
Traceback (most recent call last):
File "/home/aditya/Downloads/Tickets/cupy/example1.py", line 39, in <module>
direct_gpu(args.x,args.y)
File "/home/aditya/Downloads/Tickets/cupy/example1.py", line 24, in direct_gpu
direct_gpu = cp.arange(x*y).reshape(x,y).astype(cp.float32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/_core/core.pyx", line 565, in cupy._core.core._ndarray_base.astype
File "cupy/_core/core.pyx", line 623, in cupy._core.core._ndarray_base.astype
File "cupy/_core/core.pyx", line 151, in cupy._core.core.ndarray.__new__
File "cupy/_core/core.pyx", line 239, in cupy._core.core._ndarray_base._init
File "cupy/cuda/memory.pyx", line 738, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1424, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1445, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1116, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1137, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 1344, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
File "cupy/cuda/memory.pyx", line 1356, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 16,000,000,000 bytes (allocated so far: 32,000,000,000 bytes, limit set to: 41,875,931,136 bytes).
Intermediate allocations need to be taken into consideration. In this line:
direct_gpu = cp.arange(x*y).reshape(x,y).astype(cp.float32)
1. cp.arange(x*y) creates an array, which requires x * y * sizeof(int64) bytes of GPU memory.
2. .astype(cp.float32) creates another array, which requires a further x * y * sizeof(float32) bytes.
3. The array created in 1. is then destroyed.
So at the point of 2., you need x * y * 12 bytes of GPU memory at the same time. With x = 200,000 and y = 20,000 that is 32 GB for the still-alive int64 array plus another 16 GB for the float32 copy, which exceeds your 39 GiB limit, exactly as the traceback reports. After 3., the int64 buffer is not returned to the device but kept cached in the memory pool, which is why total_bytes stays far above used_bytes in your successful run. Use cp.arange(x*y, dtype=cp.float32) instead to avoid the int64 intermediate entirely.
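As a rough sketch (the 39 GiB limit and the 200,000 x 20,000 shape are just taken from your script), this is what that suggestion looks like, together with free_all_blocks() to return the cached blocks to the device so that total_bytes drops back towards used_bytes:

import cupy as cp

mempool = cp.get_default_memory_pool()
mempool.set_limit(size=39 * 1024**3)  # same 39 GiB limit as in your script

x, y = 200_000, 20_000

# Allocate float32 directly: peak usage is x*y*4 bytes (~16 GB),
# instead of x*y*12 bytes when going through the int64 intermediate.
a = cp.arange(x * y, dtype=cp.float32).reshape(x, y)
print("used :", mempool.used_bytes() / 1024**3)
print("total:", mempool.total_bytes() / 1024**3)

# Dropping the array only marks its block as free inside the pool;
# total_bytes still counts it until the cached blocks are released.
a = None
mempool.free_all_blocks()
print("total after free_all_blocks:", mempool.total_bytes() / 1024**3)

With only the float32 array ever allocated, used_bytes and total_bytes should stay close to each other (roughly 14.9 GiB for this shape), well under the limit.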