python · pytorch · dataset · ram

pytorch code messes up my RAM when using torch.zeros()


I have a function that measures the RAM allocated by the Python process, in megabytes:

import os
import psutil

def getram(): print(psutil.Process(os.getpid()).memory_info().rss / 1024**2)

I also have: device = "cuda"

My problem is that the following code allocates RAM on the host, and it's driving me crazy. Is there an actual solution, or do I have to accept my fate and switch to C++ or something?

The code:

getram()

def load_dataset(dir, filenames):
  dataset = torch.zeros((len(filenames),3,256,256), device=device)
  getram()
  for i, filename in enumerate(filenames):
    f = read_image(f"{dir}/{filename}")
    if f.shape[0] != 3: print(filename)
    dataset[i] = f.to(device)
  getram()
  return dataset

dataset = load_dataset(dataset_dir, dataset_filenames)

getram()

The code printed out the following:

533.28125
661.2890625
678.27734375
678.27734375

As you can see, as soon as I create the zero-filled tensor with torch.zeros(), it takes up RAM for no apparent reason.

I tried gc.collect(), but it didn't help at all.


Solution

  • The RAM usage you are seeing is caused by loading the CUDA libraries, not by the tensor itself. The first time you use CUDA, PyTorch lazily loads the CUDA runtime and related libraries into host RAM. You can verify this with the code below (the RAM numbers are what I got on my system; you will probably get different values, but the overall pattern should be the same):

    import os
    import psutil
    import torch
    import time 
    
    def getram(): 
        print(psutil.Process(os.getpid()).memory_info().rss / 1024**2)
    
    device = 'cuda:0'
        
    # get baseline ram
    getram()
    > 331.7734375
    
    # create first cuda tensor
    # this causes a large RAM increase due to loading CUDA libraries
    tmp = torch.zeros(1, device=device)
    time.sleep(0.1)
    getram()
    > 1251.25
    
    # create dataset on GPU
    dataset = torch.zeros((128,3,256,256), device=device)
    
    # Slight RAM increase but mostly unchanged
    getram()
    > 1252.203125
    

    Note that the time.sleep(0.1) is there because I found that running getram immediately after allocating tmp would sometimes return a value while the CUDA libraries were still loading (i.e. running getram again right afterwards, without allocating anything else, would yield a different result). The sleep ensures the libraries are fully loaded before measuring.
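
    If you'd rather not rely on a sleep, a possible alternative is to force CUDA initialization explicitly and synchronize before measuring. This is just a sketch, not part of the measurement above: torch.cuda.init(), torch.cuda.synchronize(), and torch.cuda.memory_allocated() are standard PyTorch calls, but whether they make the sleep unnecessary on your setup is an assumption you'd want to verify with getram. memory_allocated() is included to show that the dataset itself occupies GPU memory, not host RAM.

    # Sketch: explicit CUDA warm-up instead of a sleep (assumption: this
    # covers the lazy library loading; verify on your own system).
    import os
    import psutil
    import torch

    def getram():
        print(psutil.Process(os.getpid()).memory_info().rss / 1024**2)

    device = 'cuda:0'

    getram()                                  # baseline, before any CUDA work

    torch.cuda.init()                         # initialize PyTorch's CUDA state
    warmup = torch.zeros(1, device=device)    # first allocation loads the CUDA libraries
    torch.cuda.synchronize()                  # block until the device has finished

    getram()                                  # RAM now includes the CUDA runtime

    dataset = torch.zeros((128, 3, 256, 256), device=device)
    getram()                                  # barely changes: the data lives on the GPU
    print(torch.cuda.memory_allocated() / 1024**2)  # GPU memory used by tensors, in MB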