Tags: python, pytorch, gpu, nvidia-smi, vram

nvidia-smi vs torch.cuda.memory_allocated


I am checking the GPU memory usage during the training step.

To start with the main question: the GPU memory reported by the torch.cuda.memory_allocated method differs from what nvidia-smi reports, and I want to know why.

I measured the GPU usage using the vgg16 model.

This code prints the theoretical feature map size and weight size:

import torch
import torch.nn as nn
import torchvision.models as models
from functools import reduce

Model_number = 7

Model_name = ["alexnet", "vgg11_bn", "vgg16_bn", "resnet18", "resnet50", "googlenet", "vgg11", "vgg16"]
Model_weights = ["AlexNet_Weights", "VGG11_BN_Weights", "VGG16_BN_Weights", "ResNet18_Weights", "ResNet50_Weights", "GoogLeNet_Weights", "VGG11_Weights", "VGG16_Weights"]

exec(f"from torchvision.models import {Model_name[Model_number]}, {Model_weights[Model_number]}")
exec(f"weights = {Model_weights[Model_number]}.DEFAULT")
exec(f"model = {Model_name[Model_number]}(weights=None)")

weight_memory_allocate = 0
feature_map_allocate = 0

weight_type = 4 # float32 = 4, half = 2

batch_size = 128
input_channels = 3
input_size = [batch_size, input_channels, 224, 224]

def check_model_info(m):
    global input_size
    global weight_memory_allocate, feature_map_allocate

    if isinstance(m, nn.Conv2d):
        in_channels, out_channels = m.in_channels, m.out_channels
        kernel_size, stride, padding = m.kernel_size[0], m.stride[0], m.padding[0]

        # weight
        weight_memory_allocate += in_channels * out_channels * kernel_size * kernel_size * weight_type
        # bias
        weight_memory_allocate += out_channels * weight_type
        # feature map
        feature_map_allocate += reduce(lambda a, b: a * b, input_size, 1) * weight_type

        out_len = int((input_size[2] + 2 * padding - kernel_size)/stride + 1)
        input_size = [batch_size, out_channels, out_len, out_len]

    elif isinstance(m, nn.Linear):
        input_size = [batch_size, reduce(lambda a, b: a * b, input_size[1:], 1)]
        in_nodes, out_nodes = m.in_features, m.out_features

        # weight
        weight_memory_allocate += in_nodes * out_nodes * weight_type
        # bias
        weight_memory_allocate += out_nodes * weight_type
        #feature map
        feature_map_allocate += reduce(lambda a, b: a * b, input_size, 1) * weight_type

        input_size = [batch_size, out_nodes]

    elif isinstance(m, nn.MaxPool2d):
        out_len = int((input_size[2] + 2 * m.padding - m.kernel_size)/m.stride + 1)
        input_size = [batch_size, input_size[1], out_len, out_len]

model.apply(check_model_info)

print("---------------------------------------------------------")
print("origial memory allocate")
print(f"total = {(weight_memory_allocate + feature_map_allocate)/1024.0/1024.0:.2f}MB")
print(f"weight = {weight_memory_allocate/1024.0/1024.0:.2f}MB")
print(f"feature_map = {feature_map_allocate/1024.0/1024.0:.2f}MB")
print("---------------------------------------------------------")

Output:

---------------------------------------------------------
original memory allocate
total = 4978.54MB
weight = 527.79MB
feature_map = 4450.75MB
---------------------------------------------------------
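
As a sanity check on the weight figure: vgg16 has about 138.4 million parameters, so counting them directly on the model built above (at 4 bytes each) reproduces the same ~528MB:

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters = {n_params * weight_type/1024.0/1024.0:.2f}MB")  # ~527.79MB for vgg16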

And this code checks GPU usage with torch.cuda.memory_allocated:

def test_memory_training(in_size=(3,224,224), out_size=1000, optimizer_type=torch.optim.SGD, batch_size=1, use_amp=False, device=0):
    sample_input = torch.randn(batch_size, *in_size, dtype=torch.float32)
    optimizer = optimizer_type(model.parameters(), lr=.001)
    model.to(device)
    print(f"After model to device: {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
    for i in range(5):
        optimizer.zero_grad()
        print("Iteration", i)
        with torch.cuda.amp.autocast(enabled=use_amp):
            a = torch.cuda.memory_allocated(device)
            out = model(sample_input.to(device)).sum() # Taking the sum here just to get a scalar output
            b = torch.cuda.memory_allocated(device)
        print(f"After forward pass {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
        print(f"Memory consumed by forward pass {to_MB(b - a):.2f}MB")
        out.backward()
        print(f"After backward pass {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
        optimizer.step()
        print(f"After optimizer step {to_MB(torch.cuda.memory_allocated(device)):.2f}MB")
        print("---------------------------------------------------------")


def to_MB(a):
    return a/1024.0/1024.0

test_memory_training(batch_size=batch_size)

Output:

After model to device: 529.04MB
Iteration 0
After forward pass 9481.04MB
Memory consumed by forward pass 8952.00MB
After backward pass 1057.21MB
After optimizer step 1057.21MB
---------------------------------------------------------
Iteration 1
After forward pass 10009.21MB
Memory consumed by forward pass 8952.00MB
After backward pass 1057.21MB
After optimizer step 1057.21MB
---------------------------------------------------------
......
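
torch.cuda.memory_allocated only counts tensors that are alive at the moment it is called. To get numbers that are easier to compare against nvidia-smi, the peak allocation within an iteration and the memory held by PyTorch's caching allocator can be read out as well (a minimal sketch; device 0 is assumed, as in the code above):

device = 0
torch.cuda.reset_peak_memory_stats(device)
# ... run one training iteration here ...
peak = torch.cuda.max_memory_allocated(device)  # high-water mark since the reset
reserved = torch.cuda.memory_reserved(device)   # memory held by the caching allocator
print(f"peak allocated = {peak/1024.0/1024.0:.2f}MB, reserved = {reserved/1024.0/1024.0:.2f}MB")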

This is what nvidia-smi reports while training:

[screenshot of nvidia-smi output during training]

Here's a more detailed question:

I think PyTorch stores the following three things during the training step:

  1. model parameters
  2. input feature maps in the forward pass
  3. model gradient information for the optimizer (see the sketch after this list)
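
For item 3, a rough sanity check (just a sketch, reusing the model built above): every parameter gets a gradient tensor of the same shape and dtype, so the gradient memory should be about the same size as the parameter memory.

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameters         ~ {param_bytes/1024.0/1024.0:.2f}MB")
print(f"parameters + grads ~ {2 * param_bytes/1024.0/1024.0:.2f}MB")  # close to the ~1057MB printed after the backward pass above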

In the forward pass, I think the input feature maps should be stored. In theory I expected that to take 4450.75MB of memory, but the forward pass actually consumes 8952.00MB, almost a 2x difference.

Also, if you compare the memory usage reported by nvidia-smi with torch.cuda.memory_allocated, nvidia-smi shows roughly twice as much memory.

What causes these differences?

Thanks for reading the long question. Any help is appreciated.


Solution

  • What is displayed in nvidia-smi is probably not the allocated memory, but the reserved memory.

    You can also read out the reserved memory using torch.cuda.memory_reserved().
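
    For example, a minimal sketch of reading both counters for device 0:

    import torch

    device = 0
    x = torch.randn(1024, 1024, device=device)  # allocate an arbitrary tensor on the GPU

    allocated = torch.cuda.memory_allocated(device)  # bytes occupied by live tensors
    reserved = torch.cuda.memory_reserved(device)    # bytes held by PyTorch's caching allocator
    print(f"allocated = {allocated/1024.0/1024.0:.2f}MB, reserved = {reserved/1024.0/1024.0:.2f}MB")

    The reserved value should be much closer to what nvidia-smi displays, since the caching allocator holds on to memory even after the tensors inside it are freed.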