I want to put together a very simple Lightning example using DeepSpeed, but it refuses to partition the layers across GPUs even when I set the strategy to stage 3.
I'm just blowing up the model by adding FC layers in the hope that they get distributed across the different GPUs (6 in total).
But I end up with:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 3; 15.00 GiB total capacity; 14.00 GiB already allocated; 5.25 MiB free; 14.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Therefore I guess the layers are only placed on a single GPU.
The full code is available here, but this is the gist of it:
Blowing up the model with 18000 layers:
import torch
from torch import nn
import lightning as L

n_layers = 18000  # blow up the model with 18000 layers

class TelModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        embed_dim = 512
        component_list = [
            nn.Linear(512, embed_dim)
        # ] + [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(n_layers)] + [
        ] + [nn.Linear(embed_dim, 512) for _ in range(n_layers)] + [
            nn.Linear(embed_dim, 512)
        ]
        self.net = torch.nn.Sequential(*component_list)
Initializing DeepSpeed:
tel_model = TelModel()
train_ds = RandomDataset(100)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE)
trainer = L.Trainer(accelerator="gpu", devices=6, strategy="deepspeed_stage_3", precision=32)
trainer.fit(tel_model, train_loader)
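As far as I understand, the "deepspeed_stage_3" string is just shorthand for configuring the strategy object explicitly, so the following sketch should be equivalent (assuming the lightning.pytorch.strategies import path):
from lightning.pytorch.strategies import DeepSpeedStrategy

# Explicitly request ZeRO stage 3, which partitions optimizer states,
# gradients, and model parameters across the data-parallel ranks.
trainer = L.Trainer(
    accelerator="gpu",
    devices=6,
    strategy=DeepSpeedStrategy(stage=3),
    precision=32,
)
trainer.fit(tel_model, train_loader)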
And finally, I run it like this:
deepspeed lightning-deepspeed-tel.py
The batch size you pass to the DataLoader is the batch size per device, not the global batch size. The CUDA OOM error is most likely because a per-device batch size of 256 is too big; a smaller batch size such as 32 or 64 should solve the issue. The effective batch size of your code is batch_size_per_device x num_nodes x num_gpus.
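For example, a minimal sketch of the change, assuming the rest of the script stays the same:
BATCH_SIZE = 32  # per-device batch size; was 256

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE)
trainer = L.Trainer(accelerator="gpu", devices=6, strategy="deepspeed_stage_3", precision=32)
trainer.fit(tel_model, train_loader)
With 6 GPUs on a single node, this gives an effective batch size of 32 x 1 x 6 = 192.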