Tags: python, deep-learning, pytorch, resnet

SimCLR/ResNet18: last fractional batch mechanism not working? (incompatible tensor shapes)


I'm implementing a SimCLR/ResNet18 architecture over a custom dataset.

I know that

Number of Iterations in One Epoch = Total Training Dataset Size / Batch Size

And if the result is not an integer, the last batch shrinks to hold the leftovers (the 'fractional batch'). However, in my case this mechanism does not seem to work. My dataset has 7000 samples. If I use a batch size of 70, I get 7000/70 = 100 iterations with no fractional batch, and training runs fine. However, if I use a batch size of 32, for instance, I get the following error (full stack trace):

/home/wlutz/PycharmProjects/hiv-image-analysis/venv/bin/python /home/wlutz/PycharmProjects/hiv-image-analysis/main.py 
2023-10-20 11:12:22.106008: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-20 11:12:22.107921: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-20 11:12:22.133919: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-20 11:12:22.133941: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-20 11:12:22.133955: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-20 11:12:22.138715: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-20 11:12:22.737271: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/__init__.py:11: FutureWarning: In the future `np.object` will be defined as the corresponding NumPy scalar.
  if not hasattr(numpy, tp_name):
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/__init__.py:11: FutureWarning: In the future `np.bool` will be defined as the corresponding NumPy scalar.
  if not hasattr(numpy, tp_name):
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:34: UnderReviewWarning: The feature generate_power_seq is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  "lr_options": generate_power_seq(LEARNING_RATE_CIFAR, 11),
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/models/self_supervised/amdim/amdim_module.py:92: UnderReviewWarning: The feature FeatureMapContrastiveTask is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  contrastive_task: Union[FeatureMapContrastiveTask] = FeatureMapContrastiveTask("01, 02, 11"),
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pl_bolts/losses/self_supervised_learning.py:228: UnderReviewWarning: The feature AmdimNCELoss is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  self.nce_loss = AmdimNCELoss(tclip)
available_gpus: 0
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
Dim MLP input: 512
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:478: LightningDeprecationWarning: Setting `Trainer(gpus=0)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=0)` instead.
  rank_zero_deprecation(
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:613: UserWarning: Checkpoint directory /home/wlutz/PycharmProjects/hiv-image-analysis/saved_models exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
/home/wlutz/PycharmProjects/hiv-image-analysis/main.py:330: UnderReviewWarning: The feature LinearWarmupCosineAnnealingLR is currently marked under review. The compatibility with other Lightning projects is not guaranteed and API may change at any time. The API and functionality may change without warning in future releases. More details: https://lightning-bolts.readthedocs.io/en/latest/stability.html
  scheduler_warmup = LinearWarmupCosineAnnealingLR(optimizer, warmup_epochs=10, max_epochs=max_epochs,

  | Name  | Type            | Params
------------------------------------------
0 | model | AddProjection   | 11.5 M
1 | loss  | ContrastiveLoss | 0     
------------------------------------------
11.5 M    Trainable params
0         Non-trainable params
11.5 M    Total params
46.024    Total estimated model params size (MB)
Optimizer Adam, Learning Rate 0.0003, Effective batch size 160
Epoch 0: 100%|█████████▉| 218/219 [04:03<00:01,  1.12s/it, loss=3.74, v_num=58, Contrastive loss_step=3.650]Traceback (most recent call last):
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/main.py", line 388, in <module>
    trainer.fit(model, data_loader)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1754, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 119, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/optim/adam.py", line 143, in step
    loss = closure()
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 105, in _wrap_closure
    closure_result = closure()
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 149, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 135, in closure
    step_output = self._step_fn()
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 419, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 378, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/main.py", line 316, in training_step
    loss = self.loss(z1, z2)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wlutz/PycharmProjects/hiv-image-analysis/main.py", line 243, in forward
    denominator = device_as(self.mask, similarity_matrix) * torch.exp(similarity_matrix / self.temperature)
RuntimeError: The size of tensor a (64) must match the size of tensor b (48) at non-singleton dimension 1

Process finished with exit code 1

Here is some code (error happens at last line):

train_config = Hparams()

reproducibility(train_config)

model = SimCLR_pl(train_config, model=resnet18(pretrained=False), feat_dim=512)

transform = Augment(train_config.img_size)
data_loader = get_stl_dataloader(train_config.batch_size, transform)

accumulator = GradientAccumulationScheduler(scheduling={0: train_config.gradient_accumulation_steps})
checkpoint_callback = ModelCheckpoint(filename=filename, dirpath=save_model_path, every_n_epochs=2,
                                      save_last=True, save_top_k=2, monitor='Contrastive loss_epoch', mode='min')

trainer = Trainer(callbacks=[accumulator, checkpoint_callback],
                  gpus=available_gpus,
                  max_epochs=train_config.epochs)

trainer.fit(model, data_loader)

and here are my classes:

class Hparams:
    def __init__(self):
        self.epochs = 10  # number of training epochs
        self.seed = 33333  # randomness seed
        self.cuda = True  # use nvidia gpu
        self.img_size = 224  # image shape
        self.save = "./saved_models/"  # save checkpoint
        self.load = False  # load pretrained checkpoint
        self.gradient_accumulation_steps = 5  # gradient accumulation steps
        self.batch_size = 70
        self.lr = 3e-4  # for ADAm only
        self.weight_decay = 1e-6
        self.embedding_size = 128  # papers value is 128
        self.temperature = 0.5  # 0.1 or 0.5
        self.checkpoint_path = '/media/wlutz/TOSHIBA EXT/Image Analysis/VIH PROJECT/models'  # replace checkpoint path here


class SimCLR_pl(pl.LightningModule):
    def __init__(self, config, model=None, feat_dim=512):
        super().__init__()
        self.config = config

        self.model = AddProjection(config, model=model, mlp_dim=feat_dim)

        self.loss = ContrastiveLoss(config.batch_size, temperature=self.config.temperature)

    def forward(self, X):
        return self.model(X)

    def training_step(self, batch, batch_idx):
        (x1, x2) = batch
        z1 = self.model(x1)
        z2 = self.model(x2)
        loss = self.loss(z1, z2)
        self.log('Contrastive loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        max_epochs = int(self.config.epochs)
        param_groups = define_param_groups(self.model, self.config.weight_decay, 'adam')
        lr = self.config.lr
        optimizer = Adam(param_groups, lr=lr, weight_decay=self.config.weight_decay)

        print(f'Optimizer Adam, '
              f'Learning Rate {lr}, '
              f'Effective batch size {self.config.batch_size * self.config.gradient_accumulation_steps}')

        scheduler_warmup = LinearWarmupCosineAnnealingLR(optimizer, warmup_epochs=10, max_epochs=max_epochs,
                                                         warmup_start_lr=0.0)

        return [optimizer], [scheduler_warmup]


class AddProjection(nn.Module):
    def __init__(self, config, model=None, mlp_dim=512):
        super(AddProjection, self).__init__()
        embedding_size = config.embedding_size
        self.backbone = default(model, models.resnet18(pretrained=False, num_classes=config.embedding_size))
        mlp_dim = default(mlp_dim, self.backbone.fc.in_features)
        print('Dim MLP input:', mlp_dim)
        self.backbone.fc = nn.Identity()

        # add mlp projection head
        self.projection = nn.Sequential(
            nn.Linear(in_features=mlp_dim, out_features=mlp_dim),
            nn.BatchNorm1d(mlp_dim),
            nn.ReLU(),
            nn.Linear(in_features=mlp_dim, out_features=embedding_size),
            nn.BatchNorm1d(embedding_size),
        )

    def forward(self, x, return_embedding=False):
        embedding = self.backbone(x)
        if return_embedding:
            return embedding
        return self.projection(embedding)

class ContrastiveLoss(nn.Module):
    """
    Vanilla Contrastive loss, also called InfoNceLoss as in SimCLR paper
    """

    def __init__(self, batch_size, temperature=0.5):
        super().__init__()
        self.batch_size = batch_size
        self.temperature = temperature
        self.mask = (~torch.eye(batch_size * 2, batch_size * 2, dtype=bool)).float()

    def calc_similarity_batch(self, a, b):
        representations = torch.cat([a, b], dim=0)
        similarity_matrix = F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)
        return similarity_matrix

    def forward(self, proj_1, proj_2):
        """
        proj_1 and proj_2 are batched embeddings [batch, embedding_dim]
        where corresponding indices are pairs
        z_i, z_j in the SimCLR paper
        """
        batch_size = proj_1.shape[0]
        z_i = F.normalize(proj_1, p=2, dim=1)
        z_j = F.normalize(proj_2, p=2, dim=1)
        similarity_matrix = self.calc_similarity_batch(z_i, z_j)

        sim_ij = torch.diag(similarity_matrix, batch_size)
        sim_ji = torch.diag(similarity_matrix, -batch_size)

        positives = torch.cat([sim_ij, sim_ji], dim=0)

        nominator = torch.exp(positives / self.temperature)
        # print(" sim matrix ", similarity_matrix.shape)
        # print(" device ", device_as(self.mask, similarity_matrix).shape, " torch exp ", torch.exp(similarity_matrix / self.temperature).shape)
        denominator = device_as(self.mask, similarity_matrix) * torch.exp(similarity_matrix / self.temperature)

        all_losses = -torch.log(nominator / torch.sum(denominator, dim=1))
        loss = torch.sum(all_losses) / (2 * self.batch_size)
        return loss

class ImageDataResourceDataset(VisionDataset):
    train_list = ['train_X_v1.bin', ]
    test_list = ['test_X_v1.bin', ]

    def __init__(self, root: str, transform: Optional[Callable] = None, ):
        super().__init__(root=root, transform=transform)
        self.data = self.__loadfile(self.train_list[0])

    def __len__(self) -> int:
        return self.data.shape[0]

    def __getitem__(self, idx):
        img = self.data[idx]
        img = np.transpose(img, (1, 2, 0))
        img = Image.fromarray(img)
        img = self.transform(img)
        return img

    def __loadfile(self, data_file: str) -> np.ndarray:
        path_to_data = os.path.join(os.getcwd(), 'datasets', data_file)
        everything = np.fromfile(path_to_data, dtype=np.uint8)
        images = np.reshape(everything, (-1, 3, 224, 224))
        images = np.transpose(images, (0, 1, 3, 2))
        return images

For the record, my dataset has 7000 RGB images of size 224x224.

Why is my last 'fractional' batch not supported? Many thanks for your help.


Solution

  • I found the original source code on the GitHub repository at: https://github.com/The-AI-Summer/simclr/blob/main/AI_Summer_SimCLR_Resnet18_STL10.ipynb

    Based on the error message you provided:

    File "/home/wlutz/PycharmProjects/hiv-image-analysis/main.py", line 243, in forward
        denominator = device_as(self.mask, similarity_matrix) * torch.exp(similarity_matrix / self.temperature)
    

    The problem seems to be originating from this line of code:

    denominator = device_as(self.mask, similarity_matrix) * torch.exp(similarity_matrix / self.temperature)
    

    within the class ContrastiveLoss(nn.Module).

    To investigate the issue with this class, let's first examine its initialization variables:

    def __init__(self, batch_size, temperature=0.5):
        super().__init__()
        self.batch_size = batch_size
        self.temperature = temperature
        self.mask = (~torch.eye(batch_size * 2, batch_size * 2, dtype=bool)).float()
    

    The error appears to be related to the batch_size. We need to identify which part of the code calls ContrastiveLoss().

    Now, let's locate where ContrastiveLoss() is used within the class SimCLR_pl(pl.LightningModule):

    self.loss = ContrastiveLoss(config.batch_size, temperature=self.config.temperature)
    

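    The problem lies in config.batch_size. ContrastiveLoss() builds its mask once, in __init__, from config.batch_size, so the mask always has the shape of a full batch, whatever the size of the batch that actually arrives in forward.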

    You are using 7000 data points with a batch size of 32. Therefore, the batch size for the final iteration is 7000 % 32 = 24.
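
    The arithmetic can be checked with a few lines of plain Python. This is only a sanity check using the numbers from the question (7000 samples, batch size 32); the result of 219 iterations is consistent with the `218/219` shown in the progress bar just before the crash.

```python
import math

dataset_size = 7000
batch_size = 32

# Number of iterations per epoch when the incomplete last batch is kept.
iterations = math.ceil(dataset_size / batch_size)

# Size of the leftover ("fractional") batch; 0 would mean the split is exact.
last_batch_size = dataset_size % batch_size

print(iterations, last_batch_size)  # 219 24
```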

    Because the __init__ function utilizes:

    self.mask = (~torch.eye(batch_size * 2, batch_size * 2, dtype=bool)).float()
    

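    the mask is fixed at (config.batch_size * 2) x (config.batch_size * 2) = 64 x 64. The similarity matrix of the last batch, however, is only (24 * 2) x (24 * 2) = 48 x 48, and the two cannot be multiplied element-wise. This discrepancy is the cause of the error message: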

    RuntimeError: The size of tensor a (64) must match the size of tensor b (48) at non-singleton dimension 1
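
    The mismatch can be reproduced in isolation. The snippet below is a minimal sketch, not the asker's code: `similarity_matrix` is a random stand-in with the shape the last batch would produce, and the mask is built the same way as in `ContrastiveLoss.__init__`.

```python
import torch

full_batch, last_batch = 32, 24

# Mask built once for a full batch, as in ContrastiveLoss.__init__: 64 x 64.
mask = (~torch.eye(full_batch * 2, dtype=torch.bool)).float()

# Random stand-in for the similarity matrix of the last batch: 48 x 48.
similarity_matrix = torch.rand(last_batch * 2, last_batch * 2)

try:
    mask * similarity_matrix  # element-wise product, shapes cannot broadcast
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```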
    

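    To resolve this issue, make sure every batch has the size that ContrastiveLoss() expects. Either drop the incomplete last batch (drop_last=True in the DataLoader), or build the mask inside forward from the actual batch size (proj_1.shape[0]) and divide the summed loss by that size instead of self.batch_size.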
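
    Two common ways to get there are sketched below, under stated assumptions. `DynamicContrastiveLoss` is a hypothetical name for a variant of the loss that sizes its mask from the incoming batch, and the toy `TensorDataset` only stands in for the real data. How `get_stl_dataloader` builds its `DataLoader` is not shown in the question, so `drop_last=True` would have to be passed wherever that helper creates it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


class DynamicContrastiveLoss(nn.Module):
    """Variant of ContrastiveLoss that sizes its mask from the incoming batch."""

    def __init__(self, temperature=0.5):
        super().__init__()
        self.temperature = temperature

    def forward(self, proj_1, proj_2):
        batch_size = proj_1.shape[0]  # actual size, e.g. 24 for the last batch
        z = torch.cat([F.normalize(proj_1, p=2, dim=1),
                       F.normalize(proj_2, p=2, dim=1)], dim=0)
        sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=2)

        positives = torch.cat([torch.diag(sim, batch_size),
                               torch.diag(sim, -batch_size)], dim=0)
        # Mask out self-similarity; built here so it always matches `sim`.
        mask = (~torch.eye(2 * batch_size, dtype=torch.bool,
                           device=sim.device)).float()

        numerator = torch.exp(positives / self.temperature)
        denominator = torch.sum(mask * torch.exp(sim / self.temperature), dim=1)
        return torch.sum(-torch.log(numerator / denominator)) / (2 * batch_size)


# Option 1: drop the incomplete batch. Toy data stands in for the real dataset.
toy_dataset = TensorDataset(torch.zeros(7000, 4))
loader = DataLoader(toy_dataset, batch_size=32, drop_last=True)
num_batches = len(loader)  # 218 full batches; the 24 leftovers are skipped

# Option 2: keep the last batch and let the loss adapt to its size.
loss_fn = DynamicContrastiveLoss(temperature=0.5)
full_loss = loss_fn(torch.randn(32, 128), torch.randn(32, 128))
last_loss = loss_fn(torch.randn(24, 128), torch.randn(24, 128))
print(num_batches, full_loss.item(), last_loss.item())
```

    Option 1 is the smaller change and keeps every batch at the same size, at the cost of skipping 24 samples per epoch (different ones each epoch if the loader shuffles). Option 2 uses all samples, but the last step then sees fewer negatives than the others.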