pytorch · kaggle · transformer-model · simpletransformers

SimpleTransformers "max_seq_length" argument results in CUDA out of memory error in Kaggle and Google Colab


When fine-tuning the sloBERTa Transformer model (based on CamemBERT) for a multiclass classification task with SimpleTransformers, I want to use the model argument "max_seq_length": 512, since previous work reports better results with 512 than with 128, but including this argument triggers the error below. The error is the same in the Kaggle and Google Colab environments, and terminating the execution and rerunning it does not help. The error is triggered no matter how small the number of training epochs is, and the dataset contains only 600 instances (with text as strings and labels as integers). I've tried lowering max_seq_length to 509, 500 and 128, but the error persists.

The setup without this argument works normally and allows training with 90 epochs, so I otherwise have enough memory.

from simpletransformers.classification import ClassificationModel

# define hyperparameters
model_args = {
    "overwrite_output_dir": True,
    "num_train_epochs": 90,
    "labels_list": LABELS_NUM,
    "learning_rate": 1e-5,
    "train_batch_size": 32,
    "no_cache": True,
    "no_save": True,
    # "max_seq_length": 512,
    "save_steps": -1,
}

model = ClassificationModel(
    "camembert", "EMBEDDIA/sloberta",
    use_cuda=device,
    num_labels=NUM_LABELS,
    args=model_args,
)

model.train_model(train_df)

This is the error:

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_34/2529369927.py in <module>
    19     args = model_args)
    20 
---> 21 model.train_model(train_df)

/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in train_model(self, train_df, multi_label, output_dir, show_running_loss, args, eval_df, verbose, **kwargs)
   610             eval_df=eval_df,
   611             verbose=verbose,
--> 612             **kwargs,
   613         )
   614 

/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in train(self, train_dataloader, output_dir, multi_label, show_running_loss, eval_df, test_df, verbose, **kwargs)
   883                             loss_fct=self.loss_fct,
   884                             num_labels=self.num_labels,
--> 885                             args=self.args,
   886                         )
   887                 else:

/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in _calculate_loss(self, model, inputs, loss_fct, num_labels, args)
  2256 
  2257     def _calculate_loss(self, model, inputs, loss_fct, num_labels, args):
-> 2258         outputs = model(**inputs)
  2259         # model outputs are always tuple in pytorch-transformers (see doc)
  2260         loss = outputs[0]

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),

/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
  1210             output_attentions=output_attentions,
  1211             output_hidden_states=output_hidden_states,
-> 1212             return_dict=return_dict,
  1213         )
  1214         sequence_output = outputs[0]

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),

/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   859             output_attentions=output_attentions,
   860             output_hidden_states=output_hidden_states,
--> 861             return_dict=return_dict,
   862         )
   863         sequence_output = encoder_outputs[0]

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),

/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   531                     encoder_attention_mask,
   532                     past_key_value,
--> 533                     output_attentions,
   534                 )
   535 

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),

/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
   415             head_mask,
   416             output_attentions=output_attentions,
--> 417             past_key_value=self_attn_past_key_value,
   418         )
   419         attention_output = self_attention_outputs[0]

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),

/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
   344             encoder_attention_mask,
   345             past_key_value,
--> 346             output_attentions,
   347         )
   348         attention_output = self.output(self_outputs[0], hidden_states)

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),

/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
   273             attention_probs = attention_probs * head_mask
   274 
--> 275         context_layer = torch.matmul(attention_probs, value_layer)
   276 
   277         context_layer = context_layer.permute(0, 2, 1, 3).contiguous()

RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 15.90 GiB total capacity; 15.04 GiB already allocated; 15.75 MiB free; 15.12 GiB reserved in total by PyTorch)

Additional code, in case it helps (I've tried every PyTorch-related fix I could find online; the full notebook is available at https://www.kaggle.com/tajakuz/0-sloberta-example-max-seq-length-error):

!conda install --yes "pytorch>=1.6" cudatoolkit=11.0 -c pytorch

# install transformers and simpletransformers
!pip install -q transformers
!pip install --upgrade transformers
!pip install -q simpletransformers

# check installed version
!pip freeze | grep simpletransformers

!pip uninstall -q torch -y
!pip install -q torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

# pytorch libraries
import torch # the main pytorch library
import torch.nn as nn # the sub-library containing Softmax, Module and other useful functions
import torch.optim as optim # the sub-library containing the common optimizers (SGD, Adam, etc.)
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# import other necessary packages
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')

from scipy.special import softmax

Thank you so much for your help, it is really appreciated!


Solution

  • This happens because max_seq_length sets the number of tokens each input is padded or truncated to. It does not add trainable parameters, but the activations that must be kept for the backward pass grow with it, and the self-attention score matrices grow quadratically with sequence length, so going from 128 to 512 tokens at a train_batch_size of 32 easily exceeds the ~16 GiB GPUs available on Kaggle and Colab (a rough estimate is sketched below).

    Most of the time, the right max_seq_length depends on the dataset; setting it higher than the data actually needs wastes training time and memory.

    What you can do is find the maximum number of tokens (not words; use the model's own tokenizer) per sample in your training dataset and use that as your max_seq_length, as in the second sketch below. If that value is still large, trade batch size for sequence length, e.g. lower train_batch_size and compensate with gradient accumulation, so the activations fit in memory.
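
    For a rough sense of the scale, here is a back-of-the-envelope sketch. It assumes the base-size configuration of 12 layers and 12 attention heads that sloBERTa inherits from CamemBERT, fp32 activations, and the batch size of 32 from the question; the real footprint also includes all the other per-token activations kept for backprop, so treat this as a lower bound, not an exact figure.

    num_layers = 12
    num_heads = 12
    batch_size = 32
    bytes_per_float = 4  # fp32

    def attention_score_bytes(seq_len):
        # One (seq_len x seq_len) attention-score matrix per head, per layer, per sample.
        return num_layers * num_heads * batch_size * seq_len * seq_len * bytes_per_float

    for seq_len in (128, 512):
        gib = attention_score_bytes(seq_len) / 2**30
        print(f"seq_len={seq_len}: ~{gib:.1f} GiB just for attention scores")

    # seq_len=128: ~0.3 GiB, seq_len=512: ~4.5 GiB -- a 16x jump ((512/128)**2),
    # before counting the other activations training has to keep around.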
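
    And a minimal sketch of that last suggestion, assuming the train_df from the question has a "text" column (SimpleTransformers' default column name) and that NUM_LABELS is defined as in the question; the reduced train_batch_size and gradient_accumulation_steps values are illustrative, not the only workable combination.

    from transformers import AutoTokenizer
    from simpletransformers.classification import ClassificationModel

    tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")

    # Length in tokens (not words) of the longest training sample,
    # including the special tokens added around each sequence.
    longest = max(
        len(tokenizer.encode(text, add_special_tokens=True))
        for text in train_df["text"]
    )
    max_len = min(longest, 512)  # the model cannot use more than 512 positions

    model_args = {
        "overwrite_output_dir": True,
        "num_train_epochs": 90,
        "learning_rate": 1e-5,
        "max_seq_length": max_len,
        # If max_len is still large, trade batch size for sequence length;
        # gradient_accumulation_steps keeps the effective batch size at 32.
        "train_batch_size": 8,
        "gradient_accumulation_steps": 4,
        "no_cache": True,
        "no_save": True,
        "save_steps": -1,
    }

    model = ClassificationModel(
        "camembert", "EMBEDDIA/sloberta",
        use_cuda=True,
        num_labels=NUM_LABELS,
        args=model_args,
    )
    model.train_model(train_df)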