When fine-tuning the SloBERTa Transformer model (based on CamemBERT) for a multiclass classification task with SimpleTransformers, I want to use the model argument "max_seq_length": 512, since previous work reports that it gives better results than 128, but including this argument triggers the error below. The error is the same in the Kaggle and Google Colab environments, and terminating the execution and rerunning it does not help. The error is triggered no matter how small the number of training epochs is, and the dataset contains only 600 instances (with texts as strings and labels as integers). I've tried lowering max_seq_length to 509, 500 and 128, but the error persists.
The setup without this argument works normally and allows training with 90 epochs, so I otherwise have enough memory.
from simpletransformers.classification import ClassificationModel

# define hyperparameters
model_args = {
    "overwrite_output_dir": True,
    "num_train_epochs": 90,
    "labels_list": LABELS_NUM,
    "learning_rate": 1e-5,
    "train_batch_size": 32,
    "no_cache": True,
    "no_save": True,
    # "max_seq_length": 512,
    "save_steps": -1,
}

model = ClassificationModel(
    "camembert", "EMBEDDIA/sloberta",
    use_cuda=device,
    num_labels=NUM_LABELS,
    args=model_args,
)

model.train_model(train_df)
This is the error:
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_34/2529369927.py in <module>
19 args = model_args)
20
---> 21 model.train_model(train_df)
/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in train_model(self, train_df, multi_label, output_dir, show_running_loss, args, eval_df, verbose, **kwargs)
610 eval_df=eval_df,
611 verbose=verbose,
--> 612 **kwargs,
613 )
614
/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in train(self, train_dataloader, output_dir, multi_label, show_running_loss, eval_df, test_df, verbose, **kwargs)
883 loss_fct=self.loss_fct,
884 num_labels=self.num_labels,
--> 885 args=self.args,
886 )
887 else:
/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in _calculate_loss(self, model, inputs, loss_fct, num_labels, args)
2256
2257 def _calculate_loss(self, model, inputs, loss_fct, num_labels, args):
-> 2258 outputs = model(**inputs)
2259 # model outputs are always tuple in pytorch-transformers (see doc)
2260 loss = outputs[0]
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
1210 output_attentions=output_attentions,
1211 output_hidden_states=output_hidden_states,
-> 1212 return_dict=return_dict,
1213 )
1214 sequence_output = outputs[0]
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
859 output_attentions=output_attentions,
860 output_hidden_states=output_hidden_states,
--> 861 return_dict=return_dict,
862 )
863 sequence_output = encoder_outputs[0]
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
531 encoder_attention_mask,
532 past_key_value,
--> 533 output_attentions,
534 )
535
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
415 head_mask,
416 output_attentions=output_attentions,
--> 417 past_key_value=self_attn_past_key_value,
418 )
419 attention_output = self_attention_outputs[0]
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
344 encoder_attention_mask,
345 past_key_value,
--> 346 output_attentions,
347 )
348 attention_output = self.output(self_outputs[0], hidden_states)
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
273 attention_probs = attention_probs * head_mask
274
--> 275 context_layer = torch.matmul(attention_probs, value_layer)
276
277 context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 15.90 GiB total capacity; 15.04 GiB already allocated; 15.75 MiB free; 15.12 GiB reserved in total by PyTorch)
Additional code, in case it helps (I've tried everything regarding PyTorch that I found on the web; the full code can be accessed at https://www.kaggle.com/tajakuz/0-sloberta-example-max-seq-length-error):
!conda install --yes pytorch>=1.6 cudatoolkit=11.0 -c pytorch
# install simpletransformers
!pip install -q transformers
!pip install --upgrade transformers
!pip install -q simpletransformers
# check installed version
!pip freeze | grep simpletransformers
!pip uninstall -q torch -y
!pip install -q torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
# pytorch libraries
import torch # the main pytorch library
import torch.nn as nn # the sub-library containing Softmax, Module and other useful functions
import torch.optim as optim # the sub-library containing the common optimizers (SGD, Adam, etc.)
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
#importing other necessary packages and ClassificationModel for bert
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')
from scipy.special import softmax
Thank you so much for your help, it is really appreciated!
This happens because max_seq_length sets the length of the tokenized input sequences, and the memory needed for the forward and backward pass grows with it; the self-attention layers in particular scale quadratically with sequence length, so at 512 tokens with a batch size of 32 the activations can easily exceed the ~16 GiB GPU on Kaggle and Colab (the traceback shows the allocation failing inside the attention matmul). It does not add trainable parameters, but it makes every training batch much more expensive.
Most of the time, the right max_seq_length depends on your dataset, and setting it higher than necessary is wasteful in terms of training time and memory.
What you can do is find the maximum number of tokens (not words, since the tokenizer splits words into subword units) per sample in your training dataset and use that, capped at 512, as your max_seq_length. If your longest samples really do need the full 512 tokens, you will likely also have to lower train_batch_size to fit into memory.
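For example, you could tokenize the training texts with the SloBERTa tokenizer and look at the longest one. A minimal sketch, assuming your texts are in a train_df["text"] column (adjust the column name to your dataframe):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")

# length in tokens (including special tokens) of every training sample
token_lengths = [
    len(tokenizer.encode(text, add_special_tokens=True))
    for text in train_df["text"]  # assumed column name
]

max_len = max(token_lengths)
print(f"Longest sample: {max_len} tokens")

# use the observed maximum, capped at the model's 512-token limit
model_args["max_seq_length"] = min(max_len, 512)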