deep-learning nlp huggingface-transformers bert-language-model pre-trained-model

Continual pre-training vs. Fine-tuning a language model with MLM

I have some custom data I want to use to further pre-train the BERT model. I’ve tried the two following approaches so far:

Starting with a pre-trained BERT checkpoint and continuing the pre-training with Masked Language Modeling (MLM) + Next Sentence Prediction (NSP) heads (e.g. using BertForPreTraining model)
Starting with a pre-trained BERT model with the MLM objective (e.g. using the BertForMaskedLM model assuming we don’t need NSP for the pretraining part.)

But I’m still confused that if using either BertForPreTraining or BertForMaskedLM actually does the continual pre-training on BERT or these are just two models for fine-tuning that use MLM+NSP and MLM for fine-tuning BERT, respectively. Is there even any difference between fine-tuning BERT with MLM+NSP or continually pre-train it using these two heads or this is something we need to test?

I've reviewed similar questions such as this one but still, I want to make sure that whether technically there's a difference between continual pre-training a model from an initial checkpoint and fine-tuning it using the same objective/head.

Solution

The answer is a mere difference in the terminology used. When the model is trained on a large generic corpus, it is called 'pre-training'. When it is adapted to a particular task or dataset it is called as 'fine-tuning'.

Technically speaking, in either cases ('pre-training' or 'fine-tuning'), there are updates to the model weights.

For example, usually, you can just take the pre-trained model and then fine-tune it for a specific task (such as classification, question-answering, etc.). However, if you find that the target dataset is from a specific domain, and you have a few unlabled data that might help the model to adapt to the particular domain, then you can do a MLM or MLM+NSP 'fine-tuning' (unsupervised learning) (some researchers do call this as 'pre-training' especially when a huge corpus is used to train the model), followed by using the target corpus with target task fine-tuning.