bert-language-model, softmax, cross-entropy

BERT Transformer model gives an error for multiclass classification


I am trying to train a sentiment analysis model with 5 classes (1-Very Negative, 2-Negative, 3-Neutral, 4-Positive, 5-Very Positive) using BERT.

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
        
model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
        
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])
    
model.fit(train_data, epochs=2, validation_data=validation_data)

But I get the following error (just the last part of the error message):

Node: 'sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits'
Received a label value of 5 which is outside the valid range of [0, 2).  Label values: 3 4 5 2 2 4 4 3 4 5 5 4 5 5 4 4 4 3 4 4 5 5 5 4 4 5 3 5 4 4 3 5
         [[{{node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]] [Op:__inference_train_function_31614]

Can somebody tell me what I am doing wrong here?


Solution

  • The TFBertForSequenceClassification object needs to create a so-called classification head. The classification head is just a single NN layer that projects the [CLS] token representation into a vector with one logit for each possible target class.

    When you initialize the model by calling from_pretrained, you can specify num_labels, the number of target labels (see an example in the Transformers documentation). If you do not specify it, the model falls back to the default of 2 labels (binary classification). That is exactly what your error message shows: the head was built with 2 output logits, so only labels in the range [0, 2) are valid, and training fails as soon as a batch contains higher label IDs. A fixed version is sketched after this list.

    Note also that class numbering starts from zero. If you use labels 1-5, the model will have an additional, unused 0th class, so keeping the numbers 1-5 requires num_labels=6. Alternatively, shift the labels down to 0-4 and use num_labels=5, as in the sketch below.
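
    Here is a minimal sketch of the fixed setup, assuming train_data and validation_data are tf.data.Dataset objects that yield (features, label) pairs with your original 1-5 labels (those dataset names are taken from your snippet; the rest of the pipeline is unchanged). It re-maps the labels to 0-4 and sizes the head to match:

        import tensorflow as tf
        from transformers import BertTokenizer, TFBertForSequenceClassification

        # Size the classification head up front: 5 logits, one per sentiment class.
        model = TFBertForSequenceClassification.from_pretrained(
            "bert-base-cased", num_labels=5
        )
        tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

        # Shift the labels from 1-5 down to 0-4 so they fall in the valid range [0, 5).
        train_data = train_data.map(lambda x, y: (x, y - 1))
        validation_data = validation_data.map(lambda x, y: (x, y - 1))

        model.compile(
            optimizer=tf.keras.optimizers.Adam(
                learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0
            ),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=[tf.keras.metrics.SparseCategoricalAccuracy("accuracy")],
        )
        model.fit(train_data, epochs=2, validation_data=validation_data)

    Remember that after this change the model predicts classes 0-4, so add 1 to the argmax of the logits if you want to report the original 1-5 sentiment scores.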