I am using BERT for a sequence classification task with 3 labels. To do this, I am using Hugging Face Transformers with TensorFlow, more specifically the TFBertForSequenceClassification class with the bert-base-german-cased model (yes, using German sentences).
I am by no means an expert in NLP, which is why I pretty much followed the approach here: https://towardsdatascience.com/fine-tuning-hugging-face-model-with-custom-dataset-82b8092f5333 (with some tweaks, of course).
Everything seems to be working fine, but the output I receive from my model is what throws me off. Here is some of the output along the way for context.
The main difference from the article's example is the number of labels: I have 3 while the article only featured 2.
I use a LabelEncoder from sklearn.preprocessing to process my labels:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
Y_integer_encoded = label_encoder.fit_transform(Y)
Y here is a list of labels as strings, so something like this:
['e_3', 'e_1', 'e_2']
which then turns into this:
array([0, 1, 2], dtype=int64)
I then use the BertTokenizer to process my text and create the input datasets (training and testing). These are their shapes:
<TensorSliceDataset shapes: ({input_ids: (99,), token_type_ids: (99,), attention_mask: (99,)}, ()), types: ({input_ids: tf.int32, token_type_ids: tf.int32, attention_mask: tf.int32}, tf.int32)>
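In essence, the datasets are built roughly like this (a simplified sketch; I pad/truncate to 99 tokens, which is where the (99,) shapes come from):

from transformers import BertTokenizer
import tensorflow as tf

tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased')

# train_texts is a list of German sentences, train_labels the
# integer-encoded labels from the LabelEncoder above
train_encodings = tokenizer(train_texts, truncation=True,
                            padding='max_length', max_length=99)

# Pack the encodings and labels into a tf.data dataset
train_dataset = tf.data.Dataset.from_tensor_slices(
    (dict(train_encodings), train_labels))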
I then train the model as per the Hugging Face docs.
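In essence, the training step looks roughly like this (simplified; the learning rate and batch sizes here are just illustrative, not my exact values):

from transformers import TFBertForSequenceClassification
import tensorflow as tf

model = TFBertForSequenceClassification.from_pretrained(
    'bert-base-german-cased', num_labels=3)

# Integer labels + raw logits from the model -> sparse categorical
# cross-entropy with from_logits=True
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

model.fit(train_dataset.shuffle(1000).batch(16), epochs=3,
          validation_data=test_dataset.batch(16))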
The last epoch while training the model looks like this:
Epoch 3/3
108/108 [==============================] - 24s 223ms/step - loss: 25.8196 - accuracy: 0.7963 - val_loss: 24.5137 - val_accuracy: 0.7243
Then I run model.predict on an example sentence (yes, I tokenized the sentence accordingly, just like the article does). The output looks like this:
array([ 3.1293588, -5.280143 , 2.4700692], dtype=float32)
And lastly, here is the softmax function I apply at the end, and its output:
tf_prediction = tf.nn.softmax(tf_output, axis=0).numpy()[0]
output: 0.6590041
So here is my question: I don't quite understand that output. With a validation accuracy of ~72%, my model should be decent at predicting the labels. Yet the raw logits don't mean much to me, to be honest, and the output after the softmax function seems to be on a linear scale, as if it came from a sigmoid function. How do I interpret this and translate it to the label I am trying to predict?
And also: shouldn't I feed one-hot encoded labels into my BERT model for it to work? I always thought BERT needs that, but it seems like it doesn't.
Your output means that the probability of the first class is 65.9%. tf.nn.softmax(tf_output, axis=0) turns the three logits into a probability distribution over your three classes, and the trailing [0] in your code then selects only the first class's probability. To get the predicted class, take the argmax over all three probabilities instead of indexing the first one.
You can feed your labels either as integers or as one-hot vectors; you just have to use the matching loss function (categorical_crossentropy with one-hot labels, sparse_categorical_crossentropy with integer labels).
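A minimal sketch of mapping the logits back to your string labels, using the names and values from your question:

from sklearn.preprocessing import LabelEncoder
import numpy as np
import tensorflow as tf

# Fit the encoder as in your question, so classes_ is ['e_1', 'e_2', 'e_3']
label_encoder = LabelEncoder()
label_encoder.fit(['e_3', 'e_1', 'e_2'])

# Logits for one sentence, as returned by model.predict in your question
tf_output = np.array([3.1293588, -5.280143, 2.4700692], dtype=np.float32)

# Softmax over ALL logits gives a probability distribution over the 3 classes
probs = tf.nn.softmax(tf_output, axis=0).numpy()
print(probs)  # -> approx. [0.659, 0.0001, 0.341]; note they sum to 1

# The predicted class id is the argmax, not the first entry
pred_id = int(np.argmax(probs))

# Map the class id back to the original string label
pred_label = label_encoder.inverse_transform([pred_id])[0]
print(pred_label)  # -> 'e_1' for these logits

Note that the [0] in your snippet only gives the right answer here by coincidence, because class 0 happens to have the highest probability; np.argmax gives the predicted class for any input.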