When I run the Dutch sentiment analysis model RobBERTje, it outputs only Positive/Negative labels; there is no Neutral label in the results.
https://huggingface.co/DTAI-KULeuven/robbert-v2-dutch-sentiment
There are obviously neutral sentences/words, e.g. 'Fhdf' (nonsense) and 'Als gisteren inclusief blauw' (neutral), but both evaluate to positive or negative.
Is there a way to get neutral labels for such examples in RobBERTje?
from transformers import RobertaTokenizer, RobertaForSequenceClassification, pipeline

model_name = "DTAI-KULeuven/robbert-v2-dutch-sentiment"

# Load the fine-tuned RobBERT sentiment model and its tokenizer
model = RobertaForSequenceClassification.from_pretrained(model_name)
tokenizer = RobertaTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

result1 = classifier('Fhdf')
result2 = classifier('Als gisteren inclusief blauw')
print(result1)
print(result2)
Output:
[{'label': 'Positive', 'score': 0.7520257234573364}]
[{'label': 'Negative', 'score': 0.7538396120071411}]
This model was trained with only negative and positive labels, so it will try to categorize every input as positive or negative, even if the input is nonsensical or neutral.
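You can confirm this by inspecting the checkpoint's label mapping (using the model object already loaded in the question; id2label is a standard field of the Transformers config):
# Show the label set the classification head was trained with.
print(model.config.id2label)
# For this checkpoint the mapping only contains 'Negative' and 'Positive' -- there is no Neutral class.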
What you can do:
1- Find another model that was trained with a neutral label (see the sketch after this list).
2- Fine-tune this model on a dataset that includes a neutral label.
3- Empirically define a threshold on the confidence score and interpret predictions below it as neutral.
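As a sketch of the first option (the checkpoint named here is my own suggestion, not part of the original answer, and it is multilingual rather than Dutch-specific, so verify its quality on your data):
from transformers import pipeline

# Example of a model trained with a neutral class (negative/neutral/positive);
# the exact label strings depend on the checkpoint's config.
neutral_classifier = pipeline(
    'sentiment-analysis',
    model='cardiffnlp/twitter-xlm-roberta-base-sentiment',
)
print(neutral_classifier('Als gisteren inclusief blauw'))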
The first two options take considerable effort. For a quick workaround I would suggest the third: feed the model a few neutral inputs, observe the range of confidence scores it produces, and then use that threshold to classify inputs as neutral.
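For example, a small calibration loop (the neutral example texts below are just placeholders; use sentences from your own domain):
# Probe the classifier with known-neutral texts and see where their scores fall.
neutral_examples = [
    'Fhdf',
    'Als gisteren inclusief blauw',
    # add more neutral texts from your own data here
]
for text in neutral_examples:
    result = classifier(text)[0]
    print(f"{text!r}: {result['label']} ({result['score']:.3f})")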
Here's a sample. Note that for a two-label model the winning score is always at least 0.5, so the cutoff has to sit above that; 0.8 is only an illustrative starting point that you should tune on your own neutral examples:
def classify_with_neutral(text, threshold=0.8):
    result = classifier(text)[0]  # Get the single classification result
    if result['score'] < threshold:
        result['label'] = 'Neutral'  # Low confidence: override the label to 'Neutral'
    return result
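Applied to the two sentences from the question (scores taken from the output above, so with a threshold of 0.8 both fall below the cutoff):
print(classify_with_neutral('Fhdf'))                          # score ~0.752 -> relabeled 'Neutral'
print(classify_with_neutral('Als gisteren inclusief blauw'))  # score ~0.754 -> relabeled 'Neutral'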