I am trying to use an XLM-RoBERTa model I have fine-tuned for token classification, but no matter what I do, the output always comes back with all tokens stuck together, like:
[{'entity_group': 'LABEL_0',
'score': 0.4824247,
'word': 'Thedogandthecatwenttothehouse',
'start': 0,
'end': 325}]
What can I do to get the words properly separated in the output, as happens with other models like BERT?
I have tried training with add_prefix_space=True, but it does not seem to have any effect:
tokenizer = AutoTokenizer.from_pretrained('MMG/xlm-roberta-large-ner-spanish', add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english", use_cache=None, num_labels=NUM_LABELS, ignore_mismatched_sizes=True)
pipe = pipeline(task="token-classification", model=model.to("cpu"), binary_output=True, tokenizer=tokenizer, aggregation_strategy="average")
Thanks a lot in advance for your help.
The problem occurs because you used the average aggregation strategy. Since a tokenizer's unit of work is the subword, an aggregation strategy is needed to reconstruct the original words from subword pieces, and in this case that strategy is not appropriate. For more information, see the Hugging Face documentation on the token-classification pipeline's aggregation_strategy options.
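To see why the aggregation step matters, here is a minimal plain-Python sketch (no transformers required; regroup_subwords is a hypothetical helper for illustration, not a library API). XLM-RoBERTa's SentencePiece tokenizer marks the start of each word with "▁" (U+2581), and grouping on that marker is what recovers whole words; when model and tokenizer are mismatched, these markers and character offsets no longer line up, so the merged "word" degenerates into one long run of characters:

```python
def regroup_subwords(tokens):
    """Merge SentencePiece subword tokens back into words using the "▁" marker."""
    words = []
    for tok in tokens:
        if tok.startswith("\u2581"):   # "▁" marks the start of a new word
            words.append(tok[1:])
        elif words:
            words[-1] += tok           # continuation piece of the previous word
        else:
            words.append(tok)
    return words

# Subword output for "The dog went home" from an XLM-RoBERTa-style tokenizer
print(regroup_subwords(["\u2581The", "\u2581dog", "\u2581went", "\u2581home"]))
# -> ['The', 'dog', 'went', 'home']

# A word split into several pieces is merged back into one
print(regroup_subwords(["\u2581token", "ization"]))
# -> ['tokenization']
```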
Another point to keep in mind is the pairing between model and tokenizer: each model learned its representation using one particular tokenizer, and it will only function well with that same one. In your snippet the tokenizer is loaded from 'MMG/xlm-roberta-large-ner-spanish' while the model comes from 'xlm-roberta-large-finetuned-conll03-english'; load both from the same checkpoint:
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large-finetuned-conll03-english', add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english", use_cache=None, num_labels=2, ignore_mismatched_sizes=True)
pipe = pipeline(task="token-classification", model=model.to("cpu"), binary_output=True, tokenizer=tokenizer)
pipe('This is a simple test!')
output:
[{'entity': 'LABEL_0',
'score': 0.61674744,
'index': 1,
'word': '▁This',
'start': 0,
'end': 4},
{'entity': 'LABEL_0',
'score': 0.64719814,
'index': 2,
'word': '▁is',
'start': 5,
'end': 7},
{'entity': 'LABEL_0',
'score': 0.6912893,
'index': 3,
'word': '▁a',
'start': 8,
'end': 9},
{'entity': 'LABEL_0',
'score': 0.58730906,
'index': 4,
'word': '▁simple',
'start': 10,
'end': 16},
{'entity': 'LABEL_0',
'score': 0.62718534,
'index': 5,
'word': '▁test',
'start': 17,
'end': 21},
{'entity': 'LABEL_0',
'score': 0.733932,
'index': 6,
'word': '!',
'start': 21,
'end': 22}]
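Once model and tokenizer come from the same checkpoint, an aggregation strategy can be re-enabled to get grouped spans instead of per-token entries. As a rough sketch of what averaging aggregation does (group_entities below is a hypothetical helper, not the pipeline's internal implementation), consecutive tokens with the same label are merged, their scores averaged, and the "▁" markers turned back into spaces:

```python
def group_entities(entities):
    """Merge consecutive tokens that share a label; average their scores."""
    groups = []
    for ent in entities:
        word = ent["word"].replace("\u2581", " ")  # "▁" marks a word boundary
        if groups and groups[-1]["entity_group"] == ent["entity"]:
            g = groups[-1]
            g["word"] += word
            g["_scores"].append(ent["score"])
            g["end"] = ent["end"]
        else:
            groups.append({"entity_group": ent["entity"],
                           "_scores": [ent["score"]],
                           "word": word.lstrip(),
                           "start": ent["start"],
                           "end": ent["end"]})
    for g in groups:
        scores = g.pop("_scores")
        g["score"] = sum(scores) / len(scores)
    return groups

# First two per-token entries from the output above, merged into one span
entities = [
    {"entity": "LABEL_0", "score": 0.6, "word": "\u2581This", "start": 0, "end": 4},
    {"entity": "LABEL_0", "score": 0.7, "word": "\u2581is", "start": 5, "end": 7},
]
print(group_entities(entities))
# -> [{'entity_group': 'LABEL_0', 'word': 'This is', 'start': 0, 'end': 7, 'score': 0.65}]
```

With a matched checkpoint, passing aggregation_strategy="simple" (or "average") back to the pipeline should produce grouped words with correct boundaries rather than one concatenated string.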