I have noticed that if I tokenize a full text with many sentences, I sometimes get a different number of tokens than if I tokenize each sentence individually and add up the token counts. I did some debugging and have this small reproducible example that shows the issue:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize("We are in London. Thames is a river"))
I get the following output:
['Th', 'ames', 'Ġis', 'Ġa', 'Ġriver']
['We', 'Ġare', 'Ġin', 'ĠLondon', '.', 'ĠThames', 'Ġis', 'Ġa', 'Ġriver']
I would like to understand why the word Thames has been split into two tokens when it’s at the start of the sequence, whereas it’s a single token when it’s not at the start. I have noticed this behaviour is very frequent and, assuming it’s not a bug, I would like to understand why the BART tokenizer behaves like this.
According to https://github.com/huggingface/transformers/blob/main/src/transformers/models/bart/tokenization_bart.py:
This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not. You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
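In other words, the Ġ in your output marks a leading space, and the byte-level BPE vocabulary has 'ĠThames' (space + Thames) as a single token but not a bare 'Thames', which is why the word falls apart into 'Th' + 'ames' at the start of the sequence. A quick way to check that the leading space is what matters (a minimal sketch; the exact splits depend on the vocabulary, but the outputs in the comments are what I would expect):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
# Without a leading space there is no single 'Thames' token available.
print(tokenizer.tokenize("Thames"))    # expected: ['Th', 'ames']
# With an explicit leading space, 'ĠThames' is used as one token.
print(tokenizer.tokenize(" Thames"))   # expected: ['ĠThames']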
Trying
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn', add_prefix_space=True)
print(tokenizer.tokenize("Thames is a river"))
print(tokenizer.tokenize("We are in London. Thames is a river"))
yields the 'correct' result for me.
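With add_prefix_space=True, both calls should now tokenize Thames the same way, so I would expect output along these lines (assuming the same vocabulary as above):
['ĠThames', 'Ġis', 'Ġa', 'Ġriver']
['ĠWe', 'Ġare', 'Ġin', 'ĠLondon', '.', 'ĠThames', 'Ġis', 'Ġa', 'Ġriver']
Since the first word now gets the same space-prefixed treatment as every other word, the per-sentence token counts should add up to the full-text count for cases like this one.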