nlp · bert-language-model · french

BERT Vocabulary: Why does every word have "▁" before it?


My question is about the CamemBERT model (the French version of BERT) and its tokenizer:

Why does every word in the vocabulary have a "▁" character in front of it? For example, it's not "sirop" but "▁sirop" (sirop => syrup).

from transformers import CamembertTokenizer
tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
voc = tokenizer.get_vocab()  # vocabulary of the model (token -> id mapping)

print("sirop" in voc)   # will display False
print("▁sirop" in voc)  # will display True

Thank you for answering :)


Solution

  • If I understand it correctly, the CamembertTokenizer uses this special character because it is built on SentencePiece, see the source code.

    SentencePiece uses subword tokenization (splitting words into smaller tokens), but it needs to keep track internally of which splits are "real" word boundaries (where there was whitespace) and which are subword splits. To do that, it puts this character before the start of each "real" token (but not before punctuation); follow-up subword tokens don't get it, see the explanation in the GitHub repository. Basically, the whitespace is always part of the tokenization, but to avoid problems it is internally escaped as "▁".

    They use this example: "Hello World." becomes [Hello] [▁Wor] [ld] [.], which the model can consume and which can later be transformed back into the original string (detokenized = ''.join(pieces).replace('▁', ' ')) --> "Hello World.", without ambiguity and without having to store the original string separately. A short sketch of both steps follows below.
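    As a minimal sketch of both points, reusing the tokenizer from the question (the exact subword pieces you get depend on the trained vocabulary, so the outputs in the comments are only indicative):

    from transformers import CamembertTokenizer

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")

    # Only pieces that start a "real" word carry the "▁" marker;
    # follow-up subword pieces and punctuation do not.
    print(tokenizer.tokenize("sirop"))  # ['▁sirop'], since "▁sirop" is in the vocabulary

    # Detokenization as in the SentencePiece README: join the pieces and
    # un-escape "▁" back to a whitespace to restore the original string.
    pieces = ['Hello', '▁Wor', 'ld', '.']
    print(''.join(pieces).replace('▁', ' '))  # Hello World.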