huggingface-tokenizers

when is add_prefix_space option required and why?


What is the purpose of add_prefix_space, and how do I know which models require it?

from transformers import AutoTokenizer

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

I searched the HuggingFace docs, but it is still not clear to me what it does. The documentation says:

add_prefix_space (bool, optional, defaults to True) — Whether to add a space to the first word if there isn’t already one. This lets us treat "hello" exactly like "say hello".

In the GPT-2 and RoBERTa tokenizers, the space before a word is part of the word, i.e. "Hello how are you puppetter" is tokenized as ["Hello", "Ġhow", "Ġare", "Ġyou", "Ġpuppet", "ter"]. You can see the spaces included in the words as Ġ here. Spaces are converted to a special character (the Ġ) in the tokenizer prior to BPE splitting, mostly to avoid the BPE step digesting them, since the standard BPE algorithm uses spaces in its process (this can seem a bit hacky, but it was in the original GPT-2 tokenizer implementation by OpenAI).
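As a minimal sketch of what this pre-tokenization step does with spaces (an illustration only, not the real `tokenizers` implementation, assuming the GPT-2 convention of mapping a leading space to Ġ, i.e. U+0120):

```python
# Minimal sketch (not the real tokenizers library code) of how a
# byte-level BPE pre-tokenizer marks spaces, assuming the GPT-2
# convention of turning a leading space into "Ġ" (U+0120).

def pretokenize(text, add_prefix_space=False):
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    words = text.split(" ")
    # the first word has no preceding space unless one was added above;
    # every later word carried one, so re-attach it as Ġ
    out = [words[0]] if words[0] else []
    out += ["\u0120" + w for w in words[1:]]
    return out

print(pretokenize("Hello how are you"))             # ['Hello', 'Ġhow', 'Ġare', 'Ġyou']
print(pretokenize("hello", add_prefix_space=True))  # ['Ġhello']
```

With add_prefix_space=True, the first word gets the same Ġ marker as every other word, so it is looked up in the vocabulary the same way as a mid-sentence occurrence.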


The code was taken from Fine-Tuning Large Language Models (LLMs).


Solution

  • The distilbert-base-uncased model was trained to treat spaces as part of the token. As a result, the first word of a sentence is encoded differently when it is not preceded by a whitespace. Setting add_prefix_space=True ensures the first word is tokenized as if a space preceded it. You can check both the model card and its paper here: distilbert/distilbert-base-uncased

    Unfortunately, you have to read through the model's technical report or search around to see how it was trained. Happy learning! :)
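To see why this matters for the first word, here is a toy illustration (the vocabulary and ids below are made up, not from any real model): in a tokenizer trained this way, "hello" and "Ġhello" are two distinct vocabulary entries, so the same word maps to different ids depending on whether a space precedes it.

```python
# Toy vocabulary (made-up ids) showing that a word with and without a
# preceding space marker is looked up as two different tokens.
vocab = {"hello": 0, "\u0120hello": 1}

def encode(token):
    return vocab[token]

print(encode("hello"))        # 0 -- sentence-initial, no preceding space
print(encode("\u0120hello"))  # 1 -- mid-sentence, or with add_prefix_space=True
```

This is exactly the mismatch add_prefix_space=True removes: the sentence-initial word is forced onto the same vocabulary entry as its mid-sentence form.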