I am trying to train a dialog system using GPT2. For tokenization, I am using the following configuration for adding the special tokens.
from transformers import (
    AdamW,
    AutoConfig,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)
SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]",
                                  "[TRIPLE]", "[SEP]", "[Q]", "[DOM]"],
}
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
tokenizer.add_special_tokens(SPECIAL_TOKENS)
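After this call the newly added tokens do get their own ids (a quick sanity check, assuming the base model is gpt2; the exact id values depend on insertion order):
# The added tokens are appended after GPT2's original 50257-entry vocab,
# while bos/eos stay at 50256 because <|endoftext|> already exists in it.
print(tokenizer.convert_tokens_to_ids(["[PAD]", "[SUB]", "[PRED]"]))
# e.g. [50257, 50261, 50262]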
Next, when I tokenize a sequence (a dialog utterance) and later convert it into ids, some of the most important tokens in my sequence are mapped to the unknown token, and their ids become identical to the bos and eos ids, since unk_token, bos_token and eos_token all map to <|endoftext|> in GPT2's source code.
Here is a working example -
tokenized_sequence = ['[PRED]', 'name', '[SUB]', 'frankie_and_bennys', '[PRED]', 'address', '[SUB]', 'cambridge_leisure_park_clifton_way_cherry_hinton', '[PRED]', 'area', '[SUB]', 'south', '[PRED]', 'food', '[SUB]', 'italian', '[PRED]', 'phone', '[SUB]', '01223_412430', '[PRED]', 'pricerange', '[SUB]', 'expensive', '[PRED]', 'postcode', '[SUB]', 'cb17dy']
important_tokens = ['frankie_and_bennys','cambridge_leisure_park_clifton_way_cherry_hinton','italian','postcode', 'cb17dy']
tokens_to_ids = [50262, 3672, 50261, 50256, 50262, 21975, 50261, 50256, 50262, 20337, 50261, 35782, 50262, 19425, 50261, 50256, 50262, 4862, 50261, 50256, 50262, 50256, 50261, 22031, 50262, 50256, 50261, 50256]
ids_to_tokens = [PRED]name[SUB]<|endoftext|>[PRED]address[SUB]<|endoftext|>[PRED]area[SUB]south[PRED]food[SUB]<|endoftext|>[PRED]phone[SUB]<|endoftext|>[PRED]<|endoftext|>[SUB]expensive[PRED]<|endoftext|>[SUB]<|endoftext|>
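(For reference, the two lists above were produced roughly like this; the exact helper calls in my script may differ slightly:)
# Convert the pre-split tokens to ids and back again to see what the model will actually receive.
tokens_to_ids = tokenizer.convert_tokens_to_ids(tokenized_sequence)
ids_to_tokens = "".join(tokenizer.convert_ids_to_tokens(tokens_to_ids))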
As you can see, the important_tokens are all mapped to the id 50256 (that is, to <|endoftext|>), so the model fails to see and learn these important tokens and hence generates very poor and often hallucinated responses.
What could be a quick and efficient fix for this issue?
For the important_tokens which consist of several actual words (like frankie_and_bennys), you can replace the underscores with spaces and feed them in normally, or add them as special tokens. I prefer the first option, because this way you can use the pre-trained embeddings of their subtokens. For the ones which aren't actual words (like cb17dy), you must add them as special tokens.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
your_string = '[PRED] name [SUB] frankie and bennys frankie_and_bennys [PRED] cb17dy'
SPECIAL_TOKENS = {
    "bos_token": "<|endoftext|>",
    "eos_token": "<|endoftext|>",
    "pad_token": "[PAD]",
    "additional_special_tokens": ["[SYS]", "[USR]", "[KG]", "[SUB]", "[PRED]", "[OBJ]",
                                  "[TRIPLE]", "[SEP]", "[Q]", "[DOM]",
                                  "frankie_and_bennys", "cb17dy"],
}
tokenizer.add_special_tokens(SPECIAL_TOKENS)
print(tokenizer(your_string)['input_ids'])
print(tokenizer.convert_ids_to_tokens(tokenizer(your_string)['input_ids']))
The output:
[50262, 1438, 220, 50261, 14346, 494, 290, 275, 1697, 893, 220, 50268, 220, 50262, 220, 220, 50269]
['[PRED]', 'Ġname', 'Ġ', '[SUB]', 'Ġfrank', 'ie', 'Ġand', 'Ġb', 'enn', 'ys', 'Ġ', 'frankie_and_bennys', 'Ġ', '[PRED]', 'Ġ', 'Ġ', 'cb17dy']
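Whichever option you choose, adding tokens grows the vocabulary beyond the model's original embedding matrix, so remember to resize the model's embeddings as well (the model class below is just for illustration, use whichever GPT2 variant you actually load):
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
# Grow the embedding matrix so the newly added token ids (50257 and above)
# have embedding rows to look up during training.
model.resize_token_embeddings(len(tokenizer))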