[SOLVED] How to skip tokenization and translation of custom glossary in huggingface NMT models?

I am using mBART50 and opus-mt-en-de from huggingface for bilingual translation. We have a custom organization-specific glossary containing ~10,000 English terms (n-grams with n=1-5) and their specific German translations. I'd like the model to skip translating an English substring if that substring is found in the dictionary.

That is, suppose my dictionary has the key "custom string" with the corresponding value "desired string". If the model detects "custom string" inside "longer sentence containing custom string etc.", then instead of translating the whole sentence to "Längerer Satz mit benutzerdefinierter Zeichenfolge usw", it should skip translating "custom string", insert the corresponding value "desired string" from the dictionary, and leave that inserted value unchanged.

Here's the code I am using for translating:

from transformers import AutoTokenizer, MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, src_lang="en_XX")
# padding/truncation are arguments of the tokenizer call, not of from_pretrained
inputs = tokenizer(src_texts, return_tensors="pt", padding=True, truncation=True)
# force the German language code as the first generated token
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("de_DE"))
translated = tokenizer.batch_decode(generated, skip_special_tokens=True)

I am thinking that perhaps I could do a regex-based lookup of n-grams in each input sentence, replace the matches that are present as keys in my dictionary with their values, and wrap them in some special token like 'UNK' (unknown) to prevent the model from changing the inserted value. Does that sound like a reasonable thing to do, or is there a better approach (except fine-tuning* the model)? If so, how can I accomplish this?

*For the first phase, I cannot invest resources in fine-tuning these models on a custom parallel corpus. That's why I am looking for a simple key-value replacement.

Solution

Constraining beam search (or sampling from a generative model) is difficult because even when you know which string you want in the target sentence, you do not know at what position it should appear. Depending on the language, several inflected forms of the term may also be possible, so you want to allow all of them in the output.
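Before anything can be constrained, you need to know which glossary entries occur in a given source sentence and which target forms are acceptable for each. A minimal sketch of that lookup follows; the function name `find_glossary_terms` and the glossary layout (English term mapped to a list of allowed German forms) are assumptions made for illustration, and the word-boundary matching assumes every key starts and ends with a word character.

```python
import re


def find_glossary_terms(sentence, glossary):
    """Return (source_term, allowed_target_forms) pairs found in `sentence`.

    `glossary` maps a source term to a list of acceptable target forms
    (hypothetical layout; several forms allow for inflection).
    """
    if not glossary:
        return []
    # Longest keys first: regex alternation picks the first alternative that
    # matches at a position, so "custom string" wins over "string".
    keys = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    # Map lower-cased match text back to the original glossary key.
    by_lower = {k.lower(): k for k in glossary}
    found = []
    for match in pattern.finditer(sentence):
        key = by_lower[match.group(0).lower()]
        found.append((key, glossary[key]))
    return found


glossary = {
    "custom string": ["desired string"],
    "string": ["Zeichenkette", "Zeichenketten"],
}
hits = find_glossary_terms("longer sentence containing custom string etc.", glossary)
print(hits)
```

Matches do not overlap, so the shorter key "string" is not reported again inside "custom string". The lists of allowed forms are what you would then hand to the decoder as constraints.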

Hugging Face Transformers has tools for enforcing particular phrases during beam search and for combining them into disjunctive constraints, which should be all you need. However, this is limited to fixed sequences of tokens, and decoding might be pretty slow. The PhrasalConstraint class in particular should be useful in this case.
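As a rough sketch of how this could be wired up: each glossary term becomes a PhrasalConstraint when it has a single allowed form, or a DisjunctiveConstraint when several inflected forms are acceptable. The helper names `encode_forms` and `translate_with_terms` are made up for illustration. The sketch assumes a Transformers version whose `generate` still accepts the `constraints` argument (newer releases may have deprecated or moved constrained beam search) and a tokenizer that supports `text_target`.

```python
def encode_forms(tokenizer, forms):
    """Tokenize each allowed target-side form without special tokens."""
    # text_target selects the target-side vocabulary, which matters for
    # tokenizers with separate source and target models (e.g. Marian).
    return tokenizer(text_target=list(forms), add_special_tokens=False)["input_ids"]


def translate_with_terms(model, tokenizer, text, term_forms, num_beams=5, **generate_kwargs):
    """Translate `text`, forcing one form from each group in `term_forms`.

    `term_forms` is a list of lists of target strings, one inner list per term.
    """
    # Imported here so encode_forms can be used without transformers installed.
    from transformers import DisjunctiveConstraint, PhrasalConstraint

    constraints = []
    for forms in term_forms:
        ids = encode_forms(tokenizer, forms)
        if len(ids) == 1:
            # Exactly one fixed token sequence must appear in the output.
            constraints.append(PhrasalConstraint(ids[0]))
        else:
            # Any one of the alternative sequences satisfies the constraint.
            constraints.append(DisjunctiveConstraint(ids))

    inputs = tokenizer(text, return_tensors="pt")
    # Constrained decoding only works with beam search (num_beams > 1).
    output = model.generate(
        **inputs, constraints=constraints, num_beams=num_beams, **generate_kwargs
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With mBART50 you would load the tokenizer with src_lang="en_XX" and call something like translate_with_terms(model, tokenizer, "longer sentence containing custom string etc.", [["desired string"]], forced_bos_token_id=tokenizer.convert_tokens_to_ids("de_DE")). Note that a constraint only guarantees the phrase appears somewhere in the output; it does not by itself stop the model from also translating the source term in its own way.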