nlp, tokenize, machine-translation

How to rejoin a word that was split by a tokenizer after machine translation?


Translation from Russian produces the following result. Is there an NLP function we can use to join the split tokens back into "Europe's" in the following string?

"Nitzchia Protector Todibo can go to one of Europe ' s top clubs"


Solution

  • Try a detokenizer. However, because its rules expect to change x 's -> x's but not x ' s -> x's, you may have to apply the detokenizer iteratively, e.g. using sacremoses:

    >>> from sacremoses import MosesDetokenizer
    >>> md = MosesDetokenizer(lang='en')
    
    >>> # First pass joins "' s" into "'s" but leaves the space before it.
    >>> md.detokenize("Nitzchia Protector Todibo can go to one of Europe ' s top clubs".split())
    "Nitzchia Protector Todibo can go to one of Europe 's top clubs"
    
    >>> # A second pass attaches the now-recognized "'s" to "Europe".
    >>> md.detokenize(md.detokenize("Nitzchia Protector Todibo can go to one of Europe ' s top clubs".split()).split())
    "Nitzchia Protector Todibo can go to one of Europe's top clubs"