hindi, machine-translation, moses

Statistical Machine Translation from Hindi to English using MOSES


I need to build a Hindi-to-English translation system using MOSES. I have a parallel corpus of about 10,000 Hindi sentences with their English translations. I followed the method described on the Baseline system creation page. But at the very first stage, when I tried to tokenise my Hindi corpus by running

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l hi < ~/corpus/training/hi-en.hi > ~/corpus/hi-en.tok.hi

the tokeniser gave me the following output:

Tokenizer Version 1.1
Language: hi
Number of threads: 1
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...

I even tried 'hin', but the language still wasn't recognised. Can anyone tell me the correct way to build the translation system?


Solution

  • Moses does not support Hindi for tokenization. tokenizer.perl relies on the nonbreaking_prefix.* files (see https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl#L516), and there is none for 'hi', which is why it warns and falls back to the English version.

    The languages for which Moses provides nonbreaking prefixes are listed at https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes


    However, all hope is not lost: you can tokenize your text with another tokenizer before training the machine translation model with Moses. Try searching for "Hindi tokenizers"; there are plenty of them around.
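
As a rough illustration of what such a stand-alone tokenizer could look like, here is a minimal Python sketch. It is not part of Moses and not any particular existing Hindi tokenizer; the names tokenize and tokenize_lines and the regular expression are assumptions made for this example. It only separates Devanagari words, Latin/ASCII-digit words, and individual punctuation marks (including the danda "।"), and it does not reproduce other things tokenizer.perl does, such as nonbreaking-prefix handling or escaping of special characters.

```python
import re

# Hypothetical token pattern (an assumption for this sketch, not Moses' rules):
#   1. a run of Devanagari characters, excluding the danda (U+0964) and
#      double danda (U+0965), but allowing ZWNJ/ZWJ inside a word;
#   2. a run of ASCII letters or digits;
#   3. any other single non-space character (punctuation, danda, symbols).
# An explicit Devanagari range is used instead of \w because \w in Python
# does not match combining vowel signs and would split words apart.
TOKEN_RE = re.compile(
    r"[\u0900-\u0963\u0966-\u097F\u200C\u200D]+"
    r"|[A-Za-z0-9]+"
    r"|\S"
)


def tokenize(text):
    """Return the list of tokens found in one line of text."""
    return TOKEN_RE.findall(text)


def tokenize_lines(lines):
    """Yield one space-separated tokenized string per input line."""
    for line in lines:
        yield " ".join(tokenize(line))
```

Each input line produces exactly one output line, so the sentence alignment with the English side of the corpus is preserved. The output of tokenize_lines could be written to ~/corpus/hi-en.tok.hi in place of the tokenizer.perl output, for example by iterating over sys.stdin in a small wrapper script.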