perl moses

Need to split Unicode string


I am using the Moses toolkit for my translation system. I trained it on an Assamese-English parallel corpus, but some proper nouns are not translated because the corpus (parallel data set) is very small. So I want to add a transliteration step to my translation system.

I am using this command for my translation: echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini

This gave me the output "কানাদা is a vast country".

This is because the word "কানাদা" is not in my parallel corpus.

So I took a parallel list of Assamese and English words and broke each word up character-wise. Each line of the two files thus contains a single word with a space between each character (or each syllable). I used these two files to train a second system as a normal translation task.

Then I used the following command: echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl

This gave me the output "ক া ন া দ া is a vast country"

I had to break up the word because I trained the transliteration system character-wise.

Then I passed the result to the transliteration system I had trained, using this command:

echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini

This gave me the output "c a n a d a is a vast country"

The characters are transliterated, but the only problem is the spaces inside the word. So I want to use a Perl script that joins the characters back into a word. My final command will be:

echo 'কানাদা এখন বিশাল দেশ ।'| ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini | ./join.pl

Please help me write this "join.pl" script.


Solution

  • How about:

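    use utf8;                             # the script itself contains UTF-8 literals
    use feature 'say';                    # enables say()
    binmode STDOUT, ':encoding(UTF-8)';   # print characters as UTF-8, no "Wide character" warning
    my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
    # Put a space after every Bengali-block character that is followed by another one.
    $str =~ s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
    say $str;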
    

    output:

    ভ া ৰ ত is a famous country. দ ি ল ্ ল ী is the capital of ভ া ৰ ত
    

    You can use it in your program; just change the while loop to:

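    binmode STDIN,  ':encoding(UTF-8)';   # decode input so the regex sees characters, not bytes
    binmode STDOUT, ':encoding(UTF-8)';   # (skip both lines if your script already does this)
    while (<>) {
        s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
        print;
    }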
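The question mentions splitting by "each character (or each syllable)". The substitution above splits by code point, so vowel signs end up as separate tokens. As a rough sketch of the other option, Perl's `\X` matches one grapheme cluster, which keeps a vowel sign attached to its consonant. The helper name `space_clusters` is made up for this illustration, and how conjuncts are grouped depends on the Unicode version your Perl ships with, so check the result on your own data before training on it.

```perl
use strict;
use warnings;
use utf8;                             # this file contains UTF-8 literals
use feature 'say';
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: put a space between the grapheme clusters of every
# run of Bengali-block characters; all other text is left untouched.
sub space_clusters {
    my ($text) = @_;
    $text =~ s{([\x{0980}-\x{09FF}]+)}{
        my $word = $1;                # copy before the inner match resets $1
        join ' ', $word =~ /\X/g;     # \X matches one grapheme cluster
    }ge;
    return $text;
}

say space_clusters("কানাদা is a vast country");
```

With this, "কানাদা" comes out as three tokens (consonant plus vowel sign each) instead of six. Whichever granularity you pick, use the same one for the training files and for `space.pl`.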
    

    But I think you wish to do:

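    use utf8;
    use feature 'say';
    binmode STDOUT, ':encoding(UTF-8)';
    # Illustrative mapping only; extend it with the real correspondences.
    my %corresp = (
        'ভ' => 'Bh',
        'া' => 'a',
        'ৰ' => 'ra',
        'ত' => 't',
    );
    my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
    # Replace each mapped character; leave unmapped ones unchanged.
    $str =~ s/([\x{0980}-\x{09FF}])/exists($corresp{$1}) ? $corresp{$1} : $1/eg;
    say $str;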
    

    Output:

    Bharat is a famous country. দিল্লী is the capital of Bharat
    

    NB: It's up to you to build the real correspondence hash; I don't know anything about Assamese characters.
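
For the `join.pl` step the question asks about, here is a minimal sketch, not a tested Moses component. It assumes the transliterated word arrives as a run of single-character tokens separated by spaces, as in "c a n a d a is a vast country". The helper name `join_single_chars` is invented for this example.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: merge each run of consecutive single-character word
# tokens into one word; every other token is passed through unchanged.
sub join_single_chars {
    my ($line) = @_;
    my (@out, @run);
    for my $tok (split ' ', $line) {
        if ($tok =~ /^\w\z/) {               # a lone letter, digit, or mark
            push @run, $tok;
            next;
        }
        push @out, join('', @run) if @run;   # flush the pending run
        @run = ();
        push @out, $tok;
    }
    push @out, join('', @run) if @run;       # flush a run at end of line
    return join ' ', @out;
}

say join_single_chars('c a n a d a is a vast country');   # canada is a vast country
```

To use it as a filter in the pipeline, replace the last line with a loop such as `while (my $line = <STDIN>) { print join_single_chars($line), "\n" }`, and set `binmode` on STDIN and STDOUT to `:encoding(UTF-8)` so that untransliterated Assamese characters are counted as characters rather than bytes. Note the limitation of this heuristic: a genuine one-letter word directly next to the spaced-out word (for example "c a n a d a a vast country") would be merged into it. Having `space.pl` wrap the split word in marker tokens would avoid that, at the cost of handling the markers in the transliteration model.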