
Trouble in Perl using split() function on tab chars when string contains non-Latin characters


I'm working on modifying a Perl script that reads in a series of UCS-2LE encoded files with strings in a tab-delimited format, but I am having trouble splitting the strings on the tab character when the string contains characters outside of the extended Latin character set.

Here is a sample line that I'm reading in from these files (tab-delimited):

adını   transcript  asr turkish

When I had my script write these lines to the output file to try and debug this issue, this is what it's writing:

ad1Ů1ĉtranscript    asr turkish

It appears that it's not recognizing the tab character after the Turkish character. This only happens when the word ends with a non-Latin character (and so is adjacent to the tab).

Here is a part of the code block where the writing to the output file happens and string-splitting happens:

for my $infile (@ARGV){  
    if (!open (INFILE, "<$infile")){
        die "Couldn't open $infile.\n";
    }    

binmode (OUTFILE, ":utf8");

while (<INFILE>) {
    chomp;
    $tTot++;

    if ($lineNo == 1) {                
        $_ = decode('UCS-2LE', $_);      
    }
    else {
        $_ = decode('UCS-2', $_);
    }    

    $_ =~ s/[\r\n]+//g;    
    my @foo = split('\t');

    my $orth = $foo[0];
    my $tscrpt = $foo[1];
    my $langCode = $foo[3];

    if (exists $codeHash{$langCode}) {
      unless ($tscrpt eq '') {
        check($orth, $tscrpt, $langCode);
      }
    }
    else {
        print OUTFILE "Unknown language code $langCode at line $lineNo.\n";
        print OUTFILE $_; # printing the string that's not being split correctly
        print OUTFILE "\n";
        $tBad++;
    }
  }

The purpose of this script is to check that, for each line in the input file, the language code is valid, and, based on that code, check whether the transcription for each word is "legal" according to our transcription system.

Here's what I've tried so far:

  1. Changing the encoding of the input strings as they're read in to UTF-8, UTF-16 or UTF-16LE
  2. Changing the split() character to '\w', /[[:blank:]]/, \p{Blank}, \x{09}, and \N{U+0009}.
  3. Reading Perl Unicode & perlrebackslash documentation and any other remotely relevant posts I've been able to find on various sites

Does anyone have any suggestions as to other things I might try? Thanks in advance!

I should also mention that I have no control over the input file encoding nor the output file encoding; I have to read in UCS-2LE and output UTF-8.


Solution

  • Thanks to everyone's comments and some further research, I solved the problem, and the cause was slightly different from what I had assumed: it was a combination of a split() issue and an encoding issue. I had to do two things: specify the encoding in an explicit open statement instead of decoding each line after it had been read, and skip the first two bytes of the file (presumably the byte order mark).

    Here's what the corrected, working code looks like for the section I posted in my question:

    for my $infile (@ARGV){
        my $outfile = $infile . '.out';
    
        # SOLUTION part 1: added explicit open statement
        open (INFILE, "<:raw:encoding(UCS-2le):crlf", $infile) or die "Error opening $infile: $!";
    
        # SOLUTION part 2: had to skip the first two bytes of the file 
        seek INFILE, 2, 0;
    
        if (!open (OUTFILE, ">$outfile")) {
            die "Couldn't write to $outfile.\n";
        }
    
        binmode (OUTFILE, ":utf8");
        print OUTFILE "Line#\tOriginal_Entry\tLangCode\tOffending_Char(s)\n";
    
        $tBad = 0;
        $tTot = 0;
        $lineNo = 1;
    
        while (<INFILE>) {
            chomp;
            $tTot++;
    
            # SOLUTION part 3: deleted the "if" block I had here before that was handling encoding
    
            # Rest of code in the original block is the same
        }
    }
    

    My code now properly recognizes tab characters adjacent to characters not part of the extended Latin set, and splits on tabs as it should.
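
    For anyone wondering why the original version failed, the garbled output in the question is consistent with the bytes being decoded one position out of step. Without an encoding layer, <INFILE> ends each line at the single byte 0x0A, so the second byte of the two-byte newline (0x00) is left at the start of the next line, and decoding that line as big-endian UCS-2 pairs every byte with the wrong neighbour. The sketch below rebuilds that situation by hand; the sample string and the variable names are only illustrative and are not taken from the real script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The sample line from the question ("adını<TAB>transcript"), as UCS-2LE bytes.
my $line_le = encode('UCS-2LE', "ad\x{131}n\x{131}\ttranscript");

# Simulate a byte-oriented read: the stray 0x00 from the previous line's
# newline (0A 00) is prepended, and the last byte is dropped so that the
# byte count stays even for this demonstration.
my $shifted = "\x00" . substr($line_le, 0, -1);

# Decoding the shifted bytes as big-endian, as the "else" branch did.
my $garbled = decode('UCS-2BE', $shifted);
my @fields  = split /\t/, $garbled;

binmode STDOUT, ':encoding(UTF-8)';
print "$garbled\n";                        # ad1Ů1ĉtranscript
print scalar(@fields), " field(s)\n";      # 1 field(s)
```

    The tab (09 00) is paired with the second byte of the preceding ı (31 01), which produces U+0109 (ĉ), so there is no tab character left for split() to find. ASCII-only words survive because their high byte is 00 either way.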

    NOTE: Another solution would have been to enclose the foreign words in double quotes, but, in our case, we couldn't guarantee that our input files would be formatted that way.

    Thanks to everyone who commented and helped me out!
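
    As a self-contained variation on the fix above, here is a minimal sketch that uses lexical filehandles and strips the byte order mark after decoding instead of seeking past it. It assumes the input starts with a BOM and uses CRLF line endings. The temporary file, the sample row, and the names $in and @rows are made up for the example and are not part of my actual script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file: BOM plus one CRLF-terminated, tab-delimited row.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw:encoding(UCS-2LE)';
print {$tmp_fh} "\x{FEFF}ad\x{131}n\x{131}\ttranscript\tasr\tturkish\r\n";
close $tmp_fh or die "Error closing $tmp_name: $!";

# Decode while reading, so lines and tabs are found in characters, not bytes.
open my $in, '<:raw:encoding(UCS-2LE):crlf', $tmp_name
    or die "Error opening $tmp_name: $!";

my @rows;
while (my $line = <$in>) {
    $line =~ s/^\x{FEFF}// if $. == 1;   # drop the BOM on the first line only
    $line =~ s/\r?\n\z//;                # remove the line ending, CR included
    push @rows, [ split /\t/, $line, -1 ];
}
close $in;

binmode STDOUT, ':encoding(UTF-8)';
print join('|', @{ $rows[0] }), "\n";    # adını|transcript|asr|turkish
```

    Stripping U+FEFF after decoding also copes with a file that has no BOM, whereas seeking two bytes in would cut off the first character in that case.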