perl utf-8 encode cp1251

The proper way to detect an encoding in Perl


I've got these two strings:

%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3
%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83

These are the same URL-encoded Russian phrase, in cp1251 and UTF-8 respectively. I want to display both in Russian in my UTF-8 terminal using Perl. Unfortunately, after URL-decoding, the Perl module Encode::Detect fails to detect cp1251 for the first example; it reports "x-euc-tw" instead.

The question is, what is the proper way of detecting the right encoding in this case (specifying locale parameters, using other modules...)?


Solution

  • Are UTF-8 and cp1251 the only two options? If so, the odds of cp1251 text also being valid UTF-8 are extremely small (such text would be gibberish), so you can try strict UTF-8 first and fall back to cp1251:

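    use Encode qw( decode );
    # Try strict UTF-8 first; if the bytes are not valid UTF-8, fall back to cp1251.
    # LEAVE_SRC keeps decode() from modifying $encoded in place.
    my $decoded = eval { decode('UTF-8', $encoded, Encode::FB_CROAK | Encode::LEAVE_SRC) }
        // decode('cp1251', $encoded);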
    

    This will be far more accurate than an encoding guesser.
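
For completeness, here is a minimal end-to-end sketch that applies the idea to the two strings from the question. The helper names `url_decode` and `decode_utf8_or_cp1251` are made up for illustration; the percent-decoding is hand-rolled to avoid non-core dependencies (URI::Escape would also work), and the script assumes the only candidate encodings are UTF-8 and cp1251 and that the terminal expects UTF-8.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: percent-decode a query-string value into raw bytes.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                              # '+' stands for a space in form data
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX -> single byte
    return $s;
}

# Hypothetical helper: strict UTF-8 first, cp1251 as the fallback.
sub decode_utf8_or_cp1251 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8; eval turns that into undef.
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    return $text // decode('cp1251', $bytes);
}

binmode(STDOUT, ':encoding(UTF-8)');  # assumption: the terminal expects UTF-8

for my $raw (
    '%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3',
    '%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83',
) {
    print decode_utf8_or_cp1251(url_decode($raw)), "\n";
}
```

Both iterations should print the same phrase, мама мыла раму. The first string starts with the bytes EC E0, which is not a valid UTF-8 sequence, so the strict decode dies and the cp1251 fallback is used.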