I need to read a text file that contains strings in arbitrary MBCS encodings. The (simplified) file format looks like this:
CODEPAGE "STRING"
CODEPAGE STRING
...
where CODEPAGE can be any MBCS codepage: UTF-8, cp1251 (Cyrillic), cp932 (Japanese), etc.
I can't decode the whole file with a single call to MultiByteToWideChar. I need to extract each string (the text between quotes, or up to the next space or carriage return) and call MultiByteToWideChar on the extracted string.
But in an MBCS (multi-byte character set) one character can be represented by more than one byte. If I want to find a Latin 'A' in a multi-byte encoded file, I can't just search for code 65, because 65 can be a trailing byte in some encoded sequence.
So I'm not sure whether I'm allowed to search for '"', space or CR in an MBCS string. I browsed several codepages (for example the Chinese 936 codepage: https://ssl.icu-project.org/icu-bin/convexp?conv=windows-936-2000&s=ALL) and as far as I can see all trailing bytes start at 0x40, so it would be safe to scan the file for punctuation characters. But is that guaranteed for every codepage?
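The concern about code 65 is real and easy to reproduce. A minimal Python sketch, using Python's built-in cp932 codec as a stand-in for the Windows codepage (the helper name `contains_byte` is made up for this illustration):

```python
# Encode two Japanese characters with Python's cp932 codec and inspect
# the raw bytes. Both are two-byte sequences with an ASCII-range trail byte.
katakana_a = "ア".encode("cp932")   # lead byte 0x83, trail byte 0x41 ('A')
katakana_so = "ソ".encode("cp932")  # lead byte 0x83, trail byte 0x5C ('\')


def contains_byte(data: bytes, value: int) -> bool:
    """Naive byte search, the kind of scan the question is worried about."""
    return value in data


print(katakana_a.hex(), contains_byte(katakana_a, ord("A")))     # 8341 True
print(katakana_so.hex(), contains_byte(katakana_so, ord("\\")))  # 835c True

# Bytes below 0x40, such as '"', space and CR, do not show up here.
for delimiter in ('"', " ", "\r"):
    print(repr(delimiter), contains_byte(katakana_a + katakana_so, ord(delimiter)))
```

So a naive search for 'A' or '\' would report false matches inside Japanese text, while the delimiters in question do not appear in these two sequences. Two characters prove nothing about a whole codepage, of course.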
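You can check this by analysing which octets occur in encoded octet sequences after discarding the leading one. For the code pages listed below the result is 0x40..0x7E and 0x80..0xFE, so '"' (0x22), space (0x20), CR (0x0D) and LF (0x0A) never occur as trailing octets in any of them.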
#!/usr/bin/env perl
use Encode qw(encode);
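# Code pages to survey.
my @encodings = qw(
    cp1006 cp1026 cp1047 cp1250 cp1251 cp1252 cp1253 cp1254 cp1255 cp1256
    cp1257 cp1258 cp37 cp424 cp437 cp500 cp737 cp775 cp850 cp852 cp855 cp856
    cp857 cp858 cp860 cp861 cp862 cp863 cp864 cp865 cp866 cp869 cp874 cp875
    cp932 cp936 cp949 cp950
);

# Octets seen in any position after the first one of an encoded sequence.
my %continuation_octets;
for my $e (@encodings) {
    # Try every Unicode code point.
    for my $c (0..0x10_ffff) {
        # The callback yields -1 for characters the encoding cannot represent.
        my $encoded = encode $e, chr($c), sub { -1 };
        # Only multi-octet results are of interest.
        if ($encoded ne -1 && length($encoded) > 1) {
            my @octets = split //, $encoded;
            shift @octets;    # discard the leading octet
            $continuation_octets{$_}++ for @octets;
        }
    }
}

# List the distinct continuation octets in ascending order.
print join(', ', map { sprintf('0x%02X', ord($_)) } sort keys %continuation_octets), "\n";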
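With that result, scanning the raw bytes for the delimiters before decoding is workable. A rough Python sketch of the idea, not a definitive implementation: it assumes every codepage in the file is one of those surveyed above (or UTF-8, whose continuation bytes are 0x80..0xBF), that each label is a valid Python codec name, and it uses `bytes.decode` where the question would call MultiByteToWideChar. The function name `parse_records` is hypothetical.

```python
import re
from typing import List, Tuple


def parse_records(data: bytes) -> List[Tuple[str, str]]:
    """Split raw bytes into (codepage, decoded string) pairs.

    All delimiters (space, '"', CR, LF) are below 0x40, so under the
    stated assumption they cannot be part of a multi-byte sequence.
    """
    records = []
    for line in re.split(rb"[\r\n]+", data):
        if not line:
            continue
        # The codepage label is plain ASCII and ends at the first space.
        label, _, rest = line.partition(b" ")
        if rest.startswith(b'"'):
            # Quoted form: take everything up to the closing quote.
            raw = rest[1:].split(b'"', 1)[0]
        else:
            # Unquoted form: take everything up to the next space.
            raw = rest.split(b" ", 1)[0]
        codepage = label.decode("ascii")
        # Decode only the extracted bytes, using that line's codepage.
        records.append((codepage, raw.decode(codepage)))
    return records


sample = (
    b'cp1251 "' + "Привет мир".encode("cp1251") + b'"\r\n'
    + b"cp932 " + "ソフト".encode("cp932") + b"\r\n"
    + b'utf-8 "' + "héllo".encode("utf-8") + b'"\n'
)
for codepage, text in parse_records(sample):
    print(codepage, text)
```

Note that the cp932 line contains the byte 0x5C as a trailing byte, so the same scan would not be safe for delimiters in the 0x40..0x7E range such as a backslash.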