perl utf-8 scalar bytestream

In Perl, how to create a "mixed-encoding" string (or a raw sequence of bytes) in a scalar?


In a Perl script of mine, I have to write a mix of UTF-8 and raw bytes into files.

I have a big string in which everything is encoded as UTF-8. In that "source" string, UTF-8 characters are stored as they should be (that is, as valid UTF-8 byte sequences), while each "raw byte" has been stored as if it were a code point with the same value as the byte. So, in the source string, a "raw" byte of 0x50 is stored as a single 0x50 byte, whereas a "raw" byte of 0xff is stored as the valid two-byte UTF-8 sequence 0xc3 0xbf. When I write these "raw" bytes back, I need to convert them back to single-byte form.

I have other data structures allowing me to know which parts of the string represent what kind of data. A list of fields, types, lengths, etc.

When writing in a plain file, I write each field in turn, either directly (if it's UTF-8) or by encoding its value to ISO-8859-1 if it's meant to be raw bytes. It works perfectly.

Now, in some cases, I need to write the value not directly to a file, but as a record of a BerkeleyDB (Btree, but that's mostly irrelevant) database. To do that, I need to write ALL the values that compose my record, in a single write operation. Which means that I need to have a scalar that holds a mix of UTF-8 and raw bytes.


Example:

Input Scalar (all hex values): 61 C3 8B 00 C3 BF

Expected Output Format: 2 UTF-8 characters, then 2 raw bytes.

Expected Output: 61 C3 8B 00 FF


At first, I started from an empty string and concatenated the same values I was writing to my file. I then tried writing that string to a "standard" file without adding any encoding. I got '?' characters instead of all my raw bytes over 0x7f (because, obviously, Perl decided to consider my string to be UTF-8).


Then, to try and tell Perl that it was already encoded, and to "please not try to be smart about it", I tried to encode the UTF-8 parts into "UTF-8", encode the binary parts into "ISO-8859-1", and concatenate everything. Then I wrote it. This time, the bytes looked perfect, but the parts which were already UTF-8 had been "double-encoded", that is, each byte of a multi-byte character had been seen as its codepoint...
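
The double encoding described above can be reproduced in isolation. A minimal sketch, assuming the "UTF-8 part" is still a string of undecoded bytes (one character per byte) at the moment it gets encoded again:

```perl
use strict;
use warnings;

# Already-encoded UTF-8 for U+00CB: two bytes, which Perl sees as two characters.
my $already_encoded = "\xC3\x8B";

# Encoding again treats each byte as a code point (U+00C3 and U+008B)
# and expands each one into its own two-byte sequence.
my $double = $already_encoded;
utf8::encode($double);

print unpack('H*', $double), "\n";   # prints c383c28b
```

Had the part been decoded first (a single character U+00CB), encoding it would have given back the original C3 8B.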


I thought Perl wasn't supposed to re-encode "internal" UTF-8 into "encoded" UTF-8, if it was internally marked as UTF-8. The string holding all the values in UTF-8 comes from a C API, which sets the UTF-8 marker (or is supposed to, at the very least), to let Perl know it is already decoded.

Any idea what I missed there?

Is there a way to tell Perl what I want to do is just put a bunch of bytes one after another, and to please not try to interpret them in any way? The file I write to is opened as ">:raw" for that very reason, but I guess I need a way to specify that a given scalar is "raw" too?



Epilogue: I found the cause of the problem. The $bigInputString was supposed to be composed entirely of UTF-8 encoded data, but it contained raw bytes with big values because of a bug in the C code (turns out a "char" (not "unsigned char") is best tested with bitwise operators, instead of a " > 127"... ahem). So, in the C API, "big" bytes weren't being split into two-byte UTF-8 sequences.

Which means the $bigInputString, created from the bad C data, didn't have the expected contents, and Perl rightfully didn't like it either.

After I corrected the bug, the string correctly encoded to UTF-8 (for the parts I wanted to keep as UTF-8) or LATIN-1 (for the "raw bytes" I wanted to convert back), and I got no further problems.

Sorry for wasting your time, guys. But I still learned some things, so I'll keep this here. Moral of the story: Devel::Peek is GOOD for debugging (thanks ikegami), and one should always double-check instead of assuming. Granted, I was in a hurry on Friday, but the fault is still mine.
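
For anyone debugging something similar, here is a minimal sketch of that kind of inspection, using the example bytes from above. `hex_chars` is a helper made up for this illustration, and it shows the string's characters rather than its internal buffer. `Dump` writes to STDERR and reports, among other things, whether the UTF8 flag is set and what the internal buffer holds.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

# Hypothetical helper: hex value of each character in the string.
sub hex_chars {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $s;
}

my $bytes = "\x61\xC3\x8B\x00\xC3\xBF";          # still encoded: 6 bytes
my $chars = $bytes;
utf8::decode($chars) or die "not valid UTF-8";   # decoded: 4 characters

print hex_chars($bytes), "\n";   # prints 61 C3 8B 00 C3 BF
print hex_chars($chars), "\n";   # prints 61 CB 00 FF

Dump($bytes);   # internal view of the byte string
Dump($chars);   # internal view of the decoded string
```

If the string coming from the C API does not look like one of these two shapes, the problem is upstream of the Perl code.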

So, thanks to everyone who helped, or tried to, and special thanks to ikegami (again), who used quite a bit of his time helping me.


Solution

  • So you have

    my $in = "\x61\xC3\x8B\x00\xC3\xBF";
    

    and you want

    my $out = "\x61\xC3\x8B\x00\xFF";
    

    This is the result of decoding only some parts of the input string, so you want something like the following:

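    # Decode a UTF-8 byte string into characters; die if it is not valid UTF-8.
    sub decode_utf8 { my ($s) = @_; utf8::decode($s) or die("Invalid Input"); $s }
    
    my $out = join "",
                   substr($in, 0, 3),    # 61 C3 8B: keep the UTF-8 bytes as-is
       decode_utf8(substr($in, 3, 1)),   # 00 -> 00
       decode_utf8(substr($in, 4, 2));   # C3 BF -> FF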
    

    Tested.

    Alternatively, you could decode the entire thing and re-encode the parts that should be encoded.

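    # Encode a string of characters into UTF-8 bytes.
    sub encode_utf8 { my ($s) = @_; utf8::encode($s); $s }
    
    # After decoding, $in holds 4 characters: 61 CB 00 FF.
    utf8::decode($in) or die("Invalid Input");
    my $out = join "",
       encode_utf8(substr($in, 0, 2)),   # 61 CB -> 61 C3 8B
                   substr($in, 2, 1),    # 00 stays a single byte
                   substr($in, 3, 1);    # FF stays a single byte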
    

    Tested.

    You have not indicated how you know which parts to decode and which to leave alone, but you said you have this information.
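
The second approach can be generalised to a list of fields. This is only a sketch under assumptions: the layout format (`[ kind, length ]` pairs, with lengths counted in characters of the decoded string) and the name `build_record` are invented here, since the question does not show how the field information is actually stored.

```perl
use strict;
use warnings;

# Hypothetical field layout: [ kind, length in characters of the decoded string ].
my @fields = ( [ utf8 => 2 ], [ raw => 2 ] );

sub build_record {
    my ($encoded, @layout) = @_;

    # Decode the whole input once; every field is then a run of characters.
    my $chars = $encoded;
    utf8::decode($chars) or die "Invalid UTF-8 input";

    my ($pos, $record) = (0, '');
    for my $field (@layout) {
        my ($kind, $len) = @$field;
        my $part = substr($chars, $pos, $len);
        $pos += $len;

        if ($kind eq 'utf8') {
            utf8::encode($part);          # characters -> UTF-8 bytes
        }
        else {
            # Raw field: each character must fit in a single byte.
            utf8::downgrade($part, 1)
                or die "raw field contains a code point above 0xFF";
        }
        $record .= $part;
    }
    return $record;
}

my $out = build_record("\x61\xC3\x8B\x00\xC3\xBF", @fields);
print unpack('H*', $out), "\n";   # prints 61c38b00ff
```

The resulting scalar holds only characters below 0x100, so it can be handed to a ">:raw" filehandle or stored as a single database value without further encoding.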