Tags: c, unicode, utf-8

Easy way to read UTF-8 characters from a binary file?


Here is my problem: I have to read "binary" files, that is, files which have varying "record" sizes, and which may contain binary data, as well as UTF-8-encoded text fields.

Reading a given number of bytes from an input file is trivial, but I was wondering if there were functions to easily read a given number of characters (not bytes) from a file? For example, if I know I need to read a 10-character field encoded in UTF-8, it would be at least 10 bytes long, but could be up to 40 bytes if we're talking "high" codepoints.

I emphasize that I'm reading a "mixed" file, that is, I cannot process it whole as UTF-8, because the binary fields have to be read without being interpreted as UTF-8 characters.

So, while doing it by hand is pretty straightforward (the naïve byte-by-byte approach isn't hard to implement, even though I'm dubious about the efficiency), I'm wondering if there are better alternatives out there. Ideally something in the standard library, but I'm open to 3rd-party code too, provided my organization validates its use.
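
For reference, here is roughly what I mean by the naïve byte-by-byte approach: a sketch only, which takes each character's length from its lead byte and copies the continuation bytes without validating them (read_utf8_naive and its parameters are just names made up for illustration):

    #include <stdio.h>

    /* Read nchars UTF-8 characters into buf (which must hold at least
     * 4 * nchars bytes).  The length of each character is taken from its
     * lead byte; the sequence itself is not validated.  Returns the number
     * of bytes stored, or -1 on EOF or on an invalid lead byte. */
    long read_utf8_naive(FILE *ifp, char *buf, size_t nchars)
    {
        long len = 0;
        for (size_t i = 0; i < nchars; i++) {
            int c = getc(ifp);
            if (c == EOF)
                return -1;
            int seq = (c & 0x80) == 0x00 ? 1 :      /* 0xxxxxxx: ASCII    */
                      (c & 0xE0) == 0xC0 ? 2 :      /* 110xxxxx: 2 bytes  */
                      (c & 0xF0) == 0xE0 ? 3 :      /* 1110xxxx: 3 bytes  */
                      (c & 0xF8) == 0xF0 ? 4 : -1;  /* 11110xxx: 4 bytes  */
            if (seq < 0)
                return -1;                          /* not a valid lead byte */
            buf[len++] = (char)c;
            for (int j = 1; j < seq; j++) {         /* continuation bytes */
                c = getc(ifp);
                if (c == EOF)
                    return -1;
                buf[len++] = (char)c;
            }
        }
        return len;
    }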


Solution

  • Here are two possibilities:

    (1) If (but typically only if) your locale is set to handle UTF-8, the getwc function should read exactly one UTF-8-encoded Unicode character, even if it's multiple bytes long. So you could do something like

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    setlocale(LC_CTYPE, "");   /* pick up the UTF-8 locale from the environment */
    wint_t c;
    int i;

    for (i = 0; i < 10; i++) {
        c = getwc(ifp);        /* ifp: an already-open FILE *; one character per call */
        if (c == WEOF)
            break;             /* end of file or encoding error */
        /* do something with c */
    }
    

    Now, c here will be a single integer containing a Unicode codepoint, not a UTF-8 multibyte sequence. If (as is likely) you want to store UTF-8 strings in your in-memory data structure(s), you'd have to convert back to UTF-8, likely using wctomb.
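
    For example, a minimal sketch of that conversion back to UTF-8 (assuming c holds a character just read by getwc above, and the UTF-8 locale is still in effect):

    #include <limits.h>                     /* MB_LEN_MAX */
    #include <stdlib.h>                     /* wctomb */

    char mb[MB_LEN_MAX];
    int nbytes = wctomb(mb, (wchar_t)c);    /* re-encode the codepoint as UTF-8 */
    if (nbytes > 0) {
        /* mb[0] .. mb[nbytes-1] are the UTF-8 bytes; append them to your
           in-memory string */
    }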

    (2) You could read N bytes from the input, then convert them to a wide character string using mbstowcs. This isn't perfect, either, because it's hard to know what N should be, and the wide character string that mbstowcs gives you is, again, probably not what you want.
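
    A rough sketch of that approach, where N is just a guessed upper bound on the field's byte length (40 is an assumption, not part of any format), and which again assumes the UTF-8 locale is in effect:

    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    enum { N = 40 };                         /* assumed upper bound, in bytes */
    char    inbuf[N + 1];
    wchar_t wbuf[N + 1];

    size_t got = fread(inbuf, 1, N, ifp);
    inbuf[got] = '\0';                       /* mbstowcs wants a terminated string */
    size_t nwide = mbstowcs(wbuf, inbuf, N + 1);
    if (nwide == (size_t)-1) {
        /* invalid (or truncated mid-character) UTF-8 in inbuf */
    }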

    But before exploring either of these approaches, the question really is, what is the format of your input? Those UTF-8-encoded fragments of text, are they fixed-size, or does the file format contain an explicit count saying how big they are? And in either case, is their size specified in bytes, or in characters? Hopefully it's specified in bytes, in which case you don't need to do any conversion to/from UTF-8; you can just read N bytes using fread. If the count is specified in terms of characters (which would be kind of weird, in my experience), you would probably have to use something like my approach (1) above.
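
    In the bytes case it's just a plain read, something like this (fieldlen being the byte count taken from the record, and MAXFIELD a hypothetical upper bound on field size):

    char field[MAXFIELD + 1];
    if (fread(field, 1, fieldlen, ifp) != fieldlen) {
        /* short read: end of file or error */
    }
    field[fieldlen] = '\0';   /* the field is text, so it's safe to terminate it */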

    Other than a loop like in (1) above, I don't know of a simple, encapsulated way to do the equivalent of "read N UTF-8 characters, no matter how many bytes it takes".
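
    That said, such a helper isn't hard to wrap around the loop from (1). A sketch (read_n_utf8 is just an illustrative name; it assumes the UTF-8 locale has been set as above, and that buf has room for at least 4 * nchars + 1 bytes):

    #include <stdio.h>
    #include <stdlib.h>                               /* wctomb */
    #include <wchar.h>                                /* getwc, WEOF, wint_t */

    /* Read nchars UTF-8 characters from fp and store them, still UTF-8-encoded
     * and NUL-terminated, in buf.  Returns the number of bytes stored, or -1
     * on end of file or encoding error. */
    long read_n_utf8(FILE *fp, char *buf, size_t nchars)
    {
        long len = 0;
        for (size_t i = 0; i < nchars; i++) {
            wint_t c = getwc(fp);                     /* one decoded character */
            if (c == WEOF)
                return -1;
            int nb = wctomb(buf + len, (wchar_t)c);   /* back to UTF-8 bytes */
            if (nb < 0)
                return -1;
            len += nb;
        }
        buf[len] = '\0';
        return len;
    }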