c++ utf-8 ucs-4

Read UTF-8 file into UCS-4 string


I am trying to read a UTF-8 encoded file into a UTF-32 (UCS-4) string. Basically, I want a fixed-size character type inside the application.

Here I want to make sure the translation is done as part of the stream processing (because that is what the locale is supposed to be used for). Alternative questions have been posted that do the translation on the string, but this is wasteful: you have to do a translation phase in memory and then a second pass to send the result to the stream. By doing it with the locale in the stream you only have to do a single pass, and there is no requirement for a copy to be made (assuming you want to maintain the original).

This is what I tried.

#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    std::locale     converter(std::locale(), new std::codecvt_utf8<char32_t>);
    std::basic_ifstream<char32_t>   iFile;
    iFile.imbue(converter);
    iFile.open("test.data");

    std::u32string     line;
    while(std::getline(iFile, line))
    {
    }
}

Since these are all standard types, I was surprised by this compilation error:

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/istream:275:41:
error: no matching function for call to 'use_facet'

            const ctype<_CharT>& __ct = use_facet<ctype<_CharT> >(__is.getloc());
                                        ^~~~~~~~~~~~~~~~~~~~~~~~~

Compiled with:

g++ -std=c++14 test.cpp

Solution

  • It seems char32_t is not what I wanted. Simply moving to wchar_t worked for me. I suspect that this only works the way I want on Linux-like systems, and that on Windows, where wchar_t is 16 bits wide, this conversion will be to UTF-16 (UCS-2) instead (but I can't test that).

    #include <algorithm>
    #include <fstream>
    #include <locale>
    #include <codecvt>
    #include <string>
    
    int main()
    {
        std::locale           utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);
    
        // Input stream reads UTF-8 and converts to UTF-32 (UCS-4) String
        std::wifstream        iFile("test.data");
        iFile.imbue(utf8_to_utf32);
    
        // Output UTF-32 (UCS-4) string converts to UTF-8 stream
        std::wofstream        oFile("test.res");
        oFile.imbue(utf8_to_utf32);
    
    
        // Now just read like you would normally.
        std::wstring     line;
        while(std::getline(iFile, line))
        {
            // UTF-32 code points are fixed size,
            // so reversing is simple: just do it in place.
            std::reverse(std::begin(line), std::end(line));
    
            // Unicode text unfortunately also has grapheme clusters (groups of code points
            // that are displayed as a single glyph). The reverse above has also reversed the
            // code points inside each cluster, which is incorrect. A second pass is needed to
            // restore the order inside each cluster. This is beyond the scope of this question
            // and is left as an exercise (but I may come back to it later).
            oFile << line << "\n";
        }
    }
    

    A comment above suggested this would be slower than reading the data and then translating it afterwards. So I did some tests:

    // read1.cpp Translation in stream using codecvt and Locale

    #include <algorithm>
    #include <iostream>
    #include <fstream>
    #include <locale>
    #include <codecvt>
    #include <string>
    
    
    int main()
    {
        std::locale           utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);
    
        std::wifstream        iFile("test.data");
        iFile.imbue(utf8_to_utf32);
    
        std::wofstream        oFile("test.res1");
        oFile.imbue(utf8_to_utf32);
    
        std::wstring     line;
        while(std::getline(iFile, line))
        {
            std::reverse(std::begin(line), std::end(line));
            oFile << line << "\n";
        }
    }
    

    // read2.cpp Translation using codecvt after reading.

    #include <algorithm>
    #include <iostream>
    #include <fstream>
    #include <locale>
    #include <codecvt>
    #include <string>
    
    int main()
    {
        std::ifstream        iFile("test.data");
        std::ofstream        oFile("test.res2");
    
        std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_to_utf32;
    
        std::string     line;
        std::wstring    wideline;
        while(std::getline(iFile, line))
        {
            wideline = utf8_to_utf32.from_bytes(line);
            std::reverse(std::begin(wideline), std::end(wideline));
            oFile << utf8_to_utf32.to_bytes(wideline) << "\n";
        }
    }
    

    // read3.cpp Using UTF-8

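    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <fstream>
    
    // A byte starts a UTF-8 sequence unless it is a continuation byte (0x80..0xBF).
    static bool is_lead(std::uint8_t ch) { return ch < 0x80 || ch >= 0xc0; }
    
    /* Reverse a utf-8 string in-place */
    void reverse_utf8(std::string& s) {
      // Reverse all bytes; each multi-byte sequence is now backwards, lead byte last.
      std::reverse(s.begin(), s.end());
      for (auto p = s.begin(), end = s.end(); p != end; ) {
        auto q = p;
        p = std::find_if(p, end, is_lead);
        if (p != end) ++p;    // do not step past the end on malformed input
        std::reverse(q, p);   // restore the byte order of this one sequence
      }
    }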
    
    int main(int argc, char** argv)
    {
        std::ifstream        iFile("test.data");
        std::ofstream        oFile("test.res3");
    
        std::string     line;
        while(std::getline(iFile, line))
        {
            reverse_utf8(line);
            oFile << line << "\n";
        }
        return 0;
    }
    

    The test file was 58 MB of Japanese text encoded as UTF-8.

    > ls -lah test.data
    -rw-r--r--  1 loki  staff    58M Jan 28 11:28 test.data
    
    > g++ -O3 -std=c++14 read1.cpp -o a1
    > g++ -O3 -std=c++14 read2.cpp -o a2
    > g++ -O3 -std=c++14 read3.cpp -o a3
    >
    > # This is the one using Locale in stream
    > time ./a1
    
    real    0m0.645s
    user    0m0.521s
    sys     0m0.108s
    >
    > # This is the one doing translation after reading.
    > time ./a2
    
    real    0m1.058s
    user    0m0.916s
    sys     0m0.123s
    >
    > # This is the one using UTF-8
    > time ./a3
    
    real    0m0.785s
    user    0m0.663s
    sys     0m0.104s
    

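    Doing the translation in the stream is the fastest (0.645s, against 0.785s for the UTF-8 version and 1.058s for translating after reading), but not dramatically so (note that this was a lot of data). So choose the one that is easiest to read.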