c++ character-encoding locale ifstream wifstream

How to handle multiple locales for ifstream, cout, etc., in C++


I am trying to read and process multiple files that are in different encodings, and I am supposed to use only the STL for this. Suppose that we have ISO-8859-15 and UTF-8 files.

In this SO answer it states:

In a nutshell the more interesting part for you:

  1. std::stream (stringstream, fstream, cin, cout) has an inner locale object, which matches the value of the global C++ locale at the moment the stream object is created. As std::cin is created long before your code in main is called, it most probably has the classic C locale, no matter what you do afterwards.
  2. You can make sure that a std::stream object has the desired locale by invoking std::stream::imbue(std::locale(your_favorite_locale)).
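
The two points above can be illustrated with a minimal sketch that does not depend on any locale being installed on the system. CommaDecimal and demoStreamLocales are hypothetical names introduced only for this illustration; the custom facet stands in for a real named locale such as de_DE.UTF-8.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet for illustration only: prints ',' as the decimal point.
struct CommaDecimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Returns the text produced by: a stream created before the global locale
// change, a stream created after it, and the first stream after imbue().
std::vector<std::string> demoStreamLocales() {
    // Start from a known state and remember the caller's global locale.
    const std::locale saved = std::locale::global(std::locale::classic());
    const std::locale commaLocale(std::locale::classic(), new CommaDecimal);

    std::ostringstream early;          // copies the global (classic) locale now
    std::locale::global(commaLocale);  // does not affect streams that already exist
    std::ostringstream late;           // copies the new global locale

    early << 1.5;
    late << 1.5;
    std::vector<std::string> out{early.str(), late.str()};

    early.str("");
    early.imbue(commaLocale);          // explicitly switch the existing stream
    early << 1.5;
    out.push_back(early.str());

    std::locale::global(saved);        // restore the caller's global locale
    return out;
}
```

Calling demoStreamLocales() should show that changing the global locale only affects streams constructed afterwards, while imbue changes a stream that already exists.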

The problem is that, of the two kinds of files, only those that match the locale created first are processed correctly. For example, if locale_DE_ISO885915 precedes locale_DE_UTF8, then the UTF-8 files are not appended correctly to the string s, and when I print them with cout I see only a couple of lines from the file.

void processFiles() {
    //setup locales for file decoding
    std::locale locale_DE_ISO885915("de_DE.iso885915@euro");
    std::locale locale_DE_UTF8("de_DE.UTF-8");
    //std::locale::global(locale_DE_ISO885915);
    //std::cout.imbue(std::locale());
    const std::ctype<wchar_t>& facet_DE_ISO885915 = std::use_facet<std::ctype<wchar_t>>(locale_DE_ISO885915);
    //std::locale::global(locale_DE_UTF8);
    //std::cout.imbue(std::locale());
    const std::ctype<wchar_t>& facet_DE_UTF8 = std::use_facet<std::ctype<wchar_t>>(locale_DE_UTF8);

    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::string currFile, fileStr;
    std::wifstream inFile;
    std::wstring s;

    for (std::vector<std::string>::const_iterator fci = files.begin(); fci != files.end(); ++fci) {
        currFile = *fci;

        //check file and set locale
        if (currFile.find("-8.txt") != std::string::npos) {
            std::locale::global(locale_DE_ISO885915);
            std::cout.imbue(locale_DE_ISO885915);
        }
        else {
            std::locale::global(locale_DE_UTF8);
            std::cout.imbue(locale_DE_UTF8);
        }

        inFile.open(path + currFile, std::ios_base::binary);
        if (!inFile) {
            //TODO specific file report
            std::cerr << "Failed to open file " << *fci << std::endl;
            exit(1);
        }

        s.clear();
        //read file content
        std::wstring line;
        while( (inFile.good()) && std::getline(inFile, line) ) {
            s.append(line + L"\n");
        }
        inFile.close();

        //remove punctuation, numbers, tolower...
        for (unsigned int i = 0; i < s.length(); ++i) {
            if (ispunct(s[i]) || isdigit(s[i]))
                s[i] = L' ';
        }

        if (currFile.find("-8.txt") != std::string::npos) {
            facet_DE_ISO885915.tolower(&s[0], &s[0] + s.size());
        }
        else {
            facet_DE_UTF8.tolower(&s[0], &s[0] + s.size());
        }
        fileStr = converter.to_bytes(s);


        std::cout << fileStr << std::endl;
        std::cout << currFile << std::endl;
        std::cout << fileStr.size() << std::endl;
        std::cout << std::setlocale(LC_ALL, NULL) << std::endl;
        std::cout << "========================================================================================" << std::endl;
        // Process...
    }
    return;
}
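
A side note on the classification step in the loop above: ispunct and isdigit from the C library take an int, and their behavior is undefined for values that do not fit in an unsigned char, which a wchar_t such as L'€' does not. A minimal sketch of the same step using the locale-aware templates from <locale> follows; normalize is a hypothetical helper name, and the locale to pass is assumed to be the one matching the file.

```cpp
#include <locale>
#include <string>

// Replaces punctuation and digits with spaces and lowercases the rest,
// classifying each wide character according to the given locale.
std::wstring normalize(std::wstring text, const std::locale& loc) {
    for (wchar_t& ch : text) {
        if (std::ispunct(ch, loc) || std::isdigit(ch, loc))
            ch = L' ';
        else
            ch = std::tolower(ch, loc);
    }
    return text;
}
```

With this helper the per-file branch on the facet is no longer needed, since the locale is passed in as an argument, for example normalize(s, locale_DE_UTF8).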

As you can see in the code, I have tried both the global locale and local locale variables, but to no avail.

In addition, an SO answer to "How can I use std::imbue to set the locale for std::wcout?" states:

So it really looks like there was an underlying C library mechanism that should be first enabled with setlocale to allow imbue conversion to work correctly.

Is this "obscure" mechanism the problem here?

Is it possible to alternate between the two locales while processing the files? What should I imbue (cout, ifstream, getline?) and how?

Any suggestions?

PS: Why is everything related with locale so chaotic? :|


Solution

  • This works as expected on my Linux machine, but not on my Windows machine under Cygwin: the set of available locales is apparently the same on both machines, but constructing a std::locale fails there with every locale string I tried.

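    #include <exception>
    #include <fstream>
    #include <iostream>
    #include <locale>
    #include <string>
    
    // Prints a file whose encoding matches the named locale.
    void printFile(const char* name, const char* loc)
    {
      try {
        std::wifstream inFile;
        // Imbue before opening, so the file is decoded with this locale
        // rather than with the global locale captured at construction.
        inFile.imbue(std::locale(loc));
        inFile.open(name);
        std::wstring line;
        while (std::getline(inFile, line))
          std::wcout << line << L'\n';
      } catch (const std::exception& e) {
        // std::locale(loc) throws if the named locale is not available.
        std::cerr << e.what() << std::endl;
      }
    }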
    
    int main()
    {
      std::locale::global(std::locale("en_US.utf8"));
    
      printFile ("gtext-u8.txt", "de_DE.utf8");       // utf-8 text: grüßen
      printFile ("gtext-legacy.txt", "de_DE@euro");   // iso8859-15 text: grüßen
    }
    

    Output:

    grüßen
    grüßen
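
The same idea can be applied to the loop in the question. The following is only a sketch under stated assumptions: localeNameFor and readAll are hypothetical helper names, the "-8.txt" naming rule and the two locale names are taken from the question, and those locales are assumed to be installed on the system.

```cpp
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

// Picks a locale name from the file name, following the question's rule
// that files ending in "-8.txt" are ISO-8859-15 and the rest are UTF-8.
const char* localeNameFor(const std::string& fileName) {
    return fileName.find("-8.txt") != std::string::npos
               ? "de_DE.iso885915@euro"
               : "de_DE.UTF-8";
}

// Reads a whole file into a wide string, decoding it with the given locale.
// A fresh stream is used per file and imbued before open(), so neither the
// global locale nor a previously read file influences the decoding.
std::wstring readAll(const std::string& path, const std::locale& loc) {
    std::wifstream in;
    in.imbue(loc);
    in.open(path);
    if (!in)
        throw std::runtime_error("Failed to open file " + path);

    std::wstring text, line;
    while (std::getline(in, line)) {
        text += line;
        text += L'\n';
    }
    return text;
}
```

Inside the loop this would be used as readAll(path + currFile, std::locale(localeNameFor(currFile))), with no call to std::locale::global per file. The std::locale constructor throws std::runtime_error if the named locale is not available, so that call belongs in a try block like the one in printFile.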