I am trying to read and process multiple files that are in different encoding. I am supposed to only use STL for this. Suppose that we have iso-8859-15 and UTF-8 files.
In this SO answer it states:
In a nutshell the more interesting part for you:
std::stream
(stringstream
,fstream
,cin
,cout
) has an inner locale-object, which matches the value of the global C++ locale at the moment of the creation of the stream object. Asstd::in
is created long before your code in main is called, it has most probably the classical C locale, no matter what you do afterwards.- You can make sure, that a std::stream object has the desirable locale by invoking
std::stream::imbue(std::locale(your_favorite_locale))
.
The problem is that from the two types, only the files that match the locale that was created first are processed correctly. For example If locale_DE_ISO885915
precedes locale_DE_UTF8
then files that are in UTF-8
are not appended correctly in string s
and when I cout
them out i only see a couple of lines from the file.
void processFiles() {
//setup locales for file decoding
std::locale locale_DE_ISO885915("de_DE.iso885915@euro");
std::locale locale_DE_UTF8("de_DE.UTF-8");
//std::locale::global(locale_DE_ISO885915);
//std::cout.imbue(std::locale());
const std::ctype<wchar_t>& facet_DE_ISO885915 = std::use_facet<std::ctype<wchar_t>>(locale_DE_ISO885915);
//std::locale::global(locale_DE_UTF8);
//std::cout.imbue(std::locale());
const std::ctype<wchar_t>& facet_DE_UTF8 = std::use_facet<std::ctype<wchar_t>>(locale_DE_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string currFile, fileStr;
std::wifstream inFile;
std::wstring s;
for (std::vector<std::string>::const_iterator fci = files.begin(); fci != files.end(); ++fci) {
currFile = *fci;
//check file and set locale
if (currFile.find("-8.txt") != std::string::npos) {
std::locale::global(locale_DE_ISO885915);
std::cout.imbue(locale_DE_ISO885915);
}
else {
std::locale::global(locale_DE_UTF8);
std::cout.imbue(locale_DE_UTF8);
}
inFile.open(path + currFile, std::ios_base::binary);
if (!inFile) {
//TODO specific file report
std::cerr << "Failed to open file " << *fci << std::endl;
exit(1);
}
s.clear();
//read file content
std::wstring line;
while( (inFile.good()) && std::getline(inFile, line) ) {
s.append(line + L"\n");
}
inFile.close();
//remove punctuation, numbers, tolower...
for (unsigned int i = 0; i < s.length(); ++i) {
if (ispunct(s[i]) || isdigit(s[i]))
s[i] = L' ';
}
if (currFile.find("-8.txt") != std::string::npos) {
facet_DE_ISO885915.tolower(&s[0], &s[0] + s.size());
}
else {
facet_DE_UTF8.tolower(&s[0], &s[0] + s.size());
}
fileStr = converter.to_bytes(s);
std::cout << fileStr << std::endl;
std::cout << currFile << std::endl;
std::cout << fileStr.size() << std::endl;
std::cout << std::setlocale(LC_ALL, NULL) << std::endl;
std::cout << "========================================================================================" << std::endl;
// Process...
}
return;
}
As you can see in the code, I have tried with global
and locale local variables
but to no avail.
In addition, in How can I use std::imbue to set the locale for std::wcout? SO answer it states:
So it really looks like there was an underlying C library mechanizme that should be first enabled with setlocale to allow imbue conversion to work correctly.
Is this "obscure" mechanism the problem here?
Is it possible to alternate between the two locales while processing the files? What should I imbue (cout
, ifstream
, getline
?) and how?
Any suggestions?
PS: Why is everything related with locale so chaotic? :|
This works for me as expected on my Linux machine, but not on my Windows machine under Cygwin (the set of available locales is apparently the same on both machines, but std::locale::locale
just fails with every imaginable locale string).
#include <iostream>
#include <fstream>
#include <locale>
#include <string>
void printFile(const char* name, const char* loc)
{
try {
std::wifstream inFile;
inFile.imbue(std::locale(loc));
inFile.open(name);
std::wstring line;
while (getline(inFile, line))
std::wcout << line << '\n';
} catch (std::exception& e) {
std::cerr << e.what() << std::endl;
}
}
int main()
{
std::locale::global(std::locale("en_US.utf8"));
printFile ("gtext-u8.txt", "de_DE.utf8"); // utf-8 text: grüßen
printFile ("gtext-legacy.txt", "de_DE@euro"); // iso8859-15 text: grüßen
}
Output:
grüßen
grüßen