I'm writing a C++ program that processes large delimited files.
I have a UTF-8 csv file that contains a row with the (emoji?) character 🌟. It looks something like this:
123,"james","piotrj🌟","1996-01-28"
When I call getline() on this row, it reads up to the emoji and then stops. So the resulting string from getline() is 123,"james","piotrj. I'm not sure exactly why this is happening. If I had to guess, I'm using the locale improperly and this emoji (or part of it) is being read as an EOF.
I would like to read this row in as is, do some string operations, and then write it out to another file.
I have some example code here:
locale loc("en_US.UTF8");
wifstream inFile;
inFile.imbue(loc);
inFile.open("MyFile.csv");
if (inFile.is_open()) {
    wstring str;
    if (getline(inFile, str)) {
        wcout << str << endl;
    }
    if (getline(inFile, str)) {
        wcout << str << endl;
    }
    inFile.close();
}
The output of this code is: 123,"james","piotrj. The body of the second if statement does not execute because the second getline() did not grab anything.
To try some things, I changed the locale to this:
locale loc = locale();
The name of this locale is "C", and with it getline() reads the entire line. The output of this program is: 123,"james","piotrj🌟","1996-01-28". This is a step in the right direction, but without the proper locale the wstring will not store the emoji properly. In my program I do some individual character checking to see whether the string could be represented in ANSI, so I would really like the wstring to hold that emoji as one character.
It looks like you are using libc++. Wide streams in this implementation do not support UTF-8 at all.
Should you use libstdc++ instead, your program would work, except you would get transliterated text on the output. I am getting
123,"james","piotrj?","1996-01-28"
That's because the locale is not imbued in wcout. To get normal text, you would need to do either
ios_base::sync_with_stdio(false);
wcout.imbue(loc);
(you cannot imbue a locale in a standard stream while it is synchronized with stdio)
or, alternatively,
locale::global(loc);
Then your program would fully work.
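For reference, here is a minimal sketch of how the pieces above might fit together when building against libstdc++. The helper names read_first_line_wide and prepare_wcout are made up for illustration, and whether a locale such as "en_US.UTF8" exists depends on the system, so the sketch reports failure instead of assuming the locale is installed.

```cpp
#include <fstream>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads the first line of `path` through a wide stream
// that decodes the file using the named locale. Returns false if the locale
// is not installed, or if the file cannot be opened or read.
bool read_first_line_wide(const std::string& path, const char* locale_name,
                          std::wstring& line) {
    std::locale loc;
    try {
        loc = std::locale(locale_name);
    } catch (const std::runtime_error&) {
        return false;  // the named locale is not available on this system
    }
    std::wifstream in;
    in.imbue(loc);  // imbue before opening, as in the question's code
    in.open(path);
    if (!in.is_open()) {
        return false;
    }
    return static_cast<bool>(std::getline(in, line));
}

// Hypothetical helper: makes std::wcout encode its output using `loc`.
// Following the advice above, the stream is first desynchronized from stdio.
// Call this before any other output is written.
void prepare_wcout(const std::locale& loc) {
    std::ios_base::sync_with_stdio(false);
    std::wcout.imbue(loc);
}
```

A caller would run prepare_wcout(std::locale("en_US.UTF8")) once at startup, then call read_first_line_wide("MyFile.csv", "en_US.UTF8", str) and print str with wcout only when it returns true.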
If you are tied to libc++, your only alternative is to use narrow character streams.
Edit: with MSVC this code doesn't work either. I don't know why Microsoft claims UTF-8 support in newer versions of Windows; apparently it's not there at all. On Windows one can install gcc (one of several flavours; I recommend the UCRT flavour available with MSYS2). I cannot guarantee it will work, though, because ultimately the control flow passes through Microsoft runtime libraries. The proper solution is to never, ever use any wchar_t APIs except for calling specific WinAPI functions that require wchar_t. Use narrow characters, read UTF-8 from your file, store and manipulate strings as UTF-8, output them as UTF-8. I have tested this code converted to narrow characters with MSVC, and it works as expected for me.
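With narrow strings, the per-character checks mentioned in the question have to work on decoded code points, not on bytes. The following is a rough sketch of one way to do that. The names decode_utf8 and is_ascii are made up for illustration, and the ASCII test is only a stand-in for an "ANSI" check, which in practice depends on the target code page.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 byte string into Unicode code points.
// Malformed or truncated sequences become U+FFFD. Simplified on purpose:
// overlong encodings and surrogate code points are not rejected.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3;
        } else {
            out.push_back(char32_t{0xFFFD});  // stray continuation or invalid byte
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= s.size() ||
                (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
                ok = false;  // sequence is cut short or malformed
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t{0xFFFD});
    }
    return out;
}

// Hypothetical stand-in for the "fits in ANSI" check: true only if every
// code point is plain ASCII (U+0000 to U+007F).
bool is_ascii(const std::vector<char32_t>& code_points) {
    for (char32_t cp : code_points) {
        if (cp > 0x7F) {
            return false;
        }
    }
    return true;
}
```

Each line read with std::getline from a std::ifstream into a std::string can be passed to decode_utf8 for inspection, while the original std::string is written to the output file unchanged, so the emoji survives as the same four bytes it had in the input.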