I am trying to read a UTF-8 encoded file into a UTF-32 (UCS-4) string. Basically, I want a fixed-size character type inside the application.
I want the translation to happen as part of the stream processing (because that is what the locale is supposed to be used for). Other questions cover doing the translation on the string, but that is wasteful: you do a translation pass in memory and then a second pass to send the result to the stream. By doing it with the locale in the stream you only need a single pass, and there is no requirement for a copy to be made (assuming you want to maintain the original).
This is what I tried.
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
int main()
{
    std::locale converter(std::locale(), new std::codecvt_utf8<char32_t>);
    std::basic_ifstream<char32_t> iFile;
    iFile.imbue(converter);
    iFile.open("test.data");
    std::u32string line;
    while(std::getline(iFile, line))
    {
    }
}
Since these are all standard types, I was surprised by this compilation error:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/istream:275:41:
error: no matching function for call to 'use_facet'
const ctype<_CharT>& __ct = use_facet<ctype<_CharT> >(__is.getloc());
^~~~~~~~~~~~~~~~~~~~~~~~~
Compiled with:
g++ -std=c++14 test.cpp
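It seems char32_t is not what I wanted: the standard library only has to provide the stream facets (such as std::ctype) for char and wchar_t, which is why the use_facet call fails. Simply moving to wchar_t worked for me. I suspect this only works the way I want on Linux-like systems, where wchar_t is 32 bits; on Windows, where wchar_t is 16 bits, the conversion will be to UCS-2 rather than UTF-32 (but I can't test that).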
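The platform assumption above can be made explicit. Here is a minimal sketch that checks whether wchar_t is wide enough for any Unicode code point; wchar_holds_code_point is a hypothetical helper name, not part of the original code:

```cpp
#include <cwchar>   // WCHAR_MAX

// Hypothetical helper (an assumption for illustration): true when wchar_t
// can represent every Unicode code point (up to U+10FFFF), which is the
// case where std::codecvt_utf8<wchar_t> yields UTF-32 rather than UCS-2.
constexpr bool wchar_holds_code_point()
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

Putting static_assert(wchar_holds_code_point(), "wchar_t is too narrow for UTF-32") at the top of the program should turn a silent UCS-2 fallback into a compile error. The wchar_t version follows.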
#include <algorithm>
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
int main()
{
    std::locale utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);
    // Input stream reads UTF-8 and converts to a UTF-32 (UCS-4) string
    std::wifstream iFile("test.data");
    iFile.imbue(utf8_to_utf32);
    // Output stream converts the UTF-32 (UCS-4) string back to UTF-8
    std::wofstream oFile("test.res");
    oFile.imbue(utf8_to_utf32);
    // Now just read like you would normally.
    std::wstring line;
    while(std::getline(iFile, line))
    {
        // UTF-32 characters are fixed size,
        // so reversing is simple: just do it in-place.
        std::reverse(std::begin(line), std::end(line));
        // Unicode unfortunately also has grapheme clusters (groups of code points
        // that are displayed as a single glyph). The reverse above splits
        // these incorrectly. We need a second pass to reverse the code points inside
        // each cluster. This is beyond the scope of this question and left as an
        // exercise (but I may come back to it later).
        oFile << line << "\n";
    }
}
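The second pass mentioned in the comments above could look roughly like this. It is only a sketch under a loud simplifying assumption: the only cluster extenders it recognises are the Combining Diacritical Marks (U+0300 to U+036F). Real grapheme segmentation follows the Unicode text segmentation rules and covers far more cases. The names is_combining_mark and reverse_keep_marks are hypothetical.

```cpp
#include <algorithm>
#include <string>

// Assumption: only U+0300..U+036F count as combining marks here.
inline bool is_combining_mark(char32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

// Reverse a string of fixed-size code points in-place while keeping each
// base character followed by its combining marks.
template <typename String>
void reverse_keep_marks(String& s)
{
    // First pass: plain reverse. A cluster "base m1 m2" becomes "m2 m1 base".
    std::reverse(s.begin(), s.end());
    // Second pass: re-reverse each run of marks together with the base
    // character that now follows it.
    for (auto p = s.begin(), end = s.end(); p != end; ) {
        auto q = p;
        p = std::find_if(p, end, [](typename String::value_type c) {
            return !is_combining_mark(static_cast<char32_t>(c));
        });
        if (p != end) ++p;   // include the base character in the run
        std::reverse(q, p);
    }
}
```

In the loop above, calling reverse_keep_marks(line) in place of the bare std::reverse would be the drop-in change, assuming wchar_t holds a full code point.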
A comment above suggested this would be slower than reading the data and then translating it afterwards. So I did some tests:
// read1.cpp Translation in stream using codecvt and Locale
#include <algorithm>
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>
int main()
{
    std::locale utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);
    std::wifstream iFile("test.data");
    iFile.imbue(utf8_to_utf32);
    std::wofstream oFile("test.res1");
    oFile.imbue(utf8_to_utf32);
    std::wstring line;
    while(std::getline(iFile, line))
    {
        std::reverse(std::begin(line), std::end(line));
        oFile << line << "\n";
    }
}
// read2.cpp Translation using codecvt after reading.
#include <algorithm>
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>
int main()
{
    std::ifstream iFile("test.data");
    std::ofstream oFile("test.res2");
    // Converts between UTF-8 bytes and wide strings in memory.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_to_utf32;
    std::string line;
    std::wstring wideline;
    while(std::getline(iFile, line))
    {
        wideline = utf8_to_utf32.from_bytes(line);
        std::reverse(std::begin(wideline), std::end(wideline));
        oFile << utf8_to_utf32.to_bytes(wideline) << "\n";
    }
}
// read3.cpp Using UTF-8
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <fstream>
// True for any byte that is not a UTF-8 continuation byte (10xxxxxx).
static bool is_lead(std::uint8_t ch) { return ch < 0x80 || ch >= 0xc0; }
/* Reverse a utf-8 string in-place */
void reverse_utf8(std::string& s) {
    // Reverse all bytes, then re-reverse the bytes of each character.
    std::reverse(s.begin(), s.end());
    for (auto p = s.begin(), end = s.end(); p != end; ) {
        auto q = p;
        p = std::find_if(p, end, is_lead);
        // Guard against malformed input with no lead byte left.
        if (p != end) ++p;
        std::reverse(q, p);
    }
}
int main()
{
    std::ifstream iFile("test.data");
    std::ofstream oFile("test.res3");
    std::string line;
    while(std::getline(iFile, line))
    {
        reverse_utf8(line);
        oFile << line << "\n";
    }
    return 0;
}
The test file was 58 MB of Japanese Unicode text:
> ls -lah test.data
-rw-r--r-- 1 loki staff 58M Jan 28 11:28 test.data
> g++ -O3 -std=c++14 read1.cpp -o a1
> g++ -O3 -std=c++14 read2.cpp -o a2
> g++ -O3 -std=c++14 read3.cpp -o a3
>
> # This is the one using Locale in stream
> time ./a1
real 0m0.645s
user 0m0.521s
sys 0m0.108s
>
> # This is the one doing translation after reading.
> time ./a2
real 0m1.058s
user 0m0.916s
sys 0m0.123s
>
> # This is the one using UTF-8
> time ./a3
real 0m0.785s
user 0m0.663s
sys 0m0.104s
Doing the translation in the stream is faster, but not dramatically so (note that this was a lot of data). So choose whichever is easiest to read.