I have a bunch of text files that I want to read into std::string. Some of them are UCS-2 encoded and some are UTF-8. How do I read any such text file into a std::string? Do I have to convert them?
How they are read depends on what your OS supports and the locale you're using.
If you just naïvely read in files without touching your locale, and their encoding does not match the one your standard C++ library expects, you may end up with garbled text or read errors. The same applies to single-byte versus multibyte character sets.
There's no reliable way to tell what the encoding of a file is prior to reading it (metadata may be wrong), so the general strategy is to try the most common formats first, and then retry with different formats if that fails (i.e. an invalid character is encountered). Even then it may be ambiguous. This is a deceptively complex problem; you run into the same issue when parsing HTML with crazy character sets.
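As a rough illustration of the "try the common formats first" strategy, here is a minimal sketch that reads the raw bytes and uses a byte-order mark (BOM) to choose between UTF-8 and UCS-2. The function names (append_utf8, to_utf8, read_text_file) are made up for this example. It assumes that a file without a BOM is already UTF-8, which is only a guess, and it replaces surrogate code units with U+FFFD rather than decoding full UTF-16.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Append one BMP code point (at most U+FFFF, as in UCS-2) to `out` as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert raw file bytes to UTF-8, guessing the encoding from the BOM.
std::string to_utf8(const std::string& bytes) {
    auto byte = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    // UTF-8 BOM (EF BB BF): strip it and keep the rest unchanged.
    if (bytes.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return bytes.substr(3);
    const bool le = bytes.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE;
    const bool be = bytes.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF;
    if (!le && !be)
        return bytes;  // No BOM: assume the data is already UTF-8 (a guess).
    std::string out;
    for (std::size_t i = 2; i + 1 < bytes.size(); i += 2) {
        std::uint32_t unit = le ? (byte(i) | (byte(i + 1) << 8))
                                : ((byte(i) << 8) | byte(i + 1));
        // UCS-2 has no surrogates; substitute U+FFFD to keep the output valid.
        if (unit >= 0xD800 && unit <= 0xDFFF) unit = 0xFFFD;
        append_utf8(out, unit);
    }
    return out;
}

// Read a whole file in binary mode and return its contents as UTF-8.
std::string read_text_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    return to_utf8(bytes);
}
```

Calling read_text_file("notes.txt") (a placeholder path) then yields a UTF-8 std::string for either kind of input. Files without a BOM, or in other encodings, would still need the retry-on-invalid-character approach described above or an external library.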
In general, there are two sets of stream I/O classes available: narrow ones built on char, which treat single-byte and multibyte encodings alike as plain bytes, and wide ones built on wchar_t. How the wide streams convert to and from the bytes on disk is deeply platform specific though, so if you're using an English localized OS with no special character support added, then the encodings you need may not be supported by C++ directly without the use of an external library.
cin and cout have wide counterparts, wcin and wcout. These are part of standard C++ rather than a Microsoft extension; the w prefix marks the streams that work with wchar_t instead of char.
wcout << L"儫";
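For this to print correctly, the literal needs the L prefix (making it a wide string), and both the stream's locale and the console must be able to represent the character. You do not need to #define _UNICODE for it to compile; that macro only selects the wide variants of the TCHAR-style generic names in Microsoft's headers. As a side note, Windows provides many of its system API calls in two versions, one that takes a narrow (char, "ANSI" code page) string and one that takes a wide (wchar_t, UTF-16) string, e.g. CreateProcessA vs CreateProcessW.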
So to summarize, I/O functionality is split along character width and locale. In order to give you a more targeted answer to your question, I'd need to know more about your goals. Take a look at C++'s locale support to get a better idea about this, specifically the locale functions in ios_base, imbue and getloc. There isn't currently a good way to handle these problems with widely deployed versions of C++, though I understand these issues have been alleviated in upcoming versions of C++.
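To make imbue and getloc a little more concrete, here is a small sketch. The helper names (locale_or_classic, imbue_and_name) are invented for this example, and locale names other than "C" are platform specific, so the first helper falls back to the classic "C" locale when a name is not available.

```cpp
#include <istream>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Try to construct a named locale; fall back to the classic "C" locale
// if the platform does not provide it.
std::locale locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Imbue a wide input stream with `loc` and return the name of the
// locale the stream reports afterwards via getloc().
std::string imbue_and_name(std::wistream& in, const std::locale& loc) {
    in.imbue(loc);
    return in.getloc().name();
}
```

For a file you would typically imbue a std::wifstream before reading from it, for example with locale_or_classic("en_US.UTF-8") on a system where that locale happens to be installed. Whether that decodes the file the way you want still depends on the platform's locale implementation.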