c++ locale fstream codecvt

fstream file size in codepoints


There are many questions about getting the size of the file behind a std::fstream, but the answers all give the size in bytes and are error-prone if the file is open in another stream.

I want to know the file size in codepoints, not bytes.

Now, std::fstream::seekg(0, std::ios::end) followed by std::fstream::tellg() only returns the length in bytes. This doesn't tell me how many UTF-16/32 characters are in the file. "Divide the result by sizeof(wchar_t)", I hear you say. That doesn't work for UTF-8 files and is not portable.
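For reference, the byte-oriented measurement described above can be sketched as follows. The helper name `byte_size` is hypothetical, and the stream is assumed to be seekable and opened in binary mode so that newline translation does not skew the result.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>

// Hypothetical helper: size of a seekable stream in bytes, NOT in code points.
// The read position is restored before returning.
inline std::streamoff byte_size(std::istream& in) {
    const std::istream::pos_type here = in.tellg();
    in.seekg(0, std::ios::end);
    const std::streamoff size = static_cast<std::streamoff>(in.tellg());
    in.seekg(here);
    return size;
}

// Usage example: "héllo" encoded as UTF-8 has 5 code points but 6 bytes.
inline void byte_size_example() {
    std::istringstream in("h\xC3\xA9llo");
    assert(byte_size(in) == 6);
}
```

The result counts encoded bytes, which is why it cannot answer the code point question for a variable-length encoding.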

Now, for the more technically minded: I have imbued the stream with my own std::codecvt class. std::codecvt has a member length() which, given two pointers into the stream, calculates the length and returns either max or the number of output characters. I would have thought that seeking on the file would seek by codecvt::intern_type rather than by the base char type.

I've looked into the fstream header and found that seek in fact doesn't use the codecvt. And, in my version from VS2010, the codecvt::length() member is not even mentioned. In fact, on each call to codecvt::in(), a new string object is created and grown by 1 char each time in() returns partial. It doesn't call the codecvt::max_length() member and supply the call with an adequate buffer instead.

Is this just my implementation or can I expect others to do the same? Has std::fstream been rewritten for VS2012 to make full use of locales?

Basically, I'm fed up with having to write my own file handlers every time I use text files. I'm hoping to create an fstream-derived class that will first read a file's BOM, if present, and imbue the correct codecvt, then convert those characters to char, wchar_t or whatever the code calls for. I'm also hoping to code it in such a way that, if the encoding is known in advance, a locale can be specified on construction.
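The BOM-sniffing step of such a class might look roughly like the sketch below. The names `Bom` and `detect_bom` are hypothetical, the byte signatures are the standard Unicode BOMs, and choosing and imbuing the matching codecvt is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical names; not part of any standard library.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Classify the first bytes of a file. 'head' should hold the first four
// bytes of the file (or the whole file if it is shorter than that).
inline Bom detect_bom(const std::string& head) {
    auto starts_with = [&head](const char* sig, std::size_t n) {
        return head.size() >= n && head.compare(0, n, sig, n) == 0;
    };
    if (starts_with("\xEF\xBB\xBF", 3)) return Bom::Utf8;
    // UTF-32 LE must be tested before UTF-16 LE because it begins with the
    // same two bytes. A UTF-16 LE file whose first character after the BOM
    // is U+0000 would be misclassified; this sketch accepts that ambiguity.
    if (starts_with("\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
    if (starts_with("\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
    if (starts_with("\xFF\xFE", 2)) return Bom::Utf16LE;
    if (starts_with("\xFE\xFF", 2)) return Bom::Utf16BE;
    return Bom::None;
}

// Usage example.
inline void detect_bom_example() {
    assert(detect_bom(std::string("\xEF\xBB\xBF" "abc")) == Bom::Utf8);
    assert(detect_bom("plain text") == Bom::None);
}
```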

Would I be better off working directly on the internal buffer, in effect rewriting the fstream class, or are there some tricks I'm unaware of?


Solution

  • If I understand you right, you expect that:

    `std::basic_fstream<CharT,Traits>::seekg`
    

    (which by inheritance is basic_istream<CharT,Traits>::seekg) ought to perform the stream-positioning operation in units of the intern_type of whatever codecvt the stream is imbued with.

    Template basic_istream is declared:

    template< 
        class CharT, 
        class Traits = std::char_traits<CharT>
    > class basic_istream;
    

    In the declaration of the member function:

    basic_istream & basic_istream<CharT,Traits>::seekg(pos_type pos)
    

    pos_type is Traits::pos_type, which with the default Traits is std::char_traits<CharT>::pos_type. It is therefore a type determined in any implementation solely by the template arguments of the basic_istream class, without reference to any codecvt.

    A basic_fstream<char>, for instance, remains a basic_fstream<char>, and its pos_type remains basic_fstream<char>::pos_type, regardless of the encoding that is chosen to read or write it.

    The declarations above are respectively as per C++11 Standard § 27.7.1 and § 27.7.2.1. The fact that pos_type is invariant with respect to any imbued codecvt, and hence also the behaviour of seekg(pos_type), are therefore consequences of the Standard.

    Equivalent remarks apply to basic_istream& seekg(off_type off, std::ios_base::seekdir dir).
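    The point about pos_type can be checked at compile time. The following is a minimal sketch: it only asserts that the stream's pos_type is the one supplied by the traits class, which is fixed when the stream type is instantiated and so cannot change when a locale is imbued at run time.

    ```cpp
    #include <fstream>
    #include <string>
    #include <type_traits>

    // pos_type comes from the Traits template argument, so it is fixed at
    // compile time; imbuing a locale at run time cannot alter it.
    static_assert(std::is_same<std::fstream::pos_type,
                               std::char_traits<char>::pos_type>::value,
                  "fstream positions use char_traits<char>::pos_type");
    static_assert(std::is_same<std::wfstream::pos_type,
                               std::char_traits<wchar_t>::pos_type>::value,
                  "wfstream positions use char_traits<wchar_t>::pos_type");
    ```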

    The std::codecvt::intern_type is the type of the elements of the internal sequence to which or from which the specified encoding will translate an external sequence of elements of type extern_type. The intern_type is the element type of the "in-program" sequence and the extern_type is the element type of the "in-file" sequence. The intern_type has nothing to do with positioning operations on the file.
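    As a small illustration of the two element types, here is a sketch using the standard facet std::codecvt<wchar_t, char, std::mbstate_t>. The alias name `WideFileFacet` is hypothetical.

    ```cpp
    #include <cwchar>
    #include <locale>
    #include <type_traits>

    // Hypothetical alias for the standard wide/narrow conversion facet.
    using WideFileFacet = std::codecvt<wchar_t, char, std::mbstate_t>;

    // intern_type: element type of the in-program sequence.
    static_assert(std::is_same<WideFileFacet::intern_type, wchar_t>::value,
                  "in-program elements are wchar_t");
    // extern_type: element type of the in-file sequence.
    static_assert(std::is_same<WideFileFacet::extern_type, char>::value,
                  "in-file elements are char");
    ```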

    If you must find out the size of a file in codepoints, and presuming that the possible encodings of interest are UTF-8, UTF-16 and UTF-32, then for the first two of these you have no choice but to read the entire file. They are variable-length encodings, with a UTF-8 codepoint consuming 1-4 bytes and a UTF-16 codepoint consuming 2 or 4 bytes. UTF-32 is a fixed-length 4-byte encoding, so in that case you might compute the number of complete codepoints as the byte-length of the file, minus the BOM-length if any, divided by 4, provided you discount the possibility of encoding errors except at end-of-file.
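    The UTF-32 arithmetic can be sketched as below. The function name `utf32_codepoints` is hypothetical, and the byte length is assumed to have been obtained separately from a stream opened in binary mode.

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Hypothetical helper: number of complete UTF-32 code points in a file of
    // 'file_bytes' bytes. A UTF-32 BOM, if present, occupies 4 bytes. Trailing
    // bytes that do not form a whole 4-byte unit are ignored.
    inline std::uintmax_t utf32_codepoints(std::uintmax_t file_bytes, bool has_bom) {
        const std::uintmax_t bom_bytes = has_bom ? 4u : 0u;
        if (file_bytes < bom_bytes) return 0;  // too short to hold the BOM
        return (file_bytes - bom_bytes) / 4;
    }

    // Usage example: a 4-byte BOM followed by three code points is 16 bytes.
    inline void utf32_codepoints_example() {
        assert(utf32_codepoints(16, true) == 3);
        assert(utf32_codepoints(16, false) == 4);
    }
    ```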

    For the variable-length encodings, the simplest way of counting the codepoints will be with a function template parameterized by an indicator of the presumed encoding. It will read the file in units of char or char16_t as appropriate, first consuming the BOM, if any. It will identify each unit that is the lead element of a codepoint in the presumed encoding, verify the presence of the number of subsequent elements required by that lead element, and increment the codepoint count if they are found.
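    A minimal sketch of that approach for the UTF-8 case only is given below. The names `utf8_trailing` and `count_utf8_codepoints` are hypothetical, the function is not templated on the encoding, and throwing on malformed input is an assumption of this sketch rather than a requirement. It checks only the lead/continuation structure, so overlong forms and encoded surrogates are not rejected.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <istream>
    #include <sstream>
    #include <stdexcept>
    #include <string>

    // Number of continuation bytes implied by a UTF-8 lead byte, or -1 if the
    // byte cannot start a sequence.
    inline int utf8_trailing(unsigned char lead) {
        if (lead < 0x80) return 0;
        if ((lead & 0xE0) == 0xC0) return 1;
        if ((lead & 0xF0) == 0xE0) return 2;
        if ((lead & 0xF8) == 0xF0) return 3;
        return -1;
    }

    // Count code points in a UTF-8 byte stream, skipping one leading BOM.
    inline std::size_t count_utf8_codepoints(std::istream& in) {
        std::size_t count = 0;
        bool at_start = true;
        char c;
        while (in.get(c)) {
            const unsigned char lead = static_cast<unsigned char>(c);
            const int trailing = utf8_trailing(lead);
            if (trailing < 0) throw std::runtime_error("invalid UTF-8 lead byte");
            unsigned char rest[3] = {0, 0, 0};
            for (int i = 0; i < trailing; ++i) {
                if (!in.get(c) || (static_cast<unsigned char>(c) & 0xC0) != 0x80)
                    throw std::runtime_error("truncated or malformed UTF-8 sequence");
                rest[i] = static_cast<unsigned char>(c);
            }
            // EF BB BF at the very start of the stream is the BOM, not content.
            const bool is_bom =
                at_start && lead == 0xEF && rest[0] == 0xBB && rest[1] == 0xBF;
            at_start = false;
            if (!is_bom) ++count;
        }
        return count;
    }

    // Usage example: BOM + "a", U+00E9, U+20AC, U+1F600 is four code points.
    inline void count_utf8_codepoints_example() {
        std::istringstream in(
            std::string("\xEF\xBB\xBF" "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80"));
        assert(count_utf8_codepoints(in) == 4);
    }
    ```

    A UTF-16 counterpart would follow the same shape, reading 16-bit units and treating a high surrogate as a lead element that requires one low surrogate after it.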