There are many questions on getting the file size of an std::fstream's file, but they all return the file size in bytes and are error prone if the file is open in another stream.
I want to know the file size in codepoints, not bytes.
Now, `std::fstream::seekg(0, std::ios::end)` followed by `std::fstream::tellg()` only returns the length in bytes. This doesn't tell me how many UTF-16/32 characters are in the file. "Divide the result by `sizeof(wchar_t)`," I hear you say. That doesn't work for UTF-8 files and is not portable.
Now, for the more technically minded, I have imbued the stream with my own `std::codecvt` class. `std::codecvt` has a member `length()` which, given two pointers into the stream, calculates the length and returns either max or the number of output characters. I would have thought that seeking on the file would seek by `codecvt::intern_type` rather than by the base `char` type.
I've looked into the `fstream` header and found that seek in fact doesn't use the `codecvt`. And, on my version from VS2010, the `codecvt::length()` member is not even mentioned. In fact, on each call to `codecvt::in()`, a new string object is created and increased in size by 1 char each time `in()` returns `partial`. It doesn't call the `codecvt::max_length()` member instead and supply the call with an adequate buffer.
Is this just my implementation or can I expect others to do the same? Has `std::fstream` been rewritten for VS2012 to make full use of locales?
Basically, I'm fed up with having to write my own file handlers every time I use text files. I'm hoping to create an `fstream`-derived class that will first read a file's BOM, if present, and imbue the correct `codecvt`, then convert those characters to `char`, `wchar_t` or whatever the code calls for. I'm also hoping to code it in such a way that if the encoding is known in advance, a `locale` can be specified on construction.
Would I be better off working directly on the internal buffer, in effect rewriting the fstream class, or are there some tricks I'm unaware of?
If I understand you right, you expect that `std::basic_fstream<CharT,Traits>::seekg` (which by inheritance is `basic_istream<CharT,Traits>::seekg`) ought to perform the stream-positioning operation in units that are the `intern_type` of whatever `codecvt` the stream is imbued with.
Template `basic_istream` is declared:
template<
class CharT,
class Traits = std::char_traits<CharT>
> class basic_istream;
In the declaration of the member function:
basic_istream & basic_istream<CharT,Traits>::seekg(pos_type pos)
`pos_type` is `Traits::pos_type`, which by default is `std::char_traits<CharT>::pos_type`. It is therefore a type determined in any implementation solely by the template arguments of the `basic_istream` class, and without reference to any `codecvt`.
A `basic_fstream<char>`, for instance, remains a `basic_fstream<char>`, and its `pos_type` remains `basic_fstream<char>::pos_type`, regardless of the encoding that is chosen to read or write it.
The declarations above are respectively as per C++11 Standard § 27.7.1 and § 27.7.2.1. The fact that `pos_type` is invariant with respect to any imbued `codecvt`, and hence also the behaviour of `seekg(pos_type)`, are therefore consequences of the Standard. Equivalent remarks apply for `basic_istream& seekg(off_type off, std::ios_base::seekdir dir)`.
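As a small compile-time illustration of the point above, the following sketch checks that a stream's `pos_type` is simply the one supplied by its traits class. The helper name `pos_type_comes_from_traits` is made up for this example and is not part of the standard library.

```cpp
#include <fstream>
#include <string>
#include <type_traits>

// Hypothetical helper: true when Stream::pos_type is exactly the pos_type
// of the stream's traits class, i.e. it is fixed by the template arguments.
template <class Stream>
constexpr bool pos_type_comes_from_traits()
{
    return std::is_same<typename Stream::pos_type,
                        typename Stream::traits_type::pos_type>::value;
}

// These hold at compile time, before any locale or codecvt could be imbued.
static_assert(pos_type_comes_from_traits<std::ifstream>(),
              "ifstream::pos_type is char_traits<char>::pos_type");
static_assert(std::is_same<std::fstream::pos_type,
                           std::char_traits<char>::pos_type>::value,
              "fstream::pos_type is char_traits<char>::pos_type");
```

Since both checks are evaluated by the compiler, nothing done to the stream at run time, including `imbue()`, can change the type that `seekg` and `tellg` work with.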
The `std::codecvt::intern_type` is the type of the elements of the internal sequence to which or from which the specified encoding will translate an external sequence of elements of type `extern_type`. The `intern_type` is the element type of the "in-program" sequence and the `extern_type` is the element type of the "in-file" sequence. The `intern_type` has nothing to do with positioning operations on the file.
If you must find out the size of a file in codepoints, and presuming that the possible encodings of interest are UTF-8, UTF-16 and UTF-32, then for the first two of these you have no choice but to read the entire file, because they are variable-length encodings, with a UTF-8 codepoint consuming 1-4 bytes and a UTF-16 codepoint consuming 2 or 4 bytes. UTF-32 is a fixed-length 4-byte encoding, so in that case you might compute the number of complete codepoints as the byte-length of the file, minus BOM-length if any, divided by 4, if you discount the possibility of encoding errors except at end-of-file.
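The UTF-32 arithmetic can be sketched as below. The function name `utf32_codepoints` and its parameters are hypothetical; the byte length is assumed to have been obtained elsewhere (for example from `tellg()` on a stream opened in binary mode), and the content is assumed to be well-formed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of complete UTF-32 code points in a file of
// `file_bytes` bytes whose first `bom_bytes` bytes are a BOM (0 or 4).
// A trailing partial unit of 1-3 bytes is ignored by the integer division.
inline std::uintmax_t utf32_codepoints(std::uintmax_t file_bytes,
                                       std::uintmax_t bom_bytes)
{
    assert(bom_bytes == 0 || bom_bytes == 4);  // a UTF-32 BOM is one 4-byte unit
    if (file_bytes < bom_bytes) {
        return 0;                              // too short to even hold the BOM
    }
    return (file_bytes - bom_bytes) / 4;
}
```

For a 16-byte file that starts with a BOM, this gives (16 - 4) / 4 = 3 code points.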
For the variable-length encodings, the simplest way of counting the codepoints will be with a template function parameterized by an indicator of the presumed encoding. It will read the file, first consuming the BOM, if any, in units of `char` or `char16_t` as appropriate, identifying each unit that is the lead element of a codepoint in the presumed encoding, verifying the presence of the number of subsequent elements required by the lead element, and incrementing the codepoint count if they are found.
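A minimal sketch of that approach for the UTF-8 case only is shown here. It is a plain function rather than the encoding-parameterized template described above, and the names `utf8_sequence_length`, `count_utf8_codepoints` and `utf8_count_demo` are invented for the example. It assumes a seekable stream opened in binary mode, and it only checks lead and continuation bytes; it does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Number of bytes announced by a UTF-8 lead byte, or 0 if `b` cannot start
// a code point (a continuation byte or an invalid lead byte).
inline int utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;            // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Counts code points in a UTF-8 byte stream, skipping a leading BOM (EF BB BF).
// Returns false on a bad lead byte or a missing/invalid continuation byte;
// `count` then holds the number of complete code points read before the error.
inline bool count_utf8_codepoints(std::istream& in, std::size_t& count)
{
    count = 0;

    // Consume the BOM if present; otherwise rewind to the start.
    char bom[3];
    in.read(bom, 3);
    const bool has_bom = in.gcount() == 3 &&
                         static_cast<unsigned char>(bom[0]) == 0xEF &&
                         static_cast<unsigned char>(bom[1]) == 0xBB &&
                         static_cast<unsigned char>(bom[2]) == 0xBF;
    if (!has_bom) {
        in.clear();                    // a short read sets failbit/eofbit
        in.seekg(0, std::ios::beg);
    }

    char c;
    while (in.get(c)) {
        const int len = utf8_sequence_length(static_cast<unsigned char>(c));
        if (len == 0) return false;    // not a valid lead byte
        for (int i = 1; i < len; ++i) {
            // Each trailing byte must exist and look like 10xxxxxx.
            if (!in.get(c) || (static_cast<unsigned char>(c) & 0xC0) != 0x80)
                return false;
        }
        ++count;
    }
    return true;
}

// Self-check on in-memory data: BOM, then 'h', U+00E9 and U+20AC.
inline void utf8_count_demo()
{
    std::istringstream in(std::string("\xEF\xBB\xBF" "h" "\xC3\xA9" "\xE2\x82\xAC"));
    std::size_t n = 0;
    const bool ok = count_utf8_codepoints(in, n);
    assert(ok && n == 3);
    (void)ok;                          // silence unused warnings under NDEBUG
}
```

The same shape would carry over to UTF-16 by reading `char16_t` units and treating a high surrogate as the lead element of a two-unit sequence.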