Reading files with istreambuf_iterator vs copy_n

I want to parse different kinds of chunks in a file with varying length, so I created a function to read out a chunk by passing in the ifstream, like this:

void parse_next(std::ifstream& input_file, std::vector<uint8_t>& data, size_t count)
{
    std::copy_n(
        std::istreambuf_iterator<char>(input_file),
        count,
        std::back_inserter(data)
    );
}

I expected the file position to advance by count, i.e.:

// some init code
auto const pos_before = input_file.tellg();
parse_next(input_file, data, count);
auto const pos_after = input_file.tellg();
// tellg() returns a stream position, so convert the difference explicitly
auto const advanced = static_cast<size_t>(pos_after - pos_before);

// this assumption is _not_ correct!
assert(count == advanced);

// but this is!
assert((count - 1) == advanced);

However, calling input_file.read() with count instead of std::copy_n advances the position by exactly count.

So what's going on here? I can't see anywhere in the documentation of istreambuf_iterator where this is mentioned. Or is it the std::copy_n that is messing with me?

Note that in the example above, we can assume that there is plenty of data left to read, so it is not because the file is empty. Further, the file is opened as binary.

Solution

You're using istreambuf_iterator. It is an input-only iterator. Imagine that you have a file with 5 bytes and you read count=2:

1. copy_n dereferences the iterator, which calls sgetc to read the first byte. This does not advance the stream position.
2. Since count=2, copy_n needs one more byte, so it increments the iterator. That calls sbumpc and advances the stream position by one.
3. It reads the second byte using sgetc.
4. Since count=2, no more bytes are required. copy_n returns.

Note that only step 2 advances the stream position, and it runs only once when reading two characters. In general, a typical copy_n increments the iterator count - 1 times, which is the off-by-one you observed.
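To make the steps concrete, here is a minimal sketch that walks the same sequence by hand instead of going through copy_n. It assumes an in-memory std::istringstream as a stand-in for your file, and position_after_iterator_copy is a name I made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper (not from the question): copies n characters through an
// istreambuf_iterator the way a typical input-iterator copy loop does, and
// returns the stream position afterwards.
std::streamoff position_after_iterator_copy(const std::string& bytes,
                                            std::size_t n,
                                            std::string& out)
{
    assert(n <= bytes.size());             // reading past the end is not handled here
    std::istringstream in(bytes);          // in-memory stand-in for the file
    std::istreambuf_iterator<char> it(in);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(*it);                // operator* peeks via sgetc(): no advance
        if (i + 1 < n) {
            ++it;                          // operator++ consumes via sbumpc(): +1
        }
    }
    return in.tellg();                     // n - 1 for n > 0, because the last byte was only peeked
}
```

Calling it with "abcde" and n = 2 copies "ab" but leaves the position at 1, so the next read would see 'b' again.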

Yes, this is strange. But most people would just use input_file.read(). I've almost never seen istreambuf_iterator used in production code, not least because it is inefficient for this kind of use case.
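For reference, a read()-based parse_next could look roughly like this. It is only a sketch: I changed the parameter to std::istream& so it can be tried with an in-memory stream, added a return value for the number of bytes actually read, and offset_after_parse is a hypothetical helper that exists only to show where the position ends up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ios>
#include <istream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Sketch: append up to `count` bytes from `input` to `data` using read().
// Returns the number of bytes actually read (less than `count` on a short read).
std::size_t parse_next(std::istream& input, std::vector<std::uint8_t>& data, std::size_t count)
{
    // Guard the cast to std::streamsize below.
    assert(count <= static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max()));
    const std::size_t old_size = data.size();
    data.resize(old_size + count);         // make room, then read straight into it
    input.read(reinterpret_cast<char*>(data.data() + old_size),
               static_cast<std::streamsize>(count));
    const std::size_t got = static_cast<std::size_t>(input.gcount());
    data.resize(old_size + got);           // drop the unused tail on a short read
    return got;
}

// Hypothetical usage helper: reports the stream position after one call.
std::streamoff offset_after_parse(const std::string& bytes, std::size_t count)
{
    std::istringstream in(bytes);          // in-memory stand-in for the file
    std::vector<std::uint8_t> data;
    parse_next(in, data, count);
    return in.tellg();
}
```

With "abcde" and count = 2, the position after the call is 2, which is what the original assert expected.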

One could argue that copy_n should increment the iterator once more before returning. That would fix this 0.1% use case, at the cost of slowing down other use cases.
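If you really want to stay with the iterator, you can get that behaviour locally with a hand-written loop rather than by changing copy_n. A sketch, assuming at least count bytes remain in the stream; parse_next_consuming is a made-up name, and it takes std::istream& so it can be tried with an in-memory stream:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <iterator>
#include <sstream>
#include <vector>

// Hypothetical helper: reads `count` bytes through an istreambuf_iterator and
// increments the iterator after every byte, including the last one.
// Precondition: at least `count` bytes remain in `input`.
void parse_next_consuming(std::istream& input, std::vector<std::uint8_t>& data, std::size_t count)
{
    std::istreambuf_iterator<char> it(input);
    for (std::size_t i = 0; i < count; ++i) {
        assert(it != std::istreambuf_iterator<char>() && "ran out of input");
        data.push_back(static_cast<std::uint8_t>(*it)); // sgetc(): peek at the byte
        ++it;                                           // sbumpc(): consume it
    }
}
```

Because every byte is consumed after it is copied, the position advances by exactly count and a second call continues with the next unread byte. It is still a byte-at-a-time loop, so read() remains the simpler choice for bulk data.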