Is there a pythonic way in the standard library to parse raw binary files using for ... in ... syntax (i.e., __iter__/__next__) that yields blocks respecting the buffer size passed to open(), without having to subclass IOBase or its child classes?
I'd like to open a raw file for parsing, making use of the for ... in ... syntax, and I'd like that syntax to yield predictably shaped objects. This wasn't happening as expected for a problem I was working on, so I tried the following test (import numpy as np required):
In [271]: with open('tinytest.dat', 'wb') as f:
...: f.write(np.random.randint(0, 256, 16384, dtype=np.uint8).tobytes())
...:
In [272]: np.array([len(b) for b in open('tinytest.dat', 'rb', 16)])
Out[272]:
array([ 13, 138, 196, 263, 719, 98, 476, 3, 266, 63, 51,
241, 472, 75, 120, 137, 14, 342, 148, 399, 366, 360,
41, 9, 141, 282, 7, 159, 341, 355, 470, 427, 214,
42, 1095, 84, 284, 366, 117, 187, 188, 54, 611, 246,
743, 194, 11, 38, 196, 1368, 4, 21, 442, 169, 22,
207, 226, 227, 193, 677, 174, 110, 273, 52, 357])
I could not understand why this random behavior was arising, and why it was not respecting the buffer size argument. Using read1 gave the expected number of bytes:
In [273]: with open('tinytest.dat', 'rb', 16) as f:
...: b = f.read1()
...: print(len(b))
...: print(b)
...:
16
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n\x0f8}'
And there it is: A newline near the end of the first block.
In [274]: with open('tinytest.dat', 'rb', 2048) as f:
...: print(f.readline())
...:
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n'
Sure enough, readline was being called to produce each block of the file, and it was tripping up on the newline byte (value 10). I verified this by reading through the source, in the definition of IOBase:
def __next__(self):
    line = self.readline()
    if not line:
        raise StopIteration
    return line
So my question is this: is there some more pythonic way to achieve buffer-size-respecting raw file behavior that allows for ... in ... syntax, without having to subclass IOBase or its child classes (and thus not being part of the standard library)? If not, does this unexpected behavior warrant a PEP? (Or does it warrant learning to expect the behavior? :)
This behavior isn't unexpected; it is documented that all objects derived from IOBase iterate over lines. The only thing that changes between binary and text mode is how a line terminator is defined: in binary mode it is always b"\n".
The docs:
IOBase (and its subclasses) supports the iterator protocol, meaning that an IOBase object can be iterated over yielding the lines in a stream. Lines are defined slightly differently depending on whether the stream is a binary stream (yielding bytes), or a text stream (yielding character strings). See readline() below.
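To see that in action, here is a minimal demonstration using an in-memory io.BytesIO stream (an IOBase subclass) filled with the same 16 bytes that read1() returned above; iterating over it splits on the b'\n' byte:

import io

data = b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n\x0f8}'
for line in io.BytesIO(data):
    print(line)
# b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n'
# b'\x0f8}'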
The problem is that there was historically ambiguity between text and binary data in the type system; this was a major motivating factor for the backwards-incompatible Python 2 -> 3 transition.
I think it would certainly be reasonable to have the iterator protocol respect the buffer size for file objects opened in binary mode in Python 3. Why it was decided to keep the old behavior is something I can only speculate about.
In any case, you should just define your own iterator; that is common in Python. Iterators are a basic building block, like built-in types.
You can actually use the two-argument iter(callable, sentinel) form to construct a super basic wrapper:
>>> from functools import partial
>>> def iter_blocks(f, n):
...     return iter(partial(f.read, n), b'')
...
>>> np.array([len(b) for b in iter_blocks(open('tinytest.dat', 'rb'), 16)])
array([16, 16, 16, ..., 16, 16, 16])
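A quick usage sketch (same file and block size as above; note that the last block would be shorter than n if the file size weren't a multiple of n):

with open('tinytest.dat', 'rb') as f:
    for block in iter_blocks(f, 16):
        print(len(block))  # prints 16 for every block, since 16384 % 16 == 0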
Of course, you could have just used a generator:
def iter_blocks(bin_file, n):
    # Yield successive blocks of up to n bytes; stop when read() returns b'' at EOF.
    result = bin_file.read(n)
    while result:
        yield result
        result = bin_file.read(n)
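On Python 3.8+ the same generator can be written a bit more compactly with an assignment expression:

def iter_blocks(bin_file, n):
    # read(n) returns b'' at EOF, which ends the loop
    while block := bin_file.read(n):
        yield block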
There are tons of ways to approach this. Again, iterators are a core part of writing idiomatic Python.
Python is a pretty dynamic language, and "duck typing" is the name of the game. Generally, your first instinct shouldn't be "how do I subclass some built-in type to extend functionality?" That is often possible, but you'll find there are a lot of language features geared toward not having to do that, and the result is often better expressed that way to begin with, at least to my eyes.
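For example, if you want an object that supports for ... in ... directly, a thin wrapper class (just a sketch, not anything from the standard library) gets you there by composing with the file object rather than subclassing IOBase:

from functools import partial

class BlockReader:
    # Wrap any binary file-like object so iteration yields fixed-size blocks.
    def __init__(self, bin_file, n):
        self.bin_file = bin_file
        self.n = n

    def __iter__(self):
        # Reuse the two-argument iter() form shown above.
        return iter(partial(self.bin_file.read, self.n), b'')

# Usage:
# with open('tinytest.dat', 'rb') as f:
#     for block in BlockReader(f, 16):
#         ...  # each block is up to 16 bytes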