In a Python script, I've written:
# etc. etc.
input_file = args.input_file_path or sys.stdin
arr = numpy.fromfile(input_file, dtype=numpy.dtype('f32'))
When I run the script, I get:
$ cat nums.fp32.bin | ./myscript
File "./myscript", line 123, in main
arr = numpy.fromfile(input_file, dtype=numpy.dtype('f32'))
OSError: obtaining file position failed
Why does NumPy need the file position? And - can I circumvent this somehow?
This error happens because np.fromfile() is implemented in a fairly counterintuitive way.
You might assume that it works by repeatedly calling e.g. file.read(4096), then copying the resulting buffer to the appropriate place in the array. It does not work like this.
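For contrast, here is a minimal sketch of that assumed read-and-copy approach. The helper name read_in_chunks and its count parameter are hypothetical; this is not how NumPy implements fromfile(), only an illustration of the idea described above.

```python
import io

import numpy as np


def read_in_chunks(f, dtype, count, chunk_size=4096):
    """Fill a preallocated array by calling f.read() repeatedly.

    Hypothetical helper for illustration; this is not NumPy's code.
    """
    arr = np.empty(count, dtype=dtype)
    # View the array's memory as raw bytes so chunks can be copied in place.
    raw = memoryview(arr).cast("B")
    offset = 0
    while offset < len(raw):
        chunk = f.read(min(chunk_size, len(raw) - offset))
        if not chunk:
            raise EOFError(f"expected {len(raw)} bytes, got {offset}")
        raw[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    return arr


# Works on any object with a binary read() method, including in-memory buffers.
source = io.BytesIO(np.arange(5, dtype=np.float32).tobytes())
print(read_in_chunks(source, np.float32, count=5, chunk_size=8))
```

A loop like this only needs read(), which is why it would work on pipes, at the cost of one Python call per chunk.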
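Instead, it follows roughly this process:
1. Flush the Python file object.
2. Duplicate its file descriptor using os.dup() in Python.
3. Get the current read position using f.tell() in Python.
4. Seek the duplicated descriptor to that position using fseek() in C.
At the end of this process, NumPy has a C-level file that it owns, and can copy data from it without the overhead of calling a Python method. It then reads the file and copies the data into the array.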
(You may be asking why steps 3 and 4 are necessary. Doesn't copying a file descriptor copy its read position? This is true, but it won't work if the Python file is buffered, as the C-level read position and Python-level read position may not match.)
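The duplicate-then-reposition idea can be sketched in pure Python. This is only an approximation under stated assumptions: dup_at_python_position is a hypothetical helper, and os.lseek() stands in for the C-level fseek() that NumPy uses. It also shows why the position has to be copied explicitly when the Python file is buffered.

```python
import os
import tempfile


def dup_at_python_position(f):
    """Return a duplicated descriptor positioned where the Python file is.

    Hypothetical helper that mimics the process in pure Python.
    """
    f.flush()                            # flush pending buffered data
    fd2 = os.dup(f.fileno())             # duplicate the descriptor
    pos = f.tell()                       # Python-level read position
    os.lseek(fd2, pos, os.SEEK_SET)      # move the OS-level position to match
    return fd2


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.bin")
    with open(path, "wb") as out:
        out.write(bytes(range(100)))

    with open(path, "rb") as f:          # buffered by default
        f.read(10)
        python_pos = f.tell()
        # The buffered reader has usually read ahead, so the descriptor's own
        # position can be larger than what tell() reports.
        os_pos = os.lseek(f.fileno(), 0, os.SEEK_CUR)
        fd2 = dup_at_python_position(f)
        try:
            next_byte = os.read(fd2, 1)  # continues from the Python position
        finally:
            os.close(fd2)

print(python_pos, os_pos, next_byte)
```

On a pipe, the repositioning step has nothing to work with, because a pipe has no seekable position.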
To clean up this file descriptor, NumPy does the following.
In order for np.fromfile() to work, your file-like object must support all of the following:
- fileno(), returning a real file descriptor that NumPy can duplicate.
- seek(). This rules out the use of pipes.
- tell().
- flush().
In practice, this rules out most file-like objects that are not really files.
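If you want to decide up front whether a given object is likely to work, a rough check along these lines may help. The function looks_usable_with_fromfile is a hypothetical heuristic, not something NumPy provides, and passing it does not guarantee that np.fromfile() will succeed.

```python
import io
import os
import tempfile


def looks_usable_with_fromfile(f):
    """Heuristic check; hypothetical helper, not part of NumPy."""
    try:
        f.fileno()
    except (AttributeError, OSError):
        # io.UnsupportedOperation is a subclass of OSError.
        return False
    if not all(callable(getattr(f, name, None)) for name in ("tell", "flush")):
        return False
    seekable = getattr(f, "seekable", None)
    return bool(seekable and seekable())


in_memory = looks_usable_with_fromfile(io.BytesIO(b"\x00" * 8))

with tempfile.TemporaryFile() as real_file:
    on_disk = looks_usable_with_fromfile(real_file)

read_fd, write_fd = os.pipe()
with os.fdopen(read_fd, "rb") as pipe_reader:
    piped = looks_usable_with_fromfile(pipe_reader)
os.close(write_fd)

print(in_memory, on_disk, piped)
```

Only the temporary file on disk passes; the in-memory buffer has no file descriptor and the pipe is not seekable.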
To learn more about this, I recommend reading the source code.
And - can I circumvent this somehow?
No. All four of those things are mandatory.
You can work around it, however. Assuming the file f is open in binary mode, you could call f.read() to obtain a bytes object. You can then pass this object to np.frombuffer() to obtain an array.
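A minimal sketch of that workaround follows. The helper name read_array is hypothetical, and np.float32 is an assumption about what the question means by 32-bit floats. Note that sys.stdin itself is a text-mode stream, so for piped input you would read from sys.stdin.buffer instead.

```python
import io

import numpy as np


def read_array(f, dtype=np.float32):
    """Read all remaining bytes from a binary stream into a 1-D array."""
    data = f.read()
    # frombuffer() raises ValueError if len(data) is not a multiple of the
    # item size, and the result is read-only because bytes is immutable.
    return np.frombuffer(data, dtype=dtype)


# In the script from the question, this might look like (names assumed):
#     stream = open(args.input_file_path, "rb") if args.input_file_path else sys.stdin.buffer
#     arr = read_array(stream)

demo = io.BytesIO(np.array([1.5, 2.5, 3.5], dtype=np.float32).tobytes())
arr = read_array(demo)
print(arr, arr.flags.writeable)
```

If you need to modify the array afterwards, call .copy() on the result. Keep in mind that this reads the whole input into memory before the array is created.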