Tags: python, numpy, file-io, pipe

Why does np.fromfile fail when reading from a pipe?


In a Python script, I've written:

# etc. etc.
input_file = args.input_file_path or sys.stdin
arr = numpy.fromfile(input_file, dtype=numpy.dtype('f32'))

When I run the script, I get:

$ cat nums.fp32.bin | ./myscript
  File "./myscript", line 123, in main
    arr = numpy.fromfile(input_file, dtype=numpy.dtype('f32'))
OSError: obtaining file position failed

Why does NumPy need the file position? And - can I circumvent this somehow?


Solution

  • This error happens because np.fromfile() is implemented in a fairly counterintuitive way.

    You might assume that this is implemented by repeatedly calling e.g. file.read(4096), then copying the resulting buffer to the appropriate place in the array. It does not work like this.

    Instead, it roughly follows this process:

    1. Find the file descriptor number of the file object.
    2. Copy that file descriptor using os.dup() in Python.
    3. Find the read position within the original file by calling f.tell() in Python.
    4. Set the copied file descriptor to the same read position using fseek() in C.

    At the end of this process, NumPy has a C-level file that it owns, and can copy data from without the overhead of calling a Python method. It then reads the file and copies it into the array.

    (You may be asking why steps 3 and 4 are necessary. Doesn't copying a file descriptor copy its read position? It does, but that alone is not enough if the Python file is buffered, because the C-level read position and the Python-level read position may not match.)

    To clean up this file descriptor, NumPy does the following:

    1. Find the seek position of the C-level file.
    2. Copy the seek position to the Python-level file.
    3. Close the C-level file.
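
    The setup and cleanup steps above can be imitated in pure Python. This is only a rough sketch of the idea, not NumPy's actual code: the helper name read_via_dup is made up for illustration, and a second Python file object stands in for the C-level file.

```python
import os
import tempfile


def read_via_dup(f, nbytes):
    """Imitate the dup/tell/seek dance on a binary file object `f`.

    Hypothetical helper for illustration only; NumPy does this in C.
    """
    fd = f.fileno()                      # setup 1: needs a real descriptor
    dup = os.fdopen(os.dup(fd), "rb")    # setup 2: our own copy of it
    try:
        pos = f.tell()                   # setup 3: fails on a pipe
        dup.seek(pos)                    # setup 4: line the copy up with `f`
        data = dup.read(nbytes)          # the actual read
        f.seek(dup.tell())               # cleanup 1-2: copy the position back
    finally:
        dup.close()                      # cleanup 3: close our copy
    return data


if __name__ == "__main__":
    # Works on a regular file, even after a buffered partial read.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"abcdefgh")
    with open(tmp.name, "rb") as f:
        f.read(2)
        print(read_via_dup(f, 4))
        print(f.read())
    os.remove(tmp.name)
```

    Every line marked "setup" or "cleanup" is a point where a pipe or an in-memory object can fail, which is why the position lookup shows up in the traceback.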

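    In order for np.fromfile() to work, your file-like object must support all of the following (as implied by the steps above):

    1. It has a real OS-level file descriptor, available through fileno().
    2. That file descriptor can be duplicated with os.dup().
    3. Its read position can be queried with tell().
    4. Its read position can be set with seek().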

    In practice, this rules out most file-like objects that are not really files.
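
    To see which of the operations used in the steps above a given object lacks, something like the following can be used. It is a sketch under assumptions: the function name missing_requirements is hypothetical, the checks mirror the description in this answer rather than NumPy's source, and the pipe behaviour shown is what I would expect on POSIX systems.

```python
import io
import os


def missing_requirements(f):
    """Return the names of the operations that `f` cannot perform.

    Hypothetical helper: checks fileno(), os.dup(), tell() and seek().
    """
    try:
        fd = f.fileno()
    except (AttributeError, OSError):    # io.UnsupportedOperation is an OSError
        return ["fileno"]                # nothing else can be checked
    problems = []
    try:
        os.close(os.dup(fd))
    except OSError:
        problems.append("dup")
    try:
        f.tell()
    except OSError:
        problems.append("tell")
    try:
        f.seek(0, os.SEEK_CUR)           # a no-op seek on a real file
    except OSError:
        problems.append("seek")
    return problems


if __name__ == "__main__":
    print(missing_requirements(io.BytesIO(b"1234")))   # in-memory object

    read_fd, write_fd = os.pipe()
    with os.fdopen(read_fd, "rb") as pipe_reader:
        print(missing_requirements(pipe_reader))       # read end of a pipe
    os.close(write_fd)
```

    An empty list means the object passes these particular checks; it does not guarantee that np.fromfile() will accept it.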

    To learn more about this, I recommend reading the source code.

    And - can I circumvent this somehow?

    No. All four of those things are mandatory.

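    You can work around it, however. Assuming the file f is open in binary mode, you could call f.read() to obtain a bytes object, then pass that object to np.frombuffer() to obtain an array. Note that sys.stdin is opened in text mode, so for standard input you would read from sys.stdin.buffer instead.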
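
    A minimal sketch of that workaround, assuming the input holds raw native-endian 32-bit floats. The names open_input and read_float32_stream are made up for illustration, and np.float32 is used to spell the dtype.

```python
import sys

import numpy as np


def open_input(path=None):
    """Return a binary stream: the named file, or standard input's byte stream."""
    return open(path, "rb") if path else sys.stdin.buffer


def read_float32_stream(stream):
    """Read everything left in a binary stream and view it as float32 values."""
    data = stream.read()   # plain read(): no tell() or seek() needed
    # Raises ValueError if len(data) is not a multiple of 4 bytes.
    return np.frombuffer(data, dtype=np.float32)


if __name__ == "__main__":
    # Hypothetical command line: an optional path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else None
    arr = read_float32_stream(open_input(path))
    print(arr.shape, arr.dtype)
```

    The trade-off is that the whole input is held in memory as a bytes object before the array is created. The array returned by np.frombuffer() on a bytes object is read-only, so call .copy() on it if you need to modify the data.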