python-3.xpython-2.7numpymemoryviewstructured-array

How to efficiently write raw bytes to numpy array data in python 3


While migrating some old python 2 code to python 3, I ran into some problems populating structured numpy arrays from bytes objects.

I have a parser that defines a specific dtype for each type of data structure I might encounter. Since, in general, a given data structure may have variable-length or variable-type fields, these have been represented in the numpy array as fields of object dtype (np.object #alternatively np.dtype('O')).

The array is obtained from bytes (or a bytearray) by first populating the fixed-dtype fields. After this, the dtype of any sub-arrays (contained in 'object' fields) can be built using information from the fixed fields that precede it.

Here is a partial example of this process (dealing only with the fixed-dtype fields) that works in python 2. Note that we have a field named 'nSamples', which will presumably tell us the length of the array pointed to by the 'samples' field of the array, which would be interpreted as a numpy array with shape (2,) and dtype sampleDtype:

fancyDtype = np.dtype([('blah',     '<u4'),
                       ('bleh',      'S5'),
                       ('nSamples', '<u8'),
                       ('samples',    'O')])

sampleDtype = np.dtype([('sampleId', '<u2'),
                        ('val',      '<f4')])

bytesFromFile = bytearray(
    b'*\x00\x00\x00hello\x02\x00\x00\x00\x00\x00\x00\x00\xd0\xb5'
    b'\x14_\xa1\x7f\x00\x00"\x00\x00\x00\x80?]\x00\x00\x00\xa0@')

arr = np.zeros((1,), dtype=fancyDtype)
numBytesFixedPortion = 17

# Start out by just reading the fixed-type portion of the array
arr.data[:numBytesFixedPortion] = bytesFromFile[:numBytesFixedPortion]
memoryview(arr.data)[:numBytesFixedPortion] = bytesFromFile[:numBytesFixedPortion]

Both of the last two statements here that work in python 2.7.

Of note is that if I type

arr.data

I get <read-write buffer for 0x7f7a93bb7080, size 25, offset 0 at 0x7f7a9339cf70>, which tells me this is a buffer. Obviously, memoryview(arr.data) returns a memoryview object.

Both of these statements raise the following exception in python 3.6:

NotImplementedError: memoryview: unsupported format T{I:blah:5s:bleh:=Q:nSamples:O:samples:}

This tells me that numpy is returning a different type with its data attribute access, a memoryview rather than a buffer. It also tells me that memoryviews worked in python 2.7 but don't in python 3.6 for this purpose.

I found a similar issue in numpy's issue tracker: https://github.com/numpy/numpy/issues/13617 However, the issue was closed quickly, with the numpy developer indicating that it is a bug in ctypes. Since ctypes is a builtin, I kind of gave up hope on just updating it to get a fix.

I did finally stumble upon a solution that works, though it takes roughly twice as long as the python 2.7 method. It is:

import struct
struct.pack_into(
    'B' * numBytesFixedPortion,  # fmt
    arr.data,                    # buffer
    0,                           # offset
    *buf[:numBytesFixedPortion]  # unpacked byte values
)

A coworker also suggested attempting to use this solution:

arrView = arr.view('u1')
arrView[:numBytesFixedPortion] = buf[:numBytesFixedPortion]

However, on doing this, I get the exception:

File "/home/tintedFrantic/anaconda2/envs/py3/lib/python3.6/site-packages/numpy/core/_internal.py", line 461, in _view_is_safe
    raise TypeError("Cannot change data-type for object array.")
TypeError: Cannot change data-type for object array.

Note that I get this exception in both python 2.7 and 3.6. It appears numpy disallows views on arrays with any object fields. (Aside: I was able to get numpy to do this correctly by commenting out the check for object-type fields in the numpy code, though that seems a dangerous solution (and not a very portable one either)).

I've also tried creating separate arrays, one with the fixed-dtype fields and one with the object-dtype field and then using numpy.lib.recfunctions.merge_arrays to merge them. That fails with a cryptic message that I can't remember.

I am at a bit of a loss. I just want to write some arbitrary bytes to the numpy array's underlying memory and do it efficiently. This doesn't seem like it should be too hard to do, but I haven't come across a good way to do it. I would like a solution that isn't a hack either, as this is going into systems that need high reliability. If nothing better exists, I will use the struct.pack_into() solution, but I am hoping someone out there knows a better way. By the way, NOT using object-dtype fields is NOT a viable option, as the cost of doing so would be prohibitive.

If it matters, I am using numpy 1.16.2 in python 2.7 and 1.17.4 for python 3.6.


Solution

  • Per the suggestion of @nawsleahcimnoraa, I found out that in python 3.3+ (so not in python 2.7), the memoryview object, which is returned by arr.data in my python 3 environment, has a cast() method. Thus, I can do

    arr.data.cast('B')[startIdx:endIdx] = buf[:numBytes]
    

    This is much more like what I had in python 2.7. It is a lot more concise and also performs a little better than the struct method above.

    One thing I noticed in testing these solutions is that, in general, the python 3 solutions were slower than the python 2 versions. For example, I tried the struct solution both using python 2 and python 3 and found a significant increase in processing time for python 3.

    I also found fairly sizable discrepancies between different python environments of the same version. For example, I found that my system install of python 3.6 performed better than a virtual environment install of python 3.6, so it seems that the results will likely depend largely on a given environment's configuration.

    Overall, I am happy with the results of using the cast() method of the memoryview object returned by arr.data and will use that for now. However, if someone discovers something that works better, I would still love to hear about it.