While migrating some old python 2 code to python 3, I ran into some problems populating structured numpy arrays from bytes objects.
I have a parser that defines a specific dtype for each type of data structure I might encounter. Since, in general, a given data structure may have variable-length or variable-type fields, these have been represented in the numpy array as fields of object dtype (np.object, or equivalently np.dtype('O')).
The array is obtained from bytes (or a bytearray) by first populating the fixed-dtype fields. After this, the dtype of any sub-arrays (contained in 'object' fields) can be built using information from the fixed fields that precede it.
Here is a partial example of this process (dealing only with the fixed-dtype fields) that works in python 2. Note that we have a field named 'nSamples', which will presumably tell us the length of the array pointed to by the 'samples' field of the array, which would be interpreted as a numpy array with shape (2,) and dtype sampleDtype:
import numpy as np

fancyDtype = np.dtype([('blah', '<u4'),
                       ('bleh', 'S5'),
                       ('nSamples', '<u8'),
                       ('samples', 'O')])
sampleDtype = np.dtype([('sampleId', '<u2'),
                        ('val', '<f4')])

bytesFromFile = bytearray(
    b'*\x00\x00\x00hello\x02\x00\x00\x00\x00\x00\x00\x00\xd0\xb5'
    b'\x14_\xa1\x7f\x00\x00"\x00\x00\x00\x80?]\x00\x00\x00\xa0@')
arr = np.zeros((1,), dtype=fancyDtype)
numBytesFixedPortion = 17
# Start out by just reading the fixed-type portion of the array
arr.data[:numBytesFixedPortion] = bytesFromFile[:numBytesFixedPortion]
memoryview(arr.data)[:numBytesFixedPortion] = bytesFromFile[:numBytesFixedPortion]
Both of the last two statements here work in python 2.7.
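For context, the second stage (building the 'samples' sub-array from the fixed fields) would look roughly like the sketch below. The samplesOffset value is only a guess for this particular byte dump; in the real parser, that offset comes from the format definition.
# Sketch of the second stage, assuming the fixed portion has already been written
# (as above). Guess for this dump: the sample payload starts after the 17 fixed
# bytes plus 8 bytes that look like a serialized object pointer.
nSamples = int(arr['nSamples'][0])          # 2, read back from the fixed fields
samplesOffset = numBytesFixedPortion + 8    # hypothetical offset of the payload
samples = np.frombuffer(bytesFromFile, dtype=sampleDtype,
                        count=nSamples, offset=samplesOffset)
arr['samples'][0] = samples                 # the (2,) sub-array goes in the object field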
Of note is that if I type arr.data I get <read-write buffer for 0x7f7a93bb7080, size 25, offset 0 at 0x7f7a9339cf70>, which tells me this is a buffer. Obviously, memoryview(arr.data) returns a memoryview object.
In python 3.6, however, both of those slice assignments into arr.data raise the following exception:
NotImplementedError: memoryview: unsupported format T{I:blah:5s:bleh:=Q:nSamples:O:samples:}
This tells me that numpy is returning a different type with its data attribute access, a memoryview rather than a buffer. It also tells me that memoryviews worked in python 2.7 but don't in python 3.6 for this purpose.
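A quick way to see what changed (just an illustrative check, not a fix):
# In python 3, arr.data is already a memoryview, and numpy exports the full
# structured dtype as the buffer's format string; the slice assignment fails
# because memoryview can't assign items with that compound format.
mv = arr.data
print(type(mv))   # <class 'memoryview'>
print(mv.format)  # something like 'T{I:blah:5s:bleh:=Q:nSamples:O:samples:}'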
I found a similar issue in numpy's issue tracker: https://github.com/numpy/numpy/issues/13617
However, the issue was closed quickly, with the numpy developer indicating that it is a bug in ctypes. Since ctypes is part of the standard library, I kind of gave up hope on just updating it to get a fix.
I did finally stumble upon a solution that works, though it takes roughly twice as long as the python 2.7 method. It is:
import struct

struct.pack_into(
    'B' * numBytesFixedPortion,            # fmt
    arr.data,                              # buffer
    0,                                     # offset
    *bytesFromFile[:numBytesFixedPortion]  # unpacked byte values
)
A coworker also suggested trying this solution:
arrView = arr.view('u1')
arrView[:numBytesFixedPortion] = bytesFromFile[:numBytesFixedPortion]
However, on doing this, I get the exception:
File "/home/tintedFrantic/anaconda2/envs/py3/lib/python3.6/site-packages/numpy/core/_internal.py", line 461, in _view_is_safe
raise TypeError("Cannot change data-type for object array.")
TypeError: Cannot change data-type for object array.
Note that I get this exception in both python 2.7 and 3.6. It appears numpy disallows views on arrays with any object fields. (Aside: I was able to get numpy to do this correctly by commenting out the check for object-type fields in the numpy code, though that seems a dangerous solution, and not a very portable one either.)
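To confirm that it really is the object field tripping that check, the same byte-level view works on a dtype containing only the fixed fields (a small demonstration only; it doesn't help my actual case, since I still need the object field):
# Same trick on an object-free dtype: the view is allowed and the bytes land correctly.
fixedOnlyDtype = np.dtype([('blah', '<u4'), ('bleh', 'S5'), ('nSamples', '<u8')])
fixedArr = np.zeros((1,), dtype=fixedOnlyDtype)
fixedView = fixedArr.view('u1')   # flat view of the 17 underlying bytes
fixedView[:numBytesFixedPortion] = bytesFromFile[:numBytesFixedPortion]
print(fixedArr)   # [(42, b'hello', 2)]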
I've also tried creating separate arrays, one with the fixed-dtype fields and one with the object-dtype field, and then using numpy.lib.recfunctions.merge_arrays to merge them. That fails with a cryptic message that I can't remember.
I am at a bit of a loss. I just want to write some arbitrary bytes to the numpy array's underlying memory and do it efficiently. This doesn't seem like it should be too hard to do, but I haven't come across a good way to do it. I would like a solution that isn't a hack either, as this is going into systems that need high reliability. If nothing better exists, I will use the struct.pack_into() solution, but I am hoping someone out there knows a better way. By the way, NOT using object-dtype fields is NOT a viable option, as the cost of doing so would be prohibitive.
If it matters, I am using numpy 1.16.2 with python 2.7 and numpy 1.17.4 with python 3.6.
Per the suggestion of @nawsleahcimnoraa, I found out that in python 3.3+ (so not in python 2.7), the memoryview object, which is returned by arr.data in my python 3 environment, has a cast() method. Thus, I can do:
arr.data.cast('B')[startIdx:endIdx] = buf[:numBytes]
This is much more like what I had in python 2.7. It is a lot more concise and also performs a little better than the struct method above.
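For reference, applied to the example above, the whole fixed-portion write becomes:
# python 3.3+: cast the structured memoryview to a plain byte view, then slice-assign.
arr = np.zeros((1,), dtype=fancyDtype)
arr.data.cast('B')[:numBytesFixedPortion] = bytesFromFile[:numBytesFixedPortion]
print(arr['blah'][0], arr['bleh'][0], arr['nSamples'][0])   # 42 b'hello' 2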
One thing I noticed in testing these solutions is that, in general, the python 3 solutions were slower than the python 2 versions. For example, I tried the struct solution in both python 2 and python 3 and found a significant increase in processing time for python 3.
I also found fairly sizable discrepancies between different python environments of the same version. For example, I found that my system install of python 3.6 performed better than a virtual environment install of python 3.6, so it seems that the results will likely depend largely on a given environment's configuration.
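For anyone who wants to reproduce the struct-vs-cast comparison, something like this works under python 3.5+ (a rough sketch only; it assumes the variables from the example above are defined at module level, and the numbers will swing with the environment):
import timeit

nIter = 100000
tStruct = timeit.timeit(
    "struct.pack_into('B' * numBytesFixedPortion, arr.data, 0,"
    " *bytesFromFile[:numBytesFixedPortion])",
    globals=globals(), number=nIter)
tCast = timeit.timeit(
    "arr.data.cast('B')[:numBytesFixedPortion] = bytesFromFile[:numBytesFixedPortion]",
    globals=globals(), number=nIter)
print('struct: %.3f us/write, cast: %.3f us/write'
      % (tStruct / nIter * 1e6, tCast / nIter * 1e6))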
Overall, I am happy with the results of using the cast() method of the memoryview object returned by arr.data and will use that for now. However, if someone discovers something that works better, I would still love to hear about it.