python numpy structured-array

Numpy structured array creation not working as intended when using another numpy array


I am trying to create a NumPy structured array from other arrays in Python. However, this does not work as I would expect:

# this does what I want
In [3]: x = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i8'), ('bar', 'f4')])

In [4]: x['foo']
Out[4]: array([1, 3])

In [5]: x['foo'].shape
Out[5]: (2,)

# when creating the array from another array, the structure is different
In [6]: y = np.array(np.array([(1, 2), (3, 4)]), dtype=[('foo', 'i8'), ('bar', 'f4')])

In [7]: y['foo'].shape
Out[7]: (2, 2)

# unpacking and packing into a list does not work either
In [8]: z = np.array([zz for zz in np.array([(1, 2), (3, 4)])], dtype=[('foo', 'i8'), ('bar', 'f4')])

In [9]: z['foo'].shape
Out[9]: (2, 2)

So the structured array x does what I expect and want. But when I build it from another NumPy array, which I need for my application, the structure is different, and I cannot access the fields the way I can for x.

Unpacking the values and packing them back into a list does not work either.

Unfortunately, the documentation is not clear enough (at least for me) on how to do this. Cheers


Solution

  • There might be a cleverer way (I have never been fond of ragged or structured arrays. I use numpy for good old monolithic arrays of uniformly typed data; when I need different types or fields, I fall back on other things like pandas. So, again, there are probably better ways). But here, I am just translating your attempt into a working one:

    import numpy as np
    arr = np.array([(1, 2), (3, 4)])
    # each row must be a tuple to be read as one structured record
    y = np.array([tuple(zz) for zz in arr], dtype=[('foo', 'i8'), ('bar', 'f4')])
    

    The idea is quite rudimentary: "if it works with tuples, let them have tuples" :D

    But again, maybe there are better ideas.

    For example:

    y = np.empty((len(arr),), dtype=[('foo', 'i8'), ('bar', 'f4')])
    # fill one field at a time from the matching column of arr
    for i, k in enumerate(y.dtype.names):
        y[k] = arr[:, i]
    

    This also works, and is probably faster. There is still a pure Python loop, but it runs only over the fields, whereas the previous one runs over the rows. And usually you have far more rows than fields.

    As for why it doesn't work from arrays: understand that the result you want is not a 2×2 2D array, as your input array is. It is a 1D array of 2 elements (shape (2,)), with each cell being a structure. That is why it is still quite efficient: numpy can iterate through each field, using shape and strides, as efficiently as through any other array.

    So your first line starts from a 1D list of tuples, from which you build a 1D array of "structures".

    Your other attempts start from 2D arrays, and numpy keeps that 2D shape, so y['foo'] comes out with shape (2, 2).
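
    A small sketch of that shape difference, reusing the question's dtype (the variable names are mine, for illustration only):

```python
import numpy as np

dt = [('foo', 'i8'), ('bar', 'f4')]

# 1D list of tuples: each tuple becomes one record
x = np.array([(1, 2), (3, 4)], dtype=dt)

# 2D array of plain integers: the 2D shape is kept, and every scalar
# is converted to a record of its own
arr = np.array([(1, 2), (3, 4)])
y = np.array(arr, dtype=dt)

print(x.shape, y.shape)
print(y[0, 1])  # the record built from the single scalar arr[0, 1]
```

    Printing y[0, 1] should show that one input scalar ended up in both fields of its record, which is why the field views have the same 2D shape as the input.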

    Timing

    Edit: in the meantime I've tested some timings, and they confirm my first opinion: my second code is faster, under the hypothesis I made, namely far more rows than fields. Starting from a 10000×2 array, the first code runs in 8 ms, while the second runs in 29 μs.
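
    For anyone who wants to reproduce the comparison, here is a rough sketch with timeit. The function names and the test array are my own choices, and the absolute numbers will depend on your machine and numpy version:

```python
import timeit
import numpy as np

dt = [('foo', 'i8'), ('bar', 'f4')]
arr = np.arange(20000).reshape(10000, 2)  # 10000 rows, 2 fields

def from_tuples(a):
    # first approach: Python loop over the rows
    return np.array([tuple(row) for row in a], dtype=dt)

def by_field(a):
    # second approach: Python loop over the fields only
    out = np.empty((len(a),), dtype=dt)
    for i, name in enumerate(out.dtype.names):
        out[name] = a[:, i]
    return out

n = 20  # number of repetitions per measurement
for fn in (from_tuples, by_field):
    per_call = timeit.timeit(lambda: fn(arr), number=n) / n
    print(f"{fn.__name__}: {per_call * 1e6:.0f} µs per call")
```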

    Edit after hpaulj's answer

    So, as I suspected, there is indeed a smarter way. Though my Python wouldn't let me run his code directly, because it needs a proper dtype object, not the plain list [('foo', 'i8'), ('bar', 'f4')]. But that is easily solved:

    import numpy.lib.recfunctions as rf
    rf.unstructured_to_structured(arr, dtype=np.dtype([('foo', 'i8'), ('bar', 'f4')]))
    

    Nevertheless, timing-wise, though almost as fast, that method seems (strangely) slower than my second one. In the same example, it takes 39 μs instead of the 29 μs of the "empty, then loop over fields" version.

    Still, that is in the same order of magnitude (compared to the 8 ms of building a list of tuples), and it may be better to use standard functions rather than reinvent the wheel.
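
    If you go the standard-function route, a small wrapper can take care of the dtype conversion. This is only a sketch: to_structured is a hypothetical helper name of mine, not part of numpy, and it assumes the number of columns matches the number of fields:

```python
import numpy as np
from numpy.lib import recfunctions as rf

def to_structured(a, fields):
    # Hypothetical helper: wrap the field list in np.dtype so that a plain
    # list of (name, type) pairs is accepted
    return rf.unstructured_to_structured(a, dtype=np.dtype(fields))

arr = np.array([(1, 2), (3, 4)])
y = to_structured(arr, [('foo', 'i8'), ('bar', 'f4')])
print(y.shape, y['foo'], y['bar'])
```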