I am doing some work with structured arrays in numpy (that I will eventually convert to a pandas dataframe).
Now, I generate this structured array by reading in some data (actually memmapping some data) and then filtering it by user-specified constraints. I then want to convert this data out of the form that I read it in as (everything is an int to conserve space in the file I read it from) into a more usable format so I can do some unit conversions (i.e. upconvert it to a float).
I noticed an interesting artifact (or something) along the way while changing a structured data type. Say that reading in the data results in the same structured array as is created by the following (note that in the actual code the dtype is much longer and much more complex, but this suffices for an MWE):
import numpy as np
names = ['foo', 'bar']
formats = ['i4', 'i4']
dtype = np.dtype({'names': names, 'formats': formats})
data = np.array([(1, 2), (3, 4)], dtype=dtype)
print(data)
print(data.dtype)
This creates
[(1, 2) (3, 4)]
[('foo', '<i4'), ('bar', '<i4')]
as the structured array
Now, say I want to upconvert both of these dtypes to double while also renaming the second component. That seems like it should be easy
names[1] = 'baz'
formats[0] = np.float64  # np.float was only an alias for the builtin float and is gone from recent NumPy
formats[1] = np.float64
dtype_new = np.dtype({'names': names, 'formats': formats})
data2 = data.copy().astype(dtype_new)
print(data2)
print(data2.dtype)
but the result is unexpected
[(1.0, 0.0) (3.0, 0.0)]
[('foo', '<f8'), ('baz', '<f8')]
What happened to the data from the second component? We can do this conversion, however, if we split things up into a type change followed by a rename:
names[1] = 'bar'  # restore the original name so this step changes only the formats
dtype_new3 = np.dtype({'names': names, 'formats': formats})
data3 = data.copy().astype(dtype_new3)
print(data3)
print(data3.dtype)
names[1] = 'baz'
data4 = data3.copy()
data4.dtype.names = names
print(data4)
print(data4.dtype)
which results in the correct output
[(1.0, 2.0) (3.0, 4.0)]
[('foo', '<f8'), ('bar', '<f8')]
[(1.0, 2.0) (3.0, 4.0)]
[('foo', '<f8'), ('baz', '<f8')]
It appears that when astype is called with a structured dtype, numpy matches the names for each component and then applies the specified type to the contents (just guessing here, I didn't look at the source code). Is there any way to do this conversion all at once (i.e. the rename and the upconversion of the format), or does it simply need to be done in steps? (It's not a huge deal if it needs to be done in steps, but it seems odd to me that there's not a single-step way to do this.)
There is a library of functions designed to work with recarray (and thus structured arrays). It's kind of hidden so I'll have to do a search to find it. It has functions for renaming fields, adding and deleting fields, etc. The general pattern of action is to make a new array with the target dtype, and then copy fields one by one. Since an array usually has many elements and a small number of fields, this doesn't slow things down much.
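The make-a-new-array-then-copy-fields pattern described above can be sketched roughly as follows. This is only an illustration, not the library's actual code: convert_fields and its name_map argument are hypothetical names made up for this example. Because it copies each field explicitly, it does not depend on how astype matches fields in a given NumPy version.

```python
import numpy as np


def convert_fields(arr, new_dtype, name_map):
    """Hypothetical helper: copy the fields of `arr` into a new array.

    name_map maps a new field name to the old field it is filled from;
    fields not listed are looked up under their own name.
    """
    new_dtype = np.dtype(new_dtype)
    out = np.zeros(arr.shape, dtype=new_dtype)  # target array, zero-filled
    for new_name in new_dtype.names:
        old_name = name_map.get(new_name, new_name)
        if old_name in arr.dtype.names:
            # Assigning a whole field casts it to the target field type.
            out[new_name] = arr[old_name]
    return out


data = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i4'), ('bar', 'i4')])
data2 = convert_fields(data, [('foo', 'f8'), ('baz', 'f8')], {'baz': 'bar'})
print(data2)
print(data2.dtype)
```

Fields in the new dtype that have no source field simply stay zero, which mirrors the zero-filled behaviour described in the question.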
It looks like this astype method is using some of that library's code, or maybe compiled code that behaves the same way.
So yes, it does look like we need to change field dtypes and names in separate steps.
In [1279]: data=np.array([(1,2),(3,4)],dtype='i,i')
In [1280]: data
Out[1280]:
array([(1, 2), (3, 4)],
dtype=[('f0', '<i4'), ('f1', '<i4')])
In [1281]: dataf=data.astype('f8,f8') # change dtype, same default names
In [1282]: dataf
Out[1282]:
array([(1.0, 2.0), (3.0, 4.0)],
dtype=[('f0', '<f8'), ('f1', '<f8')])
Easy name change:
In [1284]: dataf.dtype.names=['one','two']
In [1285]: dataf
Out[1285]:
array([(1.0, 2.0), (3.0, 4.0)],
dtype=[('one', '<f8'), ('two', '<f8')])
In [1286]: data.astype(dataf.dtype)
Out[1286]:
array([(0.0, 0.0), (0.0, 0.0)],
dtype=[('one', '<f8'), ('two', '<f8')])
The astype with no match in names produces a zero array, same as np.zeros(data.shape, dataf.dtype). By matching names, rather than position in the dtype, I can reorder values, and even add fields.
In [1291]: data.astype([('f1','f8'),('f0','f'),('f3','i')])
Out[1291]:
array([(2.0, 1.0, 0), (4.0, 3.0, 0)],
dtype=[('f1', '<f8'), ('f0', '<f4'), ('f3', '<i4')])
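For the original foo/bar example, the two steps can also be chained in one expression. The sketch below assumes the library mentioned above is numpy.lib.recfunctions and uses its rename_fields function. Renaming first and converting second means the field names and the field positions both line up when astype runs, so the result should not depend on which of the two astype matches on.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

data = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i4'), ('bar', 'i4')])

# Step 1: rename 'bar' to 'baz' (the field types are unchanged).
renamed = rfn.rename_fields(data, {'bar': 'baz'})

# Step 2: upconvert; names and order now agree with the target dtype.
data2 = renamed.astype([('foo', 'f8'), ('baz', 'f8')])

print(data2)
print(data2.dtype)
```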