Suppose I have data in a numpy.recarray, and I want to extract some of its columns. I want this to be an efficient copy, since data may be huge (I don't want to copy everything), but I will likely change these features without wanting to change data (I don't want a view).
Today, I would do the following:
import numpy as np

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])
data = data.view(np.recarray)
features = data[['feature_1', 'feature_2']]
However, it raises the following FutureWarning from NumPy:
/path/to/numpy/core/records.py:513: FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array.
This code may break in numpy 1.15 because this will return a view instead of a copy -- see release notes for details.
return obj.view(dtype=(self.dtype.type, obj.dtype))
This warning is very welcome, because I don't want a breaking change when I update NumPy. However, even after going through the release notes, it is not clear what the best way is to write code that extracts columns with this copy behavior today and that will remain stable through upcoming releases.
In my particular case, near-optimal efficiency is required, and Pandas is unavailable. In these conditions, what would be the best workaround for this situation?
As noted, multifield selection is in a state of flux. I recently updated to 1.14.2, and the behavior is back to what it was before 1.14.0.
In [114]: data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
     ...:                 dtype=[('feature_1', float), ('feature_2', float), ('result', int)])
...:
In [115]: data
Out[115]:
array([(1., 2., 0), (3., 4., 1)],
dtype=[('feature_1', '<f8'), ('feature_2', '<f8'), ('result', '<i8')])
In [116]: features = data[['feature_1', 'feature_2']]
In [117]: features
Out[117]:
array([(1., 2.), (3., 4.)],
dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
(I'm omitting the extra layer of recarray conversion.)
In 1.14.0 this dtype would include an offset value, indicating that features was a view, not a copy.
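One way to spot such a layout is to look at the field offsets and the itemsize of the dtype. The sketch below is only an illustration: is_packed is a made-up helper name, not part of NumPy, and it assumes a simple dtype without titles or overlapping fields. The padded dtype is built by hand to mimic what a view-style selection would carry.

```python
import numpy as np

def is_packed(dtype):
    """Hypothetical helper: True if the fields fill each record with no gaps."""
    # dtype.fields maps each name to (field dtype, byte offset)
    used = sum(info[0].itemsize for info in dtype.fields.values())
    return used == dtype.itemsize

# Layout of an independent two-field array: 16 bytes per record
packed = np.dtype([('feature_1', float), ('feature_2', float)])

# Hand-built layout that keeps the 24-byte records of the original data
padded = np.dtype({'names': ['feature_1', 'feature_2'],
                   'formats': [float, float],
                   'offsets': [0, 8],
                   'itemsize': 24})

for name in padded.names:
    print(name, 'offset', padded.fields[name][1])

print(is_packed(packed), is_packed(padded))
```

A dtype whose itemsize is larger than the sum of its fields is a hint, not proof, that the array is a window onto a larger buffer.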
I can change values of features without changing data:
In [124]: features['feature_1']
Out[124]: array([1., 3.])
In [125]: features['feature_1'] = [4,5]
In [126]: features
Out[126]:
array([(4., 2.), (5., 4.)],
dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
In [127]: data
Out[127]:
array([(1., 2., 0), (3., 4., 1)],
dtype=[('feature_1', '<f8'), ('feature_2', '<f8'), ('result', '<i8')])
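The same question can be asked without modifying anything, which may be useful while the behavior differs between releases. A minimal sketch, assuming np.shares_memory is available in your NumPy; is_view_of is a name invented here for illustration.

```python
import numpy as np

def is_view_of(candidate, original):
    """Hypothetical helper: True when the two arrays overlap in memory."""
    return bool(np.shares_memory(candidate, original))

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])

features = data[['feature_1', 'feature_2']]

# Result depends on the NumPy version (copy in some releases, view in others)
print('multifield selection is a view:', is_view_of(features, data))

# Reference points: a single field is a view, an explicit copy is not
print('single field is a view:', is_view_of(data['feature_1'], data))
print('explicit copy is a view:', is_view_of(data.copy(), data))
```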
But without delving into the development discussion, I can't say what the long-term solution will be. Ideally it should offer both a view (which maintains a link to the original data buffer) and a copy (an array that is independent and freely modifiable).
I suspect the copy version will follow the recfunctions practice of constructing a new array with the new dtype, and then copying data field by field.
In [132]: data.dtype.descr
Out[132]: [('feature_1', '<f8'), ('feature_2', '<f8'), ('result', '<i8')]
In [133]: dt = data.dtype.descr[:-1]
In [134]: dt
Out[134]: [('feature_1', '<f8'), ('feature_2', '<f8')]
In [135]: arr = np.zeros(data.shape, dtype=dt)
In [136]: arr
Out[136]:
array([(0., 0.), (0., 0.)],
dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
In [137]: for name in arr.dtype.fields:
...: arr[name] = data[name]
...:
In [138]: arr
Out[138]:
array([(1., 2.), (3., 4.)],
dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
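The steps above can be wrapped in a small function. This is a sketch under assumptions, not a NumPy API: copy_fields is a hypothetical name, and it handles only the top level of the dtype (a nested field is copied whole). Only the requested fields are allocated and copied, and the result does not depend on how multifield indexing behaves in a given release.

```python
import numpy as np

def copy_fields(arr, names):
    """Hypothetical helper: independent array holding only the named fields."""
    # Build a packed dtype from the selected fields, in the order given
    new_dtype = np.dtype([(name, arr.dtype.fields[name][0]) for name in names])
    out = np.empty(arr.shape, dtype=new_dtype)
    # Single-field access is a plain view, so each assignment copies one column
    for name in names:
        out[name] = arr[name]
    # Keep the class of the input (for example np.recarray)
    return out.view(type(arr))

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])

features = copy_fields(data, ['feature_1', 'feature_2'])
features['feature_1'] = [4, 5]   # does not touch data

print(features)
print(data)
```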
or another recfunctions function:
In [159]: rf.drop_fields(data, 'result')
Out[159]:
array([(1., 2.), (3., 4.)],
dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
recfunctions has code that can copy complex dtypes, ones with nested dtypes and such. But for a simple one-layered dtype like this, simple field name iteration is enough.
In general, structured arrays (and recarrays) have many records and a limited number of fields, so copying fields by name is relatively efficient.
In [150]: import numpy.lib.recfunctions as rf
In [154]: arr = np.zeros(data.shape, dtype=dt)
In [155]: rf.recursive_fill_fields(data, arr)
Out[155]:
array([(1., 2.), (3., 4.)],
dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
Note that the drop_fields code itself ends with:
output = np.empty(base.shape, dtype=newdtype)
output = recursive_fill_fields(base, output)
Development notes at some point alluded to a recfunctions.compress_fields function, but that apparently was never actually added.
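For readers on newer NumPy: to the best of my knowledge, releases from 1.16 onward ship numpy.lib.recfunctions.repack_fields, which appears to fill this role. The sketch below assumes such a version, where multifield indexing returns a view and repack_fields converts it into a packed, independent array. It will not run on the 1.14.x releases discussed above.

```python
import numpy as np
import numpy.lib.recfunctions as rf

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])

# On the assumed versions this is a view that keeps the original record size
selection = data[['feature_1', 'feature_2']]

# Copy into a packed layout; the dropped 'result' column is not carried along
features = rf.repack_fields(selection)
features['feature_1'] = [4, 5]   # data stays unchanged

print(features.dtype.itemsize)
print(features)
print(data)
```

Note that repack_fields may hand back its input unchanged when the dtype is already packed, so it is worth checking with np.shares_memory if independence matters.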