python-3.x numpy recarray

Copy a sub-recarray in stable NumPy


Suppose I have data in a numpy.recarray, and I want to extract some of its columns. I want this to be an efficient copy: data may be huge, so I don't want to copy everything, but I will likely modify these features without wanting to change data, so I don't want a view either.

Today, I would do the following:

import numpy as np

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])
data = data.view(np.recarray)

# Multi-field selection: this is the line that triggers the FutureWarning
features = data[['feature_1', 'feature_2']]

However, it raises the following FutureWarning from NumPy:

/path/to/numpy/core/records.py:513: FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array.

This code may break in numpy 1.15 because this will return a view instead of a copy -- see release notes for details.

return obj.view(dtype=(self.dtype.type, obj.dtype))

This warning is very welcome, because I don't want a breaking change when I update NumPy. However, even after going through the release notes, it is not clear how best to write code that extracts columns as a copy today and that will remain stable through the upcoming releases.

In my particular case, near-optimal efficiency is required, and Pandas is unavailable. In these conditions, what would be the best workaround for this situation?


Solution

  • As noted, multifield selection is in a state of flux. I recently updated to 1.14.2, and the behavior is back to what it was before 1.14.0.

    In [114]: data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
         ...:             dtype=[('feature_1', float), ('feature_2', float), ('result', int)])
    In [115]: data
    Out[115]: 
    array([(1., 2., 0), (3., 4., 1)],
          dtype=[('feature_1', '<f8'), ('feature_2', '<f8'), ('result', '<i8')])
    In [116]: features = data[['feature_1', 'feature_2']]
    In [117]: features
    Out[117]: 
    array([(1., 2.), (3., 4.)],
          dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
    

    (I'm omitting the extra layer of recarray conversion.)

    In 1.14.0 this dtype would include an offset value, indicating that features was a view, not a copy.

    I can change values of features without changing data:

    In [124]: features['feature_1']
    Out[124]: array([1., 3.])
    In [125]: features['feature_1'] = [4,5]
    In [126]: features
    Out[126]: 
    array([(4., 2.), (5., 4.)],
          dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
    In [127]: data
    Out[127]: 
    array([(1., 2., 0), (3., 4., 1)],
          dtype=[('feature_1', '<f8'), ('feature_2', '<f8'), ('result', '<i8')])
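
    Instead of inspecting the dtype for offsets or mutating values by hand, the view-versus-copy question can be checked directly. The following is only a sketch: is_independent_copy is a hypothetical helper name, and it relies on np.shares_memory, which I believe handles structured arrays like any other array.

```python
import numpy as np

def is_independent_copy(sub, parent):
    """Hypothetical helper: True if `sub` shares no memory with `parent`."""
    return not np.shares_memory(sub, parent)

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])

single_field = data['feature_1']  # single-field indexing returns a view
full_copy = data.copy()           # an explicit copy owns its own buffer

view_is_independent = is_independent_copy(single_field, data)
copy_is_independent = is_independent_copy(full_copy, data)
```

    Applying the same helper to data[['feature_1', 'feature_2']] shows which behavior the installed NumPy version gives for multifield selection.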
    

    But without delving into the development discussion, I can't say what the long-term solution will be. Ideally there should be a way to fetch both a view, which maintains a link to the original data buffer, and a copy, an array that is independent and freely modifiable.

    I suspect the copy version will follow the recfunctions practice of constructing a new array with the new dtype, and then copying the data field by field.

    In [132]: data.dtype.descr
    Out[132]: [('feature_1', '<f8'), ('feature_2', '<f8'), ('result', '<i8')]
    In [133]: dt = data.dtype.descr[:-1]
    In [134]: dt
    Out[134]: [('feature_1', '<f8'), ('feature_2', '<f8')]
    In [135]: arr = np.zeros(data.shape, dtype=dt)
    In [136]: arr
    Out[136]: 
    array([(0., 0.), (0., 0.)],
          dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
    In [137]: for name in arr.dtype.fields:
         ...:     arr[name] = data[name]
         ...:     
    In [138]: arr
    Out[138]: 
    array([(1., 2.), (3., 4.)],
          dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
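
    The session above can be wrapped into a small reusable function. This is a minimal sketch, assuming a flat (non-nested) dtype; copy_fields is a hypothetical name, not a NumPy function. It allocates only the selected fields, so the unselected columns are never copied.

```python
import numpy as np

def copy_fields(arr, names):
    """Hypothetical helper: packed, independent copy of the named fields.

    Assumes a flat (non-nested) structured dtype.
    """
    # Build a dtype holding only the requested fields, in the given order
    new_dtype = np.dtype([(name, arr.dtype[name]) for name in names])
    out = np.empty(arr.shape, dtype=new_dtype)
    # Copy one column at a time; each assignment is a vectorized operation
    for name in names:
        out[name] = arr[name]
    return out

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])
data = data.view(np.recarray)

features = copy_fields(data, ['feature_1', 'feature_2']).view(np.recarray)
features['feature_1'] = [4, 5]  # modifies the copy only
```

    Since it never uses multifield indexing, this should behave the same regardless of how that feature changes between releases.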
    

    Or use another recfunctions function, drop_fields (numpy.lib.recfunctions is imported as rf in the session further down):

    In [159]: rf.drop_fields(data, 'result')
    Out[159]: 
    array([(1., 2.), (3., 4.)],
          dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
    

    recfunctions has code that can copy complex dtypes, such as ones with nested dtypes. But for a simple one-layer dtype like this, plain iteration over the field names is enough.

    In general, structured arrays (and recarray) have many records, and a limited number of fields. So copying fields by name is relatively efficient.

    In [150]: import numpy.lib.recfunctions as rf
    In [154]: arr = np.zeros(data.shape, dtype=dt)
    In [155]: rf.recursive_fill_fields(data, arr)
    Out[155]: 
    array([(1., 2.), (3., 4.)],
          dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
    

    Note that the code of drop_fields itself ends with:

    output = np.empty(base.shape, dtype=newdtype)
    output = recursive_fill_fields(base, output)
    

    Development notes at some point alluded to a recfunctions.compress_fields function, but that apparently was never actually added.
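
    For readers on later releases: as far as I know, NumPy 1.16 added numpy.lib.recfunctions.repack_fields, which appears to fill this role. The sketch below assumes NumPy 1.16 or later, where multifield indexing returns a view that keeps the original offsets and itemsize.

```python
import numpy as np
import numpy.lib.recfunctions as rf

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])

# On NumPy >= 1.16 this is a view that still spans the full original records
selection = data[['feature_1', 'feature_2']]

# repack_fields converts to a packed dtype; because the dtype changes here,
# the result is a new array that no longer references `data`
features = rf.repack_fields(selection)

features['feature_1'] = [4, 5]  # leaves `data` untouched
```

    One caveat: if the input dtype is already packed, repack_fields may hand back the input without copying, so append .copy() when independence must be guaranteed in every case.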