numpy masked-array

What is the best way to initialise a NumPy masked array with an existing mask?


I was expecting to just say something like

ma.zeros(my_shape, mask=my_mask, hard_mask=True)

(where the mask is the correct shape) but ma.zeros (or ma.ones or ma.empty) rather surprisingly doesn't recognise the mask argument. The simplest I've come up with is

ma.array(np.zeros(my_shape), mask=my_mask, hard_mask=True)

which seems to involve unnecessary copying of lots of zeros. Is there a better way?


Solution

  • Make a masked array:

    In [162]: x = np.arange(5); mask=np.array([1,0,0,1,0],bool)    
    In [163]: M = np.ma.MaskedArray(x,mask)
    
    In [164]: M
    Out[164]: 
    masked_array(data=[--, 1, 2, --, 4],
                 mask=[ True, False, False,  True, False],
           fill_value=999999)
    

    Modify x, and see the result in M:

    In [165]: x[-1] = 10
    
    In [166]: M
    Out[166]: 
    masked_array(data=[--, 1, 2, --, 10],
                 mask=[ True, False, False,  True, False],
           fill_value=999999)
    
    In [167]: M.data
    Out[167]: array([ 0,  1,  2,  3, 10])
    
    In [169]: M.data.base
    Out[169]: array([ 0,  1,  2,  3, 10])
    

    M.data is a view of the array that was used to create M, so no unnecessary copy is made.
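
    As a rough check of the same point, the sketch below rebuilds the x, mask and M from the session above and asks NumPy whether the buffers overlap. The names shares and wrote_through are only illustrative. It also writes through M to an unmasked element, which should show up in x if the memory really is shared.

```python
import numpy as np

# Same setup as the session above.
x = np.arange(5)
mask = np.array([1, 0, 0, 1, 0], dtype=bool)
M = np.ma.MaskedArray(x, mask)

# Ask NumPy directly whether M's data buffer overlaps x.
shares = np.shares_memory(M.data, x)

# Writing to an unmasked element of M goes through to x as well.
M[1] = 100
wrote_through = (x[1] == 100)

print(shares, wrote_through)
```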

    I haven't used functions like np.ma.zeros, but

    In [177]: np.ma.zeros
    Out[177]: <numpy.ma.core._convert2ma at 0x1d84a052af0>
    

    _convert2ma is a Python class that takes a function name and returns a new callable. It does not add mask-specific parameters such as mask. Read its source if you need the details.
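
    Since the wrapper does not take a mask, one possible workaround (a sketch, not necessarily the best way) is to create the array with np.ma.zeros and attach the mask afterwards. Here my_shape and my_mask are stand-ins for the names in the question, and the mask is hardened with harden_mask() after it has been set.

```python
import numpy as np

# Stand-ins for the shape and mask from the question.
my_shape = (2, 3)
my_mask = np.array([[True, False, False],
                    [False, True, False]])

# Create the masked array first, then attach the mask.
M = np.ma.zeros(my_shape)
M.mask = my_mask

# Make the mask hard so later assignments cannot unmask entries.
M.harden_mask()

print(M)
```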

    np.ma.MaskedArray, the class that actually subclasses ndarray, takes a copy parameter

    copy : bool, optional
            Whether to copy the input data (True), or to use a reference instead.
            Default is False.
    

    and the first line of its __new__ is

        _data = np.array(data, dtype=dtype, copy=copy,
                         order=order, subok=True, ndmin=ndmin)
    

    I haven't quite sorted out whether M._data is just a reference to the source data, or a view. In either case, it isn't a copy, unless you say so.
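
    One way to see the effect of the copy parameter is to build two masked arrays from the same source, one with the default and one with copy=True, and then modify the source. This is only a small sketch; the names shared and copied are made up for the comparison.

```python
import numpy as np

x = np.arange(5)
mask = np.array([1, 0, 0, 1, 0], dtype=bool)

shared = np.ma.MaskedArray(x, mask)             # default copy=False
copied = np.ma.MaskedArray(x, mask, copy=True)  # explicit copy of the data

# Change the source array after both masked arrays exist.
x[-1] = 10

# The default construction follows x; the explicit copy keeps the old value.
print(shared.data[-1], copied.data[-1])
print(np.shares_memory(shared.data, x), np.shares_memory(copied.data, x))
```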

    I haven't worked a lot with masked arrays, but my impression is that, while they can be convenient, they shouldn't be used where you are concerned about performance. There's a lot of extra work required to maintain both the mask and the data. The extra time involved in copying the data array, if any, will be minor.
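
    Applied to the expression from the question, the same reasoning suggests that ma.array(np.zeros(my_shape), mask=my_mask, hard_mask=True) allocates the zeros once and then wraps them without a further copy, since copy defaults to False. The sketch below checks that under the assumption of a small example shape; my_shape, my_mask and zeros are placeholder names.

```python
import numpy as np

# Placeholder shape and mask: mask out the first row.
my_shape = (3, 4)
my_mask = np.zeros(my_shape, dtype=bool)
my_mask[0, :] = True

zeros = np.zeros(my_shape)
M = np.ma.array(zeros, mask=my_mask, hard_mask=True)

# The masked array wraps the zeros buffer instead of copying it.
no_copy = np.shares_memory(M.data, zeros)

# With a hard mask, assigning to a masked entry does not unmask it.
M[0, 0] = 5.0
still_masked = bool(M.mask[0, 0])

print(no_copy, still_masked)
```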