
In NumPy, what is the difference between calling MA.masked_where and MA.masked_array?


The masked_array class constructor and the masked_where function seem to do exactly the same thing: both construct a NumPy masked array from data and mask values. When would you use one rather than the other?

>>> import numpy as np
>>> import numpy.ma as MA

>>> vals = np.array([0,1,2,3,4,5])
>>> cond = vals > 3

>>> vals
array([0, 1, 2, 3, 4, 5])

>>> cond
array([False, False, False, False,  True,  True], dtype=bool)

>>> MA.masked_array(data=vals, mask=cond)
masked_array(data = [0 1 2 3 -- --],
             mask = [False False False False  True  True],
       fill_value = 999999)

>>> MA.masked_where(cond, vals)
masked_array(data = [0 1 2 3 -- --],
             mask = [False False False False  True  True],
       fill_value = 999999)

The optional argument copy to masked_where (its only documented optional argument) is also supported by masked_array, so I don't see any options that are unique to masked_where. Although the converse is not true (e.g. masked_where doesn't support dtype), I don't understand the purpose of masked_where as a separate function.


Solution

  • You comment:

    If I call them with inconsistently shaped value and masked arrays, I get the same error message in both cases.

    I don't think we can help you without more details on what's different.

    For example, if I try the most obvious inconsistency, a length mismatch, I get different error messages (here vals has 5 elements and cond has the same length):

    In [121]: np.ma.masked_array(vals, cond[:-1])
    MaskError: Mask and data not compatible: data size is 5, mask size is 4.
    In [122]: np.ma.masked_where(cond[:-1], vals)
    IndexError: Inconsistent shape between the condition and the input (got (4,) and (5,))
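
    To reproduce that comparison as a script, something like the following should work. It is only a sketch: the 5-element vals and the even/odd cond are my assumptions (chosen to match the sizes in the messages above), and error_name is a throwaway helper, not part of numpy.

```python
import numpy as np

# Assumed inputs: a 5-element array and a boolean condition of the same length.
vals = np.arange(5)
cond = vals % 2 == 0

def error_name(build):
    """Call build() and return the name of the exception it raises, or None."""
    try:
        build()
    except Exception as exc:
        return type(exc).__name__
    return None

# The mask/condition is one element too short in both calls.
ctor_error = error_name(lambda: np.ma.masked_array(vals, cond[:-1]))
where_error = error_name(lambda: np.ma.masked_where(cond[:-1], vals))

print("masked_array:", ctor_error)
print("masked_where:", where_error)
```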
    

    The test that produces the masked_where message is evident from the code that Corralien shows.

    The MaskedArray class definition (masked_array is an alias for it) has this test:

            # Make sure the mask and the data have the same shape
            if mask.shape != _data.shape:
                (nd, nm) = (_data.size, mask.size)
                if nm == 1:
                    mask = np.resize(mask, _data.shape)
                elif nm == nd:
                    mask = np.reshape(mask, _data.shape)
                else:
                    msg = "Mask and data not compatible: data size is %i, " + \
                          "mask size is %i."
                    raise MaskError(msg % (nd, nm))
    

    I'd expect the same message from both only if the shapes made it past the masked_where test but were then caught by the class's test. If so, that should be obvious in the full error traceback.

    Here's an example that fails in masked_where but passes in the base class.

    In [138]: np.ma.masked_where(cond[:,None],vals)
    IndexError: Inconsistent shape between the condition and the input (got (5, 1) and (5,))
    In [139]: np.ma.masked_array(vals, cond[:,None])
    Out[139]: 
    masked_array(data=[--, 1, --, 3, --],
                 mask=[ True, False,  True, False,  True],
           fill_value=999999)
    

    The base class can handle a cond that differs in shape but matches in size (total number of elements): it reshapes the mask to fit the data. A scalar cond passes both, though the exact test differs.
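
    A small check of both cases, again with an assumed 5-element vals and an even/odd cond. It is a sketch of the behaviour as I read the code, not a guarantee for every numpy version.

```python
import numpy as np

vals = np.arange(5)      # assumed data
cond = vals % 2 == 0     # assumed condition, shape (5,)

# Same size, different shape: the constructor reshapes the (5, 1) mask to (5,).
reshaped = np.ma.masked_array(vals, cond[:, None])
mask_matches = np.array_equal(np.ma.getmaskarray(reshaped), cond)

# Scalar condition: accepted by both, and every element ends up masked.
all_by_ctor = np.ma.masked_array(vals, True)
all_by_where = np.ma.masked_where(True, vals)

print(reshaped.shape, mask_matches)
print(np.ma.getmaskarray(all_by_ctor).all(),
      np.ma.getmaskarray(all_by_where).all())
```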

    Based on my reading of the code, I can't conceive of an input that passes masked_where but fails in the base class.

    All the masked array code is readable Python (see the link in the other answer). While there is one base class definition, there are a number of constructor or helper functions, as the masked_where docs make clear. I wouldn't worry too much about which function(s) to use, especially if you aren't trying to push the boundaries of what's logical.
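
    To illustrate the "several helpers, one class" point, here is a sketch using the vals from the question. masked_greater is one of the other helpers in numpy.ma; picking it for the comparison is my choice, not something the question asked about.

```python
import numpy as np

vals = np.array([0, 1, 2, 3, 4, 5])

# Three routes to the same masked array: the constructor with an explicit
# mask, masked_where with a condition, and the masked_greater shortcut.
by_ctor = np.ma.masked_array(vals, mask=vals > 3)
by_where = np.ma.masked_where(vals > 3, vals)
by_greater = np.ma.masked_greater(vals, 3)

same_mask = (np.array_equal(by_ctor.mask, by_where.mask)
             and np.array_equal(by_where.mask, by_greater.mask))
print(same_mask, by_greater.compressed())
```

    For ordinary, consistently shaped inputs, the choice comes down to which spelling reads best.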

    Masked arrays, while a part of NumPy for a long time, do not get a whole lot of use, at least judging by the relative lack of SO questions. I suspect pandas has largely replaced them for data that can have missing values (e.g. time series).