numpy nan recarray

Removing rows with nan values in recarrays of object datatype


Here is my input:

data = np.array ( [ ( 'a2', 'b1', 'c1' ), ( 'a1', 'b1', 'c1' ), ( 'a2', np.NaN, 'c2' ) ], dtype = [ ( 'A', 'O' ), ( 'B', 'O' ), ( 'C', 'O' ) ] ) . view ( np.recarray)

I want this as the output:

rec.array ( [ ( 'a2', 'b1', 'c1' ), ( 'a1', 'b1', 'c1' ) ], dtype = [ ( 'A', 'O'), ( 'B', 'O' ), ( 'C', 'O' )  ] )

I have tried:

data [ data [ 'B' ] != np.NaN ] . view ( np.recarray )

but it doesn't work.

data [ data [ 'A' ] != 'a2' ] . view ( np.recarray )

gives the desired output.

Why is this method not working for np.NaN? How do I remove rows containing np.NaN values in recarrays of object datatype? Also, ~np.isnan() doesn't work with object datatype.


Solution

  • Define a function that applies np.isnan but does not choke on a string:

    def foo(item):
        try:
            return np.isnan(item)
        except TypeError:
            return False
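As background on why this helper is needed, here is a small sketch (the variable names are illustrative and not from the question). NaN compares unequal to every value, itself included, so a `!= np.nan` mask comes out True for every row. Separately, `np.isnan` raises `TypeError` when handed a string, which is what the `try`/`except` guards against.

```python
import numpy as np

# Same contents as data['B'] in the question.
col = np.array(['b1', 'b1', np.nan], dtype=object)

# NaN is never equal to anything, so this mask keeps every row.
mask = col != np.nan
print(mask)

# np.isnan has no implementation for strings and raises TypeError.
try:
    np.isnan('b1')
    raised = False
except TypeError:
    raised = True
print(raised)
```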
    

    Then use np.vectorize to make a function that applies this to each element of an array and returns a boolean array:

    f=np.vectorize(foo, otypes=[bool])
    

    With your data:

    In [240]: data = np.array ( [ ( 'a2', 'b1', 'c1' ), ( 'a1', 'b1', 'c1' ), ( 'a2' , np.NaN, 'c2' ) ], dtype = [ ( 'A', 'O' ), ( 'B', 'O' ), ( 'C', 'O' ) ] )
    In [241]: data
    Out[241]: 
    array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', nan, 'c2')], 
          dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
    In [242]: data['B']
    Out[242]: array(['b1', 'b1', nan], dtype=object)
    
    In [243]: f(data['B'])
    Out[243]: array([False, False,  True], dtype=bool)
    
    In [244]: data[~f(data['B'])]
    Out[244]: 
    array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1')], 
          dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
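For running outside an IPython session, the steps above can be collected into one script. This is a minimal sketch: the names `is_nan`, `is_nan_vec` and `clean` are my own, and `np.nan` is used because the `np.NaN` alias was removed in NumPy 2.0.

```python
import numpy as np

def is_nan(item):
    """Return True only for NaN; items np.isnan rejects (e.g. strings) count as not-NaN."""
    try:
        return bool(np.isnan(item))
    except TypeError:
        return False

# Element-wise version that returns a boolean array.
is_nan_vec = np.vectorize(is_nan, otypes=[bool])

data = np.array(
    [('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2')],
    dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')],
).view(np.recarray)

# Keep only the rows whose 'B' field is not NaN.
clean = data[~is_nan_vec(data['B'])]
print(clean)
```

Boolean indexing preserves the recarray subclass, so no further `.view(np.recarray)` should be needed on the result.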
    

    ==============

    The simplest way to perform this test and removal over all fields is to just iterate over the field names:

    In [429]: data    # expanded with more nan
    Out[429]: 
    array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', nan, 'c2'),
           ('a2', 'b1', nan), (nan, 'b1', 'c1')], 
          dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
    

    The f function applied to each field and collected into an array:

    In [441]: np.array([f(data[name]) for name in data.dtype.names])
    Out[441]: 
    array([[False, False, False, False,  True],
           [False, False,  True, False, False],
           [False, False, False,  True, False]], dtype=bool)
    

    Use any to get the columns where any item is True:

    In [442]: np.any(_, axis=0)
    Out[442]: array([False, False,  True,  True,  True], dtype=bool)
    In [443]: data[_]    # the ones with nan
    Out[443]: 
    array([('a2', nan, 'c2'), ('a2', 'b1', nan), (nan, 'b1', 'c1')], 
          dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
    In [444]: data[~__]   # the ones without
    Out[444]: 
    array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1')], 
          dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
    

    (In IPython, _ holds the most recent Out result and __ the one before it.)
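Since `_` and `__` only exist in an interactive session, a script needs named variables instead. The following is a sketch of the same all-fields test; `per_field`, `has_nan` and the other names are illustrative choices, not part of the original session.

```python
import numpy as np

def is_nan(item):
    try:
        return bool(np.isnan(item))
    except TypeError:
        return False

is_nan_vec = np.vectorize(is_nan, otypes=[bool])

data = np.array(
    [('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2'),
     ('a2', 'b1', np.nan), (np.nan, 'b1', 'c1')],
    dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')],
)

# One boolean row per field: shape (n_fields, n_records).
per_field = np.array([is_nan_vec(data[name]) for name in data.dtype.names])

# A record is flagged if any of its fields is NaN.
has_nan = per_field.any(axis=0)

with_nan = data[has_nan]
without_nan = data[~has_nan]
print(without_nan)
```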

    tolist converts the array into a list of tuples (the records of a structured array are displayed as tuples):

    In [448]: data.tolist()
    Out[448]: 
    [('a2', 'b1', 'c1'),
     ('a1', 'b1', 'c1'),
     ('a2', nan, 'c2'),
     ('a2', 'b1', nan),
     (nan, 'b1', 'c1')]
    

    f, being a vectorized function, can apply foo to each element of this list (apparently it first does np.array(data.tolist(), dtype=object)):

    In [449]: f(data.tolist())
    Out[449]: 
    array([[False, False, False],
           [False, False, False],
           [False,  True, False],
           [False, False,  True],
           [ True, False, False]], dtype=bool)
    In [450]: np.any(_, axis=1)
    Out[450]: array([False, False,  True,  True,  True], dtype=bool)
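The tolist variant shortens the all-fields filter to a single mask expression. This is a sketch under the same assumptions as before (helper names are mine). Note that the 2-D result has one row per record, so the reduction runs along axis 1 rather than axis 0.

```python
import numpy as np

def is_nan(item):
    try:
        return bool(np.isnan(item))
    except TypeError:
        return False

is_nan_vec = np.vectorize(is_nan, otypes=[bool])

data = np.array(
    [('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2'),
     ('a2', 'b1', np.nan), (np.nan, 'b1', 'c1')],
    dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')],
)

# The list of tuples becomes a (n_records, n_fields) object array inside vectorize.
elementwise = is_nan_vec(data.tolist())
rows_with_nan = elementwise.any(axis=1)

filtered = data[~rows_with_nan]
print(filtered)
```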
    

    I've never tried this combination of tolist and vectorize before. Vectorized functions iterate over their inputs, so they don't offer much of a speed advantage over explicit iterations, but for tasks like this it sure simplifies the coding.

    Another possibility is to define foo to operate across the fields of a record. In fact, I discovered the tolist trick when I tried to apply f to a single record:

    In [456]: f(data[2])
    Out[456]: array(False, dtype=bool)
    In [458]: f(list(data[2]))
    Out[458]: array([False,  True, False], dtype=bool)
    In [459]: f(data[2].tolist())
    Out[459]: array([False,  True, False], dtype=bool)
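One way the record-wise idea might look is sketched below. `record_has_nan` is a hypothetical helper, not something from the original answer. It converts each record to a tuple with tolist and tests the fields with a plain Python loop, so no vectorize is involved.

```python
import numpy as np

def is_nan(item):
    try:
        return bool(np.isnan(item))
    except TypeError:
        return False

def record_has_nan(record):
    """True if any field of a structured-array record is NaN."""
    return any(is_nan(value) for value in record.tolist())

data = np.array(
    [('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2'),
     ('a2', 'b1', np.nan), (np.nan, 'b1', 'c1')],
    dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')],
)

# Build the keep-mask one record at a time.
keep = np.array([not record_has_nan(rec) for rec in data], dtype=bool)
print(data[keep])
```

This iterates in Python just as the vectorized version does, so any difference is mainly one of readability rather than speed.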