Here is my input:
data = np.array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.NaN, 'c2')],
                dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')]).view(np.recarray)
I want this as the output:
rec.array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1')],
          dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
I have tried:
data[data['B'] != np.NaN].view(np.recarray)
but it doesn't work.
data[data['A'] != 'a2'].view(np.recarray)
gives the desired output.
Why does this method not work for np.NaN? How do I remove rows containing np.NaN values from recarrays of object dtype? Also, ~np.isnan() doesn't work with the object dtype.
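First, why the != test fails: NaN compares unequal to everything, including itself, so the mask data['B'] != np.NaN is True for every element and no rows are removed. A minimal demonstration:

```python
import numpy as np

# NaN is unequal to everything, including itself, so a != test
# against np.nan selects every element instead of filtering any out.
print(np.nan == np.nan)   # False
print(np.nan != np.nan)   # True

b = np.array(['b1', 'b1', np.nan], dtype=object)
print(b != np.nan)        # all True -- the NaN entry is not dropped
```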
Define a function that applies np.isnan but does not choke on a string:
def foo(item):
    try:
        return np.isnan(item)
    except TypeError:
        return False
Then use np.vectorize to make a function that applies this to the elements of an array and returns a boolean array:
f = np.vectorize(foo, otypes=[bool])
With your data:
In [240]: data = np.array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.NaN, 'c2')], dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
In [241]: data
Out[241]:
array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', nan, 'c2')],
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
In [242]: data['B']
Out[242]: array(['b1', 'b1', nan], dtype=object)
In [243]: f(data['B'])
Out[243]: array([False, False, True], dtype=bool)
In [244]: data[~f(data['B'])]
Out[244]:
array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1')],
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
==============
The simplest way to apply this test-and-remove over all fields is to iterate over the field names:
In [429]: data # expanded with more nan
Out[429]:
array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', nan, 'c2'),
       ('a2', 'b1', nan), (nan, 'b1', 'c1')],
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
The f function, applied to each field and collected into an array:
In [441]: np.array([f(data[name]) for name in data.dtype.names])
Out[441]:
array([[False, False, False, False,  True],
       [False, False,  True, False, False],
       [False, False, False,  True, False]], dtype=bool)
Use np.any along axis 0 to flag the records where any field is True:
In [442]: np.any(_, axis=0)
Out[442]: array([False, False, True, True, True], dtype=bool)
In [443]: data[_] # the ones with nan
Out[443]:
array([('a2', nan, 'c2'), ('a2', 'b1', nan), (nan, 'b1', 'c1')],
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
In [444]: data[~__] # the ones without
Out[444]:
array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1')],
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
(In IPython, _ and __ hold the results shown in the previous Out lines.)
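Putting those steps together, here is a self-contained sketch of the iterate-over-fields filter (same foo and f as defined above):

```python
import numpy as np

def foo(item):
    # np.isnan raises TypeError on strings; treat those as "not NaN"
    try:
        return np.isnan(item)
    except TypeError:
        return False

f = np.vectorize(foo, otypes=[bool])

data = np.array(
    [('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2'),
     ('a2', 'b1', np.nan), (np.nan, 'b1', 'c1')],
    dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])

# One boolean row per field; a record is "bad" if any of its fields is NaN.
bad = np.any([f(data[name]) for name in data.dtype.names], axis=0)
print(data[~bad])   # only the first two records survive
```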
tolist converts the array into a list of tuples (the records of a structured array display as tuples):
In [448]: data.tolist()
Out[448]:
[('a2', 'b1', 'c1'),
 ('a1', 'b1', 'c1'),
 ('a2', nan, 'c2'),
 ('a2', 'b1', nan),
 (nan, 'b1', 'c1')]
f, as a vectorized function, can apply foo to each element of this nested list (apparently it first does np.array(data.tolist(), dtype=object)):
In [449]: f(data.tolist())
Out[449]:
array([[False, False, False],
       [False, False, False],
       [False,  True, False],
       [False, False,  True],
       [ True, False, False]], dtype=bool)
In [450]: np.any(_, axis=1)
Out[450]: array([False, False, True, True, True], dtype=bool)
I've never tried this combination of tolist and vectorize before. Vectorized functions iterate over their inputs, so they don't offer much of a speed advantage over explicit iteration, but for tasks like this they certainly simplify the coding.
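Condensed into a sketch, the tolist route looks like this (again with the foo/f pair from above):

```python
import numpy as np

def foo(item):
    try:
        return np.isnan(item)
    except TypeError:
        return False

f = np.vectorize(foo, otypes=[bool])

data = np.array(
    [('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2'),
     ('a2', 'b1', np.nan), (np.nan, 'b1', 'c1')],
    dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])

# tolist() gives a list of tuples; vectorize broadcasts foo over it,
# yielding a (records, fields) boolean array -- so reduce along axis 1.
mask = f(data.tolist())
print(data[~np.any(mask, axis=1)])
```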
Another possibility is to define foo to operate across the fields of a record. In fact, I discovered the tolist trick when I tried to apply f to a single record:
In [456]: f(data[2])
Out[456]: array(False, dtype=bool)
In [458]: f(list(data[2]))
Out[458]: array([False, True, False], dtype=bool)
In [459]: f(data[2].tolist())
Out[459]: array([False, True, False], dtype=bool)
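That record-level idea can be sketched with a plain Python helper (record_has_nan is a name made up for illustration); iterating over a structured record yields its field values, as list(data[2]) above shows:

```python
import numpy as np

def record_has_nan(rec):
    # Hypothetical helper: True if any field of the record is NaN.
    # Iterating a structured scalar yields its field values in order.
    for item in rec:
        try:
            if np.isnan(item):
                return True
        except TypeError:
            pass   # strings can't be NaN
    return False

data = np.array(
    [('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2')],
    dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])

bad = np.array([record_has_nan(rec) for rec in data])
print(data[~bad])   # drops the record containing nan
```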