I have a dataframe where each column starts and finishes with a certain number of NaN values. Somewhere in the middle of each column there is a continuous run of values. It can happen that a NaN value "interrupts" the data. I want to iterate over each column, find such values, and then remove the whole row.
For example, I want to find the np.nan between 9 and 13 and remove it:
[np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan]
Condition for removal: the value is NaN and there are non-NaN values both before and after it in the column.
I wrote code that does this already, but it's slow and kind of wordy.
import pandas as pd
import numpy as np
data = {'A': [np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan], 'B': [np.nan, np.nan, np.nan, 11, 3, 16, 13, np.nan, np.nan, 12, np.nan, np.nan]}
df = pd.DataFrame(data)
def get_nans(column):
    output = []
    for index_to_check, value in column.items():
        has_value_before = not column[:index_to_check].isnull().all()
        has_value_after = not column[index_to_check + 1:].isnull().all()
        is_nan = np.isnan(value)
        output.append(not (has_value_before and has_value_after and is_nan))
    return output

for column in df.columns:
    df = df[get_nans(df[column])]
print(df)
How can I improve my code, vectorize it etc?
You could use a vectorized approach with isna and cummin to perform boolean indexing.
First, let's use one column as an example:
# identify NaNs
m1 = df['A'].isna()
# identify external NaNs (the leading/trailing runs)
m2 = m1.cummin() | m1[::-1].cummin()
out = df.loc[m2 | ~m1, 'A']
Output:
0 NaN
1 NaN
2 NaN
3 1.0
4 4.0
5 6.0
6 6.0
7 9.0
9 13.0
10 NaN
11 NaN
Name: A, dtype: float64
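To see why this works: on a boolean Series, cummin stays True only while every value so far is True, so m1.cummin() flags the leading NaN run and the reversed pass flags the trailing run. A minimal sketch on a made-up Series (not from the question):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0, np.nan, 2.0, np.nan])
m1 = s.isna()

# leading NaN run only
print(m1.cummin().tolist())
# leading + trailing NaN runs (the OR aligns on the index)
print((m1.cummin() | m1[::-1].cummin()).tolist())
```

The inner NaN at position 3 is the only one left unflagged, which is exactly the row we want to drop.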
Then you can extend this to the whole DataFrame and aggregate with all:
m1 = df.isna()
m2 = m1.cummin() | m1[::-1].cummin()
out = df.loc[(m2 | ~m1).all(axis=1)]
Output:
A B
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 1.0 11.0
4 4.0 3.0
5 6.0 16.0
6 6.0 13.0
9 13.0 12.0
10 NaN NaN
11 NaN NaN
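As a quick sanity check (my addition, using the example data from the question), the surviving index should skip only the rows that contain an internal NaN in some column:

```python
import numpy as np
import pandas as pd

# example data from the question
data = {'A': [np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan],
        'B': [np.nan, np.nan, np.nan, 11, 3, 16, 13, np.nan, np.nan, 12, np.nan, np.nan]}
df = pd.DataFrame(data)

m1 = df.isna()
m2 = m1.cummin() | m1[::-1].cummin()
out = df.loc[(m2 | ~m1).all(axis=1)]

# rows 7 and 8 are dropped: each contains an internal NaN in column B
print(out.index.tolist())
```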
Another option would be to leverage interpolate with limit_area='inside':
# is the cell not NaN?
m1 = df.notna()
# is the cell an external NaN?
m2 = df.interpolate(limit_area='inside').isna()
out = df[(m1|m2).all(axis=1)]
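With limit_area='inside', interpolate only fills NaNs that sit between valid values, so any cell that is still NaN afterwards must be an external NaN. On the question's example data this selects the same rows as the cummin approach; a small check (my addition):

```python
import numpy as np
import pandas as pd

data = {'A': [np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan],
        'B': [np.nan, np.nan, np.nan, 11, 3, 16, 13, np.nan, np.nan, 12, np.nan, np.nan]}
df = pd.DataFrame(data)

m1 = df.notna()
# only internal NaNs get filled, so remaining NaNs are external
m2 = df.interpolate(limit_area='inside').isna()
out = df[(m1 | m2).all(axis=1)]

print(out.index.tolist())
```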