pythonpandasdataframenumpy

Improve code that finds nan values with a condition and removes them


I have a dataframe where each column starts and finished with certain number of nan values. Somewhere in the middle of a column there is a continuous list of values. It can happen that a nan value "interrupts" the data. I want to iterate over each column, find such values and then remove the whole row.

For example, I want to find the np.nan between 9 and 13 and remove it:

[np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan]

Conditions for removal:

  1. if value has at least one data point before
  2. if value has at least one data point after
  3. if value is nan

I wrote code that does this already, but it's slow and kind of wordy.

import pandas as pd
import numpy as np


data = {'A': [np.nan, np.nan, np.nan, 1, 4, 6, 6, 9, np.nan, 13, np.nan, np.nan], 'B': [np.nan, np.nan, np.nan, 11, 3, 16, 13, np.nan, np.nan, 12, np.nan, np.nan]}
df = pd.DataFrame(data)

def get_nans(column):
    output = []
    for index_to_check, value in column.items():
        has_value_before = not column[:index_to_check].isnull().all()
        has_value_after = not column[index_to_check + 1:].isnull().all()
        is_nan = np.isnan(value)
        output.append(not( has_value_before and has_value_after and is_nan))
    return output

for column in df.columns:
    df = df[get_nans(df[column])]
    
print(df)

How can I improve my code, vectorize it etc?


Solution

  • You could use a vectorial approach with isna and cummin to perform boolean indexing.

    First let's use one column as example:

    # identify NaNs
    m1 = df['A'].isna()
    # Identify external NaNs
    m2 = (m1.cummin()|m1[::-1].cummin())
    
    out= df.loc[m2 | ~m1, 'A']
    

    Output:

    0      NaN
    1      NaN
    2      NaN
    3      1.0
    4      4.0
    5      6.0
    6      6.0
    7      9.0
    9     13.0
    10     NaN
    11     NaN
    Name: A, dtype: float64
    

    Then you can vectorize to the whole DataFrame and aggregate with all:

    m1 = df.isna()
    m2 = (m1.cummin()|m1[::-1].cummin())
    
    out= df.loc[(m2 | ~m1).all(axis=1)]
    

    Output:

           A     B
    0    NaN   NaN
    1    NaN   NaN
    2    NaN   NaN
    3    1.0  11.0
    4    4.0   3.0
    5    6.0  16.0
    6    6.0  13.0
    9   13.0  12.0
    10   NaN   NaN
    11   NaN   NaN
    

    Another option would be to leverage interpolate with limit_area='inside':

    # is the cell not NaN?
    m1 = df.notna()
    # is the cell an external NaN?
    m2 = df.interpolate(limit_area='inside').isna()
    
    out = df[(m1|m2).all(axis=1)]