
Remove all rows in a pandas DataFrame with N or more consecutive NaNs


This is a corollary to this question: replace values in pandas column when N number of NaNs exist in another column

         a         b         c     d           e
2018-05-25  0.000381  0.264318     land    2018-05-25
2018-05-26  0.000000  0.264447     land    2018-05-26
2018-05-27  0.000000  0.264791     NaN           NaT
2018-05-28  0.000000  0.265253     NaN           NaT
2018-05-29  0.000000  0.265720     NaN           NaT
2018-05-30  0.000000  0.266066     land    2018-05-30
2018-05-31  0.000000  0.266150     NaN           NaT
2018-06-01  0.000000  0.265816     NaN           NaT
2018-06-02  0.000000  0.264892     land    2018-06-02
2018-06-03  0.000000  0.263191     NaN           NaT
2018-06-04  0.000000  0.260508     land    2018-06-04
2018-06-05  0.000000  0.256619     NaN           NaT
2018-06-06  0.000000  0.251286     NaN           NaT
2018-06-07  0.000000  0.244250     NaN           NaT
2018-06-08  0.000000  0.235231     NaN           NaT
2018-06-09  0.000000  0.223932     land    2018-06-09
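
For reference, the sample frame above can be rebuilt with something like the sketch below (assuming a is a regular column rather than the index, and e holds datetimes):

    import numpy as np
    import pandas as pd

    d = ['land', 'land', np.nan, np.nan, np.nan, 'land', np.nan, np.nan,
         'land', np.nan, 'land', np.nan, np.nan, np.nan, np.nan, 'land']
    dates = pd.date_range('2018-05-25', periods=16)
    df = pd.DataFrame({'a': dates,
                       'b': [0.000381] + [0.0] * 15,
                       'c': [0.264318, 0.264447, 0.264791, 0.265253, 0.265720,
                             0.266066, 0.266150, 0.265816, 0.264892, 0.263191,
                             0.260508, 0.256619, 0.251286, 0.244250, 0.235231,
                             0.223932],
                       'd': d,
                       # e mirrors a where d is present, NaT otherwise
                       'e': [dt if v == 'land' else pd.NaT
                             for dt, v in zip(dates, d)]})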

I want to remove all rows where column d (the 4th column) contains a run of 3 or more consecutive NaNs. The output should be:

         a         b         c     d           e
2018-05-25  0.000381  0.264318     land    2018-05-25
2018-05-26  0.000000  0.264447     land    2018-05-26
2018-05-30  0.000000  0.266066     land    2018-05-30
2018-05-31  0.000000  0.266150     NaN           NaT
2018-06-01  0.000000  0.265816     NaN           NaT
2018-06-02  0.000000  0.264892     land    2018-06-02
2018-06-03  0.000000  0.263191     NaN           NaT
2018-06-04  0.000000  0.260508     land    2018-06-04
2018-06-09  0.000000  0.223932     land    2018-06-09

From that question, I tried this:

    import numpy as np

    threshold = 3
    mask = df.d.notna()
    df.loc[(~mask).groupby(mask.cumsum()).transform('cumsum') < threshold, 'c'] = np.nan
    df = df[np.isfinite(df['c'])]

but it does not produce the expected output.


Solution

  • Create a helper Series a that labels each consecutive run in d (the label increments whenever notna flips), get each run's length with transform('size'), and finally filter with boolean indexing, keeping rows whose run is shorter than 3 or whose d is not NaN:

    # True where d has a value, False where it is NaN
    mask = df.d.notna()
    # label consecutive runs: the label changes whenever mask flips
    a = mask.ne(mask.shift()).cumsum()

    # keep rows in runs shorter than 3, plus every non-NaN row
    df = df[(a.groupby(a).transform('size') < 3) | mask]
    print (df)
                 a         b         c     d           e
    0   2018-05-25  0.000381  0.264318  land  2018-05-25
    1   2018-05-26  0.000000  0.264447  land  2018-05-26
    5   2018-05-30  0.000000  0.266066  land  2018-05-30
    6   2018-05-31  0.000000  0.266150   NaN         NaT
    7   2018-06-01  0.000000  0.265816   NaN         NaT
    8   2018-06-02  0.000000  0.264892  land  2018-06-02
    9   2018-06-03  0.000000  0.263191   NaN         NaT
    10  2018-06-04  0.000000  0.260508  land  2018-06-04
    15  2018-06-09  0.000000  0.223932  land  2018-06-09
    

    Details:

    print (a)
    0     1
    1     1
    2     2
    3     2
    4     2
    5     3
    6     4
    7     4
    8     5
    9     6
    10    7
    11    8
    12    8
    13    8
    14    8
    15    9
    Name: d, dtype: int32
    

    print (a.groupby(a).transform('size'))
    0     2
    1     2
    2     3
    3     3
    4     3
    5     1
    6     2
    7     2
    8     1
    9     1
    10    1
    11    4
    12    4
    13    4
    14    4
    15    1
    Name: d, dtype: int64
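
If the threshold needs to vary (the N from the title), the same idea can be wrapped in a small helper; a minimal sketch, with drop_consecutive_nans being an illustrative name rather than anything from pandas:

    def drop_consecutive_nans(df, col, n):
        # label consecutive runs of NaN / non-NaN values in col
        mask = df[col].notna()
        a = mask.ne(mask.shift()).cumsum()
        # keep every non-NaN row and any NaN run shorter than n
        return df[(a.groupby(a).transform('size') < n) | mask]

    df = drop_consecutive_nans(df, 'd', 3)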