pythonpandasdataframenan

Stripna: Pandas dropna but behaves like strip


It is a pretty common occurrence to have leading and trailing NaN values in a table or DataFrame. This is particularly true after joins and in timeseries data.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [np.NaN, 2, np.NaN, 4, 5, np.NaN, np.NaN],
})

Out[0]:
    a   b
0   1   NaN
1   2   2.0
2   3   NaN
3   4   4.0
4   5   5.0
5   6   NaN
6   7   NaN

Let's remove these with dropna.

df1.dropna()

Out[1]:
    a   b
1   2   2.0
3   4   4.0
4   5   5.0

Oh no!! We lost all the missing values that showed in column b. I want to keep the middle (inner) ones.

How do I strip the rows with leading and trailing NaN values in a quick, clean and efficient way? The results should look as follows:

df1.stripna() 
# obviously I'm not asking you to create a new pandas method...
# I just thought it was a good name.

Out[3]:
    a   b
1   2   2.0
2   3   NaN
3   4   4.0
4   5   5.0

Some of the answers so far are pretty nice but I think this is important enough functionality that I raised a feature request with Pandas here if anyone is interested. Let's see how it goes!


Solution

  • Another way, that might be a little more readable is using the pd.Series.first_valid_index and pd.Series.last_valid_index with index slicing using loc:

    df1.loc[df1['b'].first_valid_index():df1['b'].last_valid_index()]
    

    Output:

       a    b
    1  2  2.0
    2  3  NaN
    3  4  4.0
    4  5  5.0
    

    And, this should be really fast. Using @LittleBobbyTables input dataframe.

    %timeit df1.loc[df1['b'].ffill().notna()&df1['b'].bfill().notna()]
    

    24.2 ms ± 610 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    verus:

    %timeit df1.loc[df1['b'].first_valid_index():df1['b'].last_valid_index()]
    

    1.43 ms ± 34.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)