It is a pretty common occurrence to have leading and trailing NaN values in a table or DataFrame. This is particularly true after joins and in time series data.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [np.nan, 2, np.nan, 4, 5, np.nan, np.nan],
})
df1
Out[0]:
a b
0 1 NaN
1 2 2.0
2 3 NaN
3 4 4.0
4 5 5.0
5 6 NaN
6 7 NaN
Let's remove these with dropna.
df1.dropna()
Out[1]:
a b
1 2 2.0
3 4 4.0
4 5 5.0
Oh no! We lost all of the rows that had missing values in column b, but I want to keep the middle (inner) ones.
How do I strip the rows with leading and trailing NaN values in a quick, clean and efficient way? The result should look as follows:
df1.stripna()
# obviously I'm not asking you to create a new pandas method...
# I just thought it was a good name.
Out[3]:
a b
1 2 2.0
2 3 NaN
3 4 4.0
4 5 5.0
Some of the answers so far are pretty nice, but I think this is important enough functionality that I have raised a feature request with pandas here, if anyone is interested. Let's see how it goes!
Another way that might be a little more readable is to use pd.Series.first_valid_index and pd.Series.last_valid_index with index slicing via loc:
df1.loc[df1['b'].first_valid_index():df1['b'].last_valid_index()]
Output:
a b
1 2 2.0
2 3 NaN
3 4 4.0
4 5 5.0
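If you want this as a reusable, stripna-style helper (a hypothetical name borrowed from the question, not a pandas method), the same trick can be wrapped in a small function. A minimal sketch, assuming you strip based on a single column:

```python
import numpy as np
import pandas as pd

def stripna(df, col):
    # Hypothetical helper: keep only the rows between the first and
    # last non-NaN value of `col`; inner NaNs are preserved.
    # loc slicing is label-based and inclusive on both ends.
    return df.loc[df[col].first_valid_index():df[col].last_valid_index()]

df1 = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [np.nan, 2, np.nan, 4, 5, np.nan, np.nan],
})
print(stripna(df1, 'b'))
```

Note that if the column is entirely NaN, both index lookups return None and the slice degrades to returning the whole frame, so you may want an explicit guard for that case.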
And this should be really fast. Using @LittleBobbyTables' input DataFrame:
%timeit df1.loc[df1['b'].ffill().notna() & df1['b'].bfill().notna()]
24.2 ms ± 610 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
versus:
%timeit df1.loc[df1['b'].first_valid_index():df1['b'].last_valid_index()]
1.43 ms ± 34.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
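For reference, here is a minimal, runnable sketch of the mask-based approach used in the first timing: ffill leaves leading NaNs as NaN, bfill leaves trailing NaNs as NaN, so AND-ing the two notna masks strips both ends while keeping inner NaNs.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7],
    'b': [np.nan, 2, np.nan, 4, 5, np.nan, np.nan],
})

# Leading NaNs survive ffill; trailing NaNs survive bfill.
# A row is kept only if it is neither a leading nor a trailing NaN.
mask = df1['b'].ffill().notna() & df1['b'].bfill().notna()
print(df1.loc[mask])
```

Both approaches give the same result on this frame; the boolean mask scans the whole column twice, which is why the index-slicing version comes out faster in the timings above.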