python pandas

How does thresh work in dropna for a pandas DataFrame?


import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15).reshape(5, 3))
df1.iloc[:4, 1] = np.nan
df1.iloc[:2, 2] = np.nan
df1.dropna(thresh=1, axis=1)

It seems that no column containing NaN values has been dropped:

    0     1     2
0   0   NaN   NaN
1   3   NaN   NaN
2   6   NaN   8.0
3   9   NaN  11.0
4  12  13.0  14.0

If I run

df1.dropna(thresh=2, axis=1)

why does it give the following?

    0     2
0   0   NaN
1   3   NaN
2   6   8.0
3   9  11.0
4  12  14.0

I just don't understand what thresh is doing here. If a column has more than one NaN value, shouldn't the column be dropped?


Solution

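  • thresh=N requires that a column has at least N non-NaN values to survive; it counts the values that are present, not the NaNs. Here column 0 has five non-NaN values, column 1 has one, and column 2 has three. In the first example (thresh=1), every column has at least one non-NaN, so all three survive. In the second example (thresh=2), column 1 has only one non-NaN, so it is dropped, while columns 0 and 2 survive.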

    Try setting thresh to 4 to get a better sense of what's happening.
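
To make the counting rule concrete, here is a minimal sketch that rebuilds the frame from the question, counts the non-NaN values per column, and records which columns survive for a few thresholds. The names non_nan_counts and surviving are only illustrative, and the frame is created with a float dtype (an assumption that differs slightly from the question) so the example does not depend on how pandas upcasts an integer column when NaN is assigned.

```python
import numpy as np
import pandas as pd

# Same layout as the question, but float from the start (assumption, see above).
df1 = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3))
df1.iloc[:4, 1] = np.nan   # column 1 keeps a single non-NaN value (row 4)
df1.iloc[:2, 2] = np.nan   # column 2 keeps three non-NaN values (rows 2 to 4)

# Number of non-NaN values in each column; this is what thresh is compared to.
non_nan_counts = {col: int(n) for col, n in df1.notna().sum(axis=0).items()}

# Column labels that survive dropna for each threshold.
surviving = {
    t: df1.dropna(thresh=t, axis=1).columns.tolist()
    for t in (1, 2, 4)
}

print(non_nan_counts)
print(surviving)
```

With thresh=4, only column 0 should remain, since column 2 has three non-NaN values and column 1 has one.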