pythonpandasdataframe

How can I find the column containing the third NaN value in each row of a DataFrame in pandas?


I have been given a problem to solve for my assignment. Here's the description for my problem:

In the cell below, you have a DataFrame df that consists of 10 columns of floating-point numbers. Exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the column which contains the third NaN value.

You should return a Series of column labels: e, c, d, h, d

nan = np.nan

data = [[0.04,  nan,  nan, 0.25,  nan, 0.43, 0.71, 0.51,  nan,  nan],
        [ nan,  nan,  nan, 0.04, 0.76,  nan,  nan, 0.67, 0.76, 0.16],
        [ nan,  nan, 0.5 ,  nan, 0.31, 0.4 ,  nan,  nan, 0.24, 0.01],
        [0.49,  nan,  nan, 0.62, 0.73, 0.26, 0.85,  nan,  nan,  nan],
        [ nan,  nan, 0.41,  nan, 0.05,  nan, 0.61,  nan, 0.48, 0.68]]

columns = list('abcdefghij')

df = pd.DataFrame(data, columns=columns)

# write a solution to the question here

This is my solution:

result = df.apply(lambda row: row.isna().idxmax(), axis=1)
print(result)

My code returns b, a, a, a, b, a, whereas the desired output is e, c, d, h, d. My solution returns the index of the first NaN but according to the question I want the third NaN in each row. How can i do that? Or is there an alternate solution that gives the desired output.


Solution

  • idxmax was a good approach, you can combine this with a mask that indicates the 3rd NaN, for this use cumsum:

    m = df.isna()
    
    out = (m & m.cumsum(axis=1).eq(3)).idxmax(axis=1)
    

    Output:

    0    e
    1    c
    2    d
    3    h
    4    d
    dtype: object
    

    Intermediates:

    # m
           a     b      c      d      e      f      g      h      i      j
    0  False  True   True  False   True  False  False  False   True   True
    1   True  True   True  False  False   True   True  False  False  False
    2   True  True  False   True  False  False   True   True  False  False
    3  False  True   True  False  False  False  False   True   True   True
    4   True  True  False   True  False   True  False   True  False  False
    
    # m.cumsum(axis=1)
       a  b  c  d  e  f  g  h  i  j
    0  0  1  2  2  3  3  3  3  4  5
    1  1  2  3  3  3  4  5  5  5  5
    2  1  2  2  3  3  3  4  5  5  5
    3  0  1  2  2  2  2  2  3  4  5
    4  1  2  2  3  3  4  4  5  5  5
    
    # m.cumsum(axis=1).eq(3)
           a      b      c      d      e      f      g      h      i      j
    0  False  False  False  False   True   True   True   True  False  False
    1  False  False   True   True   True  False  False  False  False  False
    2  False  False  False   True   True   True  False  False  False  False
    3  False  False  False  False  False  False  False   True  False  False
    4  False  False  False   True   True  False  False  False  False  False
    
    # m & m.cumsum(axis=1).eq(3)
           a      b      c      d      e      f      g      h      i      j
    0  False  False  False  False   True  False  False  False  False  False
    1  False  False   True  False  False  False  False  False  False  False
    2  False  False  False   True  False  False  False  False  False  False
    3  False  False  False  False  False  False  False   True  False  False
    4  False  False  False   True  False  False  False  False  False  False
    

    Alternatively, if you can't assume a specific number of NaNs in each row, using melt + groupby.nth:

    N = 3
    out = (df.melt(ignore_index=False)              # reshape to long
             .loc[lambda x: x['value'].isna()]      # only keep NaNs
             .groupby(level=0)['variable'].nth(N-1) # keep 3rd row per group
          )
    

    Output:

    1    c
    2    d
    4    d
    0    e
    3    h
    Name: variable, dtype: object