I have been given a problem to solve for my assignment. Here's the description for my problem:
In the cell below, you have a DataFrame df that consists of 10 columns of floating-point numbers. Exactly 5 entries in each row are NaN values. For each row of the DataFrame, find the column which contains the third NaN value.
You should return a Series of column labels: e, c, d, h, d
nan = np.nan
data = [[0.04, nan, nan, 0.25, nan, 0.43, 0.71, 0.51, nan, nan],
[ nan, nan, nan, 0.04, 0.76, nan, nan, 0.67, 0.76, 0.16],
[ nan, nan, 0.5 , nan, 0.31, 0.4 , nan, nan, 0.24, 0.01],
[0.49, nan, nan, 0.62, 0.73, 0.26, 0.85, nan, nan, nan],
[ nan, nan, 0.41, nan, 0.05, nan, 0.61, nan, 0.48, 0.68]]
columns = list('abcdefghij')
df = pd.DataFrame(data, columns=columns)
# write a solution to the question here
This is my solution:
result = df.apply(lambda row: row.isna().idxmax(), axis=1)
print(result)
My code returns b, a, a, a, b, a
, whereas the desired output is e, c, d, h, d
. My solution returns the index of the first NaN but according to the question I want the third NaN in each row. How can i do that? Or is there an alternate solution that gives the desired output.
idxmax
was a good approach, you can combine this with a mask that indicates the 3rd NaN, for this use cumsum
:
m = df.isna()
out = (m & m.cumsum(axis=1).eq(3)).idxmax(axis=1)
Output:
0 e
1 c
2 d
3 h
4 d
dtype: object
Intermediates:
# m
a b c d e f g h i j
0 False True True False True False False False True True
1 True True True False False True True False False False
2 True True False True False False True True False False
3 False True True False False False False True True True
4 True True False True False True False True False False
# m.cumsum(axis=1)
a b c d e f g h i j
0 0 1 2 2 3 3 3 3 4 5
1 1 2 3 3 3 4 5 5 5 5
2 1 2 2 3 3 3 4 5 5 5
3 0 1 2 2 2 2 2 3 4 5
4 1 2 2 3 3 4 4 5 5 5
# m.cumsum(axis=1).eq(3)
a b c d e f g h i j
0 False False False False True True True True False False
1 False False True True True False False False False False
2 False False False True True True False False False False
3 False False False False False False False True False False
4 False False False True True False False False False False
# m & m.cumsum(axis=1).eq(3)
a b c d e f g h i j
0 False False False False True False False False False False
1 False False True False False False False False False False
2 False False False True False False False False False False
3 False False False False False False False True False False
4 False False False True False False False False False False
Alternatively, if you can't assume a specific number of NaNs in each row, using melt
+ groupby.nth
:
N = 3
out = (df.melt(ignore_index=False) # reshape to long
.loc[lambda x: x['value'].isna()] # only keep NaNs
.groupby(level=0)['variable'].nth(N-1) # keep 3rd row per group
)
Output:
1 c
2 d
4 d
0 e
3 h
Name: variable, dtype: object