I have a huge dataframe but sharing only sample below. Its a CSV with sample header column names as shown below.
sample.csv
cnum,sup1,sup2,sup3,sup4
285414459,1,0,1,1
445633709,1,0,0,0
556714736,0,0,1,0
1089852074,0,1,0,1
A cnum can have 0 or 1 set in all sup* columns. I want to select and print the column name where first 1 is encountered for that cnum. All other 1 after that should be ignored and no column name should be printed in output.
expected output:
cnum,supcol
285414459,sup1
445633709,sup1
556714736,sup3
1089852074,sup2
Currently I tried this code:
import pandas as pd
df=pd.read_csv('sample.csv')
df_union=pd.DataFrame(columns=['cnum','supcol'])
for col in df.columns:
df1=df.filter(['cnum']).loc[df[col] == 1]
df1['supcol']=col
df_union=df_union.append(df1)
print(df_union)
However it is printing all column names where 1 is set for the column name. I want only the first one. Kindly help
It seems like you can use idxmax
here:
df.set_index('cnum').idxmax(axis=1).reset_index(drop=True)
0 sup1
1 sup1
2 sup3
3 sup2
dtype: object
df['output'] = df.set_index('cnum').idxmax(axis=1).reset_index(drop=True)
# Slightly faster,
# df['output'] = df.set_index('cnum').idxmax(axis=1).to_numpy()
df
cnum sup1 sup2 sup3 sup4 output
0 285414459 1 0 1 1 sup1
1 445633709 1 0 0 0 sup1
2 556714736 0 0 1 0 sup3
3 1089852074 0 1 0 1 sup2
Another option with dot
(will give you all non-zero columns):
d = df.set_index('cnum')
d.dot(d.columns + ',').str.rstrip(',').reset_index(drop=True)
0 sup1,sup3,sup4
1 sup1
2 sup3
3 sup2,sup4
dtype: object
Or,
(d.dot(d.columns + ',')
.str.rstrip(',')
.str.split(',', 1).str[0]
.reset_index(drop=True))
0 sup1
1 sup1
2 sup3
3 sup2
dtype: object