pythonpandasdataframe

Get the column name for the first non-zero value in that row with pandas


I have a huge dataframe but sharing only sample below. Its a CSV with sample header column names as shown below.

sample.csv
cnum,sup1,sup2,sup3,sup4
285414459,1,0,1,1
445633709,1,0,0,0
556714736,0,0,1,0
1089852074,0,1,0,1

A cnum can have 0 or 1 set in all sup* columns. I want to select and print the column name where first 1 is encountered for that cnum. All other 1 after that should be ignored and no column name should be printed in output.

expected output:
cnum,supcol
285414459,sup1
445633709,sup1
556714736,sup3
1089852074,sup2

Currently I tried this code:

import pandas as pd
df=pd.read_csv('sample.csv')
df_union=pd.DataFrame(columns=['cnum','supcol'])
for col in df.columns: 
    df1=df.filter(['cnum']).loc[df[col] == 1]
    df1['supcol']=col
    df_union=df_union.append(df1)
print(df_union)

However it is printing all column names where 1 is set for the column name. I want only the first one. Kindly help


Solution

  • It seems like you can use idxmax here:

    df.set_index('cnum').idxmax(axis=1).reset_index(drop=True)
    
    0    sup1
    1    sup1
    2    sup3
    3    sup2
    dtype: object
    
    df['output'] = df.set_index('cnum').idxmax(axis=1).reset_index(drop=True) 
    # Slightly faster,
    # df['output'] = df.set_index('cnum').idxmax(axis=1).to_numpy() 
    
    df
             cnum  sup1  sup2  sup3  sup4 output
    0   285414459     1     0     1     1   sup1
    1   445633709     1     0     0     0   sup1
    2   556714736     0     0     1     0   sup3
    3  1089852074     0     1     0     1   sup2
    

    Another option with dot (will give you all non-zero columns):

    d = df.set_index('cnum') 
    d.dot(d.columns + ',').str.rstrip(',').reset_index(drop=True)
    
    0    sup1,sup3,sup4
    1              sup1
    2              sup3
    3         sup2,sup4
    dtype: object
    

    Or,

    (d.dot(d.columns + ',')
      .str.rstrip(',')
      .str.split(',', 1).str[0] 
      .reset_index(drop=True))
    
    0    sup1
    1    sup1
    2    sup3
    3    sup2
    dtype: object