pandasdataframe

How to generate a new column in the dataframe that indicates the columns with positive results?


I have a dataframe that looks like:

ID   f_1   f_2   f_3
1    1     0     1
2    0     1     1
3    1     1     0
4    1     0     1
5    0     1     1   

I have completely no idea as to even how to begin. And also my original dataframe is rather large (~1M rows) and hence a fast method would be highly appreciated.

and I would like to generate a new column Result which records the pair of f's that have 1 in them, i.e.

ID   f_1   f_2   f_3   Result
1    1     0     1     1_3
2    0     1     1     2_3
3    1     1     0     1_2
4    1     0     1     1_3
5    0     1     1     2_3

Solution

  • You could use a dot product after renaming the columns with str.replace/str.removeprefix:

    tmp = df.drop(columns='ID')
    df['Result'] = (tmp @ tmp.columns.str.replace('^f', '', regex=True)).str[1:]
    
    # variant
    df['Result'] = (tmp @ tmp.columns.str.removeprefix('f')).str[1:]
    

    Alternatively, a more classical pandas (much slower) approach with reshaping (melt), filtering (with query), and groupby.agg:

    df['Result'] = (df.melt('ID', ignore_index=False).query('value == 1')
                      .groupby(level=0)['variable']
                      .agg(lambda x: '_'.join(x.str.extract('_(\d+)', expand=False)))
                   )
    

    Or with stack:

    s = df.drop(columns='ID').stack()
    df['Result'] = (s[s==1].reset_index(-1).groupby(level=0)['level_1']
                    .agg(lambda x: '_'.join(x.str.extract('_(\d+)', expand=False)))
                   )
    

    Output:

       ID  f_1  f_2  f_3 Result
    0   1    1    0    1    1_3
    1   2    0    1    1    2_3
    2   3    1    1    0    1_2
    3   4    1    0    1    1_3
    4   5    0    1    1    2_3
    

    Timings

    On 20K rows:

    # dot product
    2.96 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # melt + groupby.agg
    965 ms ± 59.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # stack + groupby.agg
    928 ms ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    On 1M rows:

    # dot product
    359 ms ± 56.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # melt/stack + groupby.agg
    did not run under a few minutes