I have a dataframe that looks like:

ID  f_1  f_2  f_3
1   1    0    1
2   0    1    1
3   1    1    0
4   1    0    1
5   0    1    1

and I would like to generate a new column Result that records the pair of f columns that contain a 1, i.e.:

ID  f_1  f_2  f_3  Result
1   1    0    1    1_3
2   0    1    1    2_3
3   1    1    0    1_2
4   1    0    1    1_3
5   0    1    1    2_3

I have no idea how to begin. Also, my original dataframe is rather large (~1M rows), so a fast method would be highly appreciated.
You could use a dot product after renaming the columns with str.replace/str.removeprefix:
tmp = df.drop(columns='ID')
# rename 'f_1' -> '_1', then dot with the 0/1 values; .str[1:] strips the leading '_'
df['Result'] = (tmp @ tmp.columns.str.replace('^f', '', regex=True)).str[1:]
# variant
df['Result'] = (tmp @ tmp.columns.str.removeprefix('f')).str[1:]
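The trick works because Python's integer–string multiplication gives `0 * '_3' == ''` and `1 * '_3' == '_3'`, so the dot product concatenates exactly the suffixes of the columns holding a 1. A minimal reproducible sketch on the sample data (note `Index.str.removeprefix` requires pandas ≥ 1.4):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':  [1, 2, 3, 4, 5],
    'f_1': [1, 0, 1, 1, 0],
    'f_2': [0, 1, 1, 0, 1],
    'f_3': [1, 1, 0, 1, 1],
})

# 0 * '_k' yields '' and 1 * '_k' yields '_k', so each row's dot product
# is the concatenation of the suffixes of its 1-valued columns
tmp = df.drop(columns='ID')
df['Result'] = (tmp @ tmp.columns.str.removeprefix('f')).str[1:]

print(df['Result'].tolist())  # ['1_3', '2_3', '1_2', '1_3', '2_3']
```

Note that a row of all zeros would produce an empty string rather than NaN with this approach.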
Alternatively, a more classical pandas approach (much slower) with reshaping (melt), filtering (query), and groupby.agg:
df['Result'] = (df.melt('ID', ignore_index=False).query('value == 1')
                  .groupby(level=0)['variable']
                  .agg(lambda x: '_'.join(x.str.extract(r'_(\d+)', expand=False)))
                )
Or with stack:
s = df.drop(columns='ID').stack()
df['Result'] = (s[s == 1].reset_index(-1).groupby(level=0)['level_1']
                 .agg(lambda x: '_'.join(x.str.extract(r'_(\d+)', expand=False)))
               )
Output:

   ID  f_1  f_2  f_3 Result
0   1    1    0    1    1_3
1   2    0    1    1    2_3
2   3    1    1    0    1_2
3   4    1    0    1    1_3
4   5    0    1    1    2_3
On 20K rows:
# dot product
2.96 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# melt + groupby.agg
965 ms ± 59.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# stack + groupby.agg
928 ms ± 43.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
On 1M rows:
# dot product
359 ms ± 56.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# melt/stack + groupby.agg
did not finish within a few minutes