Context: Data transformation for a logistic regression problem. I have the following data structure:
df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})
I want to do sth. like pd.explode
, but creating one row for item of total
, i.e. 5+6 rows where occurrence
number of rows hold 1
and the rest 0
(either in the occurrence
columns or a new target column).
Currently I'm doing it iteratively which is quite slow on large data:
expanded = []
for ix, row in df.iterrows():
for i in range(row["total"]):
row["y"] = 1 if i < row["occurrence"] else 0
expanded.append(row.copy())
df_out = pd.DataFrame(expanded).reset_index(drop=True)
df_out.drop(["total", "occurrence"], axis=1, inplace=True)
df_out
group y
0 A 1
1 A 1
2 A 0
3 B 1
4 B 0
5 B 0
6 B 0
7 B 0
You could repeat
the rows, then assign
a new column based on the output from groupby.cumcount
:
out = (df.loc[df.index.repeat(df['total']), ['group', 'occurrence']]
.assign(y=lambda x: x.groupby(level=0).cumcount()
.lt(x.pop('occurrence'))
.astype(int))
)
import numpy as np
out = pd.DataFrame({'group': np.repeat(df['group'], df['total']),
'y': np.repeat(np.tile([1, 0], len(df)),
np.r_[df['occurrence'],
df['total']-df['occurrence']])
})
Output:
group y
0 A 1
0 A 1
0 A 0
1 B 1
1 B 0
1 B 0
1 B 0
1 B 0
If you want to use explode
:
out = (df.assign(y= lambda x: [[1 if i<o else 0 for i in range(t)] for
t, o in zip(x['total'], x['occurrence'])])
.explode('y')
)
Output:
group total occurrence y
0 A 3 2 1
0 A 3 2 1
0 A 3 2 0
1 B 5 1 1
1 B 5 1 0
1 B 5 1 0
1 B 5 1 0
1 B 5 1 0
On 20k rows. The pure numpy approach is the fastest.
# original approach
9.96 s ± 248 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# repetition
7.17 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# numpy tile+repeat
1.31 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# explode
18.2 ms ± 959 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numpy + list comprehension (PaulS)
125 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numba (PaulS)
3.88 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)