I have a table with rows in this format where the integers are a count:
A B C D E
0 a 2 0 3 x
1 b 1 2 0 y
I'd like to convert it into a format where each count is a one hot encoded row:
A B C D E
0 a 1 0 0 x
1 a 1 0 0 x
2 a 0 0 1 x
3 a 0 0 1 x
4 a 0 0 1 x
5 b 1 0 0 y
6 b 0 1 0 y
7 b 0 1 0 y
I wrote inefficient code which achieves this
# Sample DataFrame
data = {
'A': ['a', 'b'],
'B': [2, 1],
'C': [0, 2],
'D': [3, 0],
'E': ['x', 'y']
}
df = pd.DataFrame(data)
new_df = pd.DataFrame(columns=df.columns)
for index, row in df.iterrows():
first_val = row.iloc[0]
last_val = row.iloc[-1]
middle_vals = row.iloc[1:-1].astype(int)
for i in range(len(middle_vals)):
new_data = [first_val] + [1 if i == j else 0 for j in range(len(middle_vals))] + [last_val]
new_rows = pd.DataFrame([new_data] * middle_vals.iloc[i], columns=df.columns)
new_df = pd.concat([new_df, new_rows], ignore_index=True)
Any tips for vectorizing this operation which is incredibly slow? I realize a concat operation per iteration is a big issue, so I did try a batching solution where I collect chunks of new_rows
and then concat. This remains slow.
Here is a full numpy solution, I would expect this to be faster than reshaping:
import numpy as np
num_cols = ['B', 'C', 'D']
# convert to numpy array
a = df[num_cols].to_numpy()
# build indices to repeat
idx = np.repeat(np.arange(a.shape[0]), a.sum(1))
# array([0, 0, 0, 0, 0, 1, 1, 1])
# build column indices to repeat
cols = np.repeat(np.tile(np.arange(a.shape[1]), a.shape[0]), a.flat)
# array([0, 0, 2, 2, 2, 0, 1, 1])
# assign 1s
b = np.zeros((len(idx), len(num_cols)), dtype=int)
b[np.arange(len(idx)), cols] = 1
# repeat and update DataFrame
out = df.iloc[idx]
out.loc[:, num_cols] = b
Output:
A B C D E
0 a 1 0 0 x
0 a 1 0 0 x
0 a 0 0 1 x
0 a 0 0 1 x
0 a 0 0 1 x
1 b 1 0 0 y
1 b 0 1 0 y
1 b 0 1 0 y
# setup
N = 10_000
df = (pd.DataFrame(np.clip(np.random.randint(-10, 15, (N, 10)), 0, 15)).assign(A=np.arange(N), E=np.arange(N))
[['A', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'E']]
)
# numpy approach (this one)
45.2 ms ± 6.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# list approach (@EmiOB)
249 ms ± 36.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pure pandas melt+repeat+pivot approach
3.68 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)