This seems a little to me like one-hot encoding, but notably different. What I want to do is take a row of integers from a Pandas DataFrame and produce a binary column with 1's at the index locations specified by the integers and 0's everywhere else. If possible, I would like to do this for many rows at the same time. So a trivial example would be given as taking
index | A | B | C |
---|---|---|---|
0 | 1 | 4 | 7 |
1 | 2 | 5 | 8 |
2 | 3 | 6 | 9 |
and producing
index | .0 | .1 | .2 |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
6 | 0 | 0 | 1 |
7 | 1 | 0 | 0 |
8 | 0 | 1 | 0 |
9 | 0 | 0 | 1 |
I have tried
new_df = pandas.DataFrame(range(old_df.max(axis=None)+1)).isin(list(old_df.iloc[0]))
which works for a single row (the first row of the old_df in this case), but doesn't seem easily scalable to an arbitrary number of rows. Is there a built in function that does something similar to this?
Another possible solution:
pd.get_dummies(df.stack()).groupby(level=0).sum().T
First, it uses stack
to pivot the data from a wide to a long format, creating a series with a MultiIndex that pairs original row and column positions. This long-format series is then fed into get_dummies
to perform the one-hot encoding, converting the integer values into new binary columns. Subsequently, groupby(level=0).sum()
aggregates these results by summing the binary indicators for each of the original rows (the first level of the MultiIndex). Finally, the .T
attribute pivots the resulting dataframe.
Output:
0 1 2
1 1 0 0
2 0 1 0
3 0 0 1
4 1 0 0
5 0 1 0
6 0 0 1
7 1 0 0
8 0 1 0
9 0 0 1
Intermediates:
# (df.stack(),
# pd.get_dummies(df.stack()),
# [g for g in pd.get_dummies(df.stack()).groupby(level=0)],
# pd.get_dummies(df.stack()).groupby(level=0).sum())
(0 A 1
B 4
C 7
1 A 2
B 5
C 8
2 A 3
B 6
C 9
dtype: int64,
1 2 3 4 5 6 7 8 9
0 A True False False False False False False False False
B False False False True False False False False False
C False False False False False False True False False
1 A False True False False False False False False False
B False False False False True False False False False
C False False False False False False False True False
2 A False False True False False False False False False
B False False False False False True False False False
C False False False False False False False False True,
[(0,
1 2 3 4 5 6 7 8 9
0 A True False False False False False False False False
B False False False True False False False False False
C False False False False False False True False False),
(1,
1 2 3 4 5 6 7 8 9
1 A False True False False False False False False False
B False False False False True False False False False
C False False False False False False False True False),
(2,
1 2 3 4 5 6 7 8 9
2 A False False True False False False False False False
B False False False False False True False False False
C False False False False False False False False True)],
1 2 3 4 5 6 7 8 9
0 1 0 0 1 0 0 1 0 0
1 0 1 0 0 1 0 0 1 0
2 0 0 1 0 0 1 0 0 1)