pythonpandasdataframe

Convert (many) integer-valued rows into binary indicator columns using Pandas


This seems a little to me like one-hot encoding, but notably different. What I want to do is take a row of integers from a Pandas DataFrame and produce a binary column with 1's at the index locations specified by the integers and 0's everywhere else. If possible, I would like to do this for many rows at the same time. So a trivial example would be given as taking

index A B C
0 1 4 7
1 2 5 8
2 3 6 9

and producing

index .0 .1 .2
0 0 0 0
1 1 0 0
2 0 1 0
3 0 0 1
4 1 0 0
5 0 1 0
6 0 0 1
7 1 0 0
8 0 1 0
9 0 0 1

I have tried

new_df = pandas.DataFrame(range(old_df.max(axis=None)+1)).isin(list(old_df.iloc[0]))

which works for a single row (the first row of the old_df in this case), but doesn't seem easily scalable to an arbitrary number of rows. Is there a built in function that does something similar to this?


Solution

  • Another possible solution:

    pd.get_dummies(df.stack()).groupby(level=0).sum().T
    

    First, it uses stack to pivot the data from a wide to a long format, creating a series with a MultiIndex that pairs original row and column positions. This long-format series is then fed into get_dummies to perform the one-hot encoding, converting the integer values into new binary columns. Subsequently, groupby(level=0).sum() aggregates these results by summing the binary indicators for each of the original rows (the first level of the MultiIndex). Finally, the .T attribute pivots the resulting dataframe.

    Output:

       0  1  2
    1  1  0  0
    2  0  1  0
    3  0  0  1
    4  1  0  0
    5  0  1  0
    6  0  0  1
    7  1  0  0
    8  0  1  0
    9  0  0  1
    

    Intermediates:

    
    # (df.stack(), 
    #  pd.get_dummies(df.stack()),  
    #  [g for g in pd.get_dummies(df.stack()).groupby(level=0)], 
    #  pd.get_dummies(df.stack()).groupby(level=0).sum())
    
    (0  A    1
        B    4
        C    7
     1  A    2
        B    5
        C    8
     2  A    3
        B    6
        C    9
     dtype: int64,
              1      2      3      4      5      6      7      8      9
     0 A   True  False  False  False  False  False  False  False  False
       B  False  False  False   True  False  False  False  False  False
       C  False  False  False  False  False  False   True  False  False
     1 A  False   True  False  False  False  False  False  False  False
       B  False  False  False  False   True  False  False  False  False
       C  False  False  False  False  False  False  False   True  False
     2 A  False  False   True  False  False  False  False  False  False
       B  False  False  False  False  False   True  False  False  False
       C  False  False  False  False  False  False  False  False   True,
     [(0,
                1      2      3      4      5      6      7      8      9
       0 A   True  False  False  False  False  False  False  False  False
         B  False  False  False   True  False  False  False  False  False
         C  False  False  False  False  False  False   True  False  False),
      (1,
                1      2      3      4      5      6      7      8      9
       1 A  False   True  False  False  False  False  False  False  False
         B  False  False  False  False   True  False  False  False  False
         C  False  False  False  False  False  False  False   True  False),
      (2,
                1      2      3      4      5      6      7      8      9
       2 A  False  False   True  False  False  False  False  False  False
         B  False  False  False  False  False   True  False  False  False
         C  False  False  False  False  False  False  False  False   True)],
        1  2  3  4  5  6  7  8  9
     0  1  0  0  1  0  0  1  0  0
     1  0  1  0  0  1  0  0  1  0
     2  0  0  1  0  0  1  0  0  1)