python pandas numpy

How to convert a column of lists into one-hot encoded columns?


Suppose there is a DataFrame such as the following:

import pandas as pd 
import numpy as np 

df = pd.DataFrame({'id':range(1,4), 
                   'items':[['A', 'B'], ['A', 'B', 'C'], ['A', 'C']]})
df
        id  items
        1   [A, B]
        2   [A, B, C]
        3   [A, C]

Is there an efficient way to convert the above DataFrame into the following, with one one-hot encoded column per distinct item? Many thanks in advance!

   id   items       A   B   C
    1   [A, B]      1   1   0
    2   [A, B, C]   1   1   1
    3   [A, C]      1   0   1

Solution

    SOLUTION 1

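    A possible solution: explode the lists so that each item gets its own row, cross-tabulate id against item with pd.crosstab, turn the resulting index back into an id column, and merge the indicator columns onto df (reset_index(names=...) requires pandas 1.5 or later):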

    df.merge(
        pd.crosstab(*df.explode('items').to_numpy().T)
        .reset_index(names='id'))
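
For readers who want to inspect the intermediate results, the one-liner above can be unpacked as in the sketch below. The variable names (exploded, ids, items, counts, result) are introduced here for illustration only and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'id': range(1, 4),
                   'items': [['A', 'B'], ['A', 'B', 'C'], ['A', 'C']]})

# Step 1: one row per (id, item) pair
exploded = df.explode('items')

# Step 2: transpose the values so they unpack into two 1-D arrays,
# one holding the ids and one holding the items
ids, items = exploded.to_numpy().T

# Step 3: count how often each item occurs for each id
counts = pd.crosstab(ids, items)

# Step 4: turn the index into an 'id' column (names= needs pandas >= 1.5)
counts = counts.reset_index(names='id')

# Step 5: merge the indicator columns back onto the original frame
result = df.merge(counts)
print(result)
```

Because pd.crosstab counts occurrences, a list that repeats an item would produce a value above 1 in that column rather than a strict 0/1 indicator.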
    

    SOLUTION 2

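    Another possible solution: explode the lists, count the rows for each (id, item) pair with pivot_table (filling missing pairs with 0), remove the column-axis name, restore id as a column, and merge the result onto df: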

    df.merge(
        df.explode('items')
        .pivot_table(index='id', columns='items', values='id', aggfunc=len, 
                     fill_value=0)
        .rename_axis(None, axis=1).reset_index())
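
The same chain can be split into separate statements to make each step visible. This is only an illustrative sketch; the variable names (exploded, counts, result) are assumptions added here and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({'id': range(1, 4),
                   'items': [['A', 'B'], ['A', 'B', 'C'], ['A', 'C']]})

# Step 1: one row per (id, item) pair
exploded = df.explode('items')

# Step 2: count the rows for each (id, item) pair;
# pairs that never occur are filled with 0
counts = exploded.pivot_table(index='id', columns='items', values='id',
                              aggfunc=len, fill_value=0)

# Step 3: drop the 'items' label from the column axis
# and move 'id' from the index back into a regular column
counts = counts.rename_axis(None, axis=1).reset_index()

# Step 4: merge the indicator columns back onto the original frame
result = df.merge(counts)
print(result)
```

As with the crosstab variant, aggfunc=len counts rows, so repeated items within one list would give counts above 1.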
    

    Output:

       id      items  A  B  C
    0   1     [A, B]  1  1  0
    1   2  [A, B, C]  1  1  1
    2   3     [A, C]  1  0  1