Python Pandas: How to create a binary matrix from column of lists?


I have a Python Pandas DataFrame like the following:

      1
0  a, b
1     c
2     d
3     e

Each entry, such as "a, b", is a string representing a comma-separated list of user features.

How can I convert this into a binary matrix of the user features, like the following?

     a    b    c    d    e
0    1    1    0    0    0
1    0    0    1    0    0
2    0    0    0    1    0
3    0    0    0    0    1

I saw a similar question, "Creating boolean matrix from one column with pandas", but the column there does not contain entries that are lists.

I have tried these two approaches. Is there a way to merge them?

pd.get_dummies()

pd.get_dummies(df[1])


   a, b  c  d  e
0     1  0  0  0
1     0  1  0  0
2     0  0  1  0
3     0  0  0  1

df[1].apply(lambda x: pd.Series(x.split()))

      1
0  a, b
1     c
2     d
3     e

Also interested in different ways to create this type of binary matrix!

Any help is appreciated!

Thanks


Solution

  • I think you can use:

    df = (df.iloc[:, 0].str.split(', ', expand=True)
            .stack()
            .reset_index(drop=True)
            .str.get_dummies())

    print(df)
       a  b  c  d  e
    0  1  0  0  0  0
    1  0  1  0  0  0
    2  0  0  1  0  0
    3  0  0  0  1  0
    4  0  0  0  0  1
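
The chain above resets the index, so each feature ends up on its own row. As a rough sketch of one way to keep one row per original row, the variant below swaps in Series.explode (not the stack call used above; it needs pandas 0.25 or later). explode keeps the original row labels, so the per-feature indicator rows can be merged back with groupby(level=0).max(). The DataFrame construction is an assumed reconstruction of the question's data, and the names exploded and binary are only illustrative.

```python
import pandas as pd

# Assumed reconstruction of the question's frame: a single column named 1.
df = pd.DataFrame({1: ["a, b", "c", "d", "e"]})

# Split each string into a list, then give every feature its own row.
# explode repeats the original index label for each element of a list.
exploded = df[1].str.split(", ").explode()

# One-hot encode every feature, then collapse rows that share an original
# index label by taking the column-wise maximum.
binary = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

print(binary)
```

Taking the maximum per group acts as a logical OR over the indicator rows, so a row such as "a, b" ends up with a 1 in both the a and b columns.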
    

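    EDITED: To keep one row per original row, as in the desired output, apply str.get_dummies with a separator to the original df:

    print(df.iloc[:, 0].str.replace(' ', '').str.get_dummies(sep=','))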
       a  b  c  d  e
    0  1  1  0  0  0
    1  0  0  1  0  0
    2  0  0  0  1  0
    3  0  0  0  0  1
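
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the str.get_dummies approach. It assumes the data looks like the question's example (a single column named 1, with features separated by a comma and optional spaces). The names features, binary and result are illustrative, and the final join is an optional extra step that is not part of the answer above.

```python
import pandas as pd

# Assumed reconstruction of the question's frame: a single column named 1.
df = pd.DataFrame({1: ["a, b", "c", "d", "e"]})

# Remove the spaces so that every feature is delimited by a bare comma.
features = df[1].str.replace(" ", "", regex=False)

# str.get_dummies splits each string on the separator and returns one
# 0/1 indicator column per distinct feature, one row per original row.
binary = features.str.get_dummies(sep=",")

# Optional: keep the original column next to the indicator columns.
result = df.join(binary)

print(binary)
```

Stripping every space is fine for single-token features like these, but it would also change feature names that contain spaces; in that case passing sep=', ' to str.get_dummies directly may be a better fit.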