pythonpandasmarket-basket-analysis

Market Basket Association Analysis python or SQL


I am using a dataset as below. Rows show invoice numbers, columns show products. I want to show the number of products on the same invoice as a matrix (i.e. there will be products in both rows and columns, the intersection of the row and column will show how many times those 2 products are on the same invoice. How can I do that? Thanks.

Note: '1' indicates that the product is included in that invoice, and '0' indicates that it is not. finally i want to get a matrix like in the picture. intersecting cells show the number of sales of the relevant product pair.

import pandas as pd
ids = ['invoice_1','invoice_2','invoice_3','invoice_4','invoice_5','invoice_6']
A= [0,0,1,0,1,1]
B= [0,1,1,0,1,1]
C= [1,1,1,0,1,0]
D= [1,0,0,1,1,0]
df=pd.DataFrame.from_dict({'A':A, 'B':B, 'C':C, 'D':D})
df.index=ids

Actually I want to get Table 2 from Table 1. AA=3 because product A is included in 3 invoices (rows) totally. AB=4 because A and B are included in 4 invoices (rows) together. Note: Even if the AA, BB, CC , DD cells are not full, it does not matter. Binary products (like AB, DC etc.) are important to me.

Table 1
                A   B   C   D
    invoice_1   0   0   1   1
    invoice_2   0   1   1   0
    invoice_3   1   1   1   0
    invoice_4   0   0   0   1
    invoice_5   1   1   1   1
    invoice_6   1   1   0   0
    invoice_7   1   1   0   0
    
Table 2    
        A   B   C   D
    A   4   4   2   1
    B   4   4   3   1
    C   2   3   4   2
    D   1   1   2   3

Solution

  • The OP made two mistakes here. The first one is the input to generate the intended Table 1 should be:

    import pandas as pd
    
    ids = ['invoice_1', 'invoice_2', 'invoice_3', 'invoice_4', 'invoice_5', 'invoice_6', 'invoice_7']
    A = [0, 0, 1, 0, 1, 1, 1]
    B = [0, 1, 1, 0, 1, 1, 1]
    C = [1, 1, 1, 0, 1, 0, 0]
    D = [1, 0, 0, 1, 1, 0, 0]
    df = pd.DataFrame(data={'A': A, 'B': B, 'C': C, 'D': D}, index=ids)
    
    Table 1
                    A   B   C   D
        invoice_1   0   0   1   1
        invoice_2   0   1   1   0
        invoice_3   1   1   1   0
        invoice_4   0   0   0   1
        invoice_5   1   1   1   1
        invoice_6   1   1   0   0
        invoice_7   1   1   0   0
    

    Table 2 is actually the dot product of 2 matrices, i.e. matrix df and its transpose. One can use numpy.dot to produce it.

    import numpy as np
    
    pd.DataFrame(data=np.dot(df.T, df), index=df.columns, columns=df.columns)
    
    Table 2    
            A   B   C   D
        A   4   4   2   1
        B   4   5   3   1
        C   2   3   4   2
        D   1   1   2   3
    

    The second mistake I pointed out is BB where it should be 5 instead of 4.