I am using a dataset as below. Rows show invoice numbers, columns show products. I want to show the number of products on the same invoice as a matrix (i.e. there will be products in both rows and columns, the intersection of the row and column will show how many times those 2 products are on the same invoice. How can I do that? Thanks.
Note: '1' indicates that the product is included in that invoice, and '0' indicates that it is not. finally i want to get a matrix like in the picture. intersecting cells show the number of sales of the relevant product pair.
import pandas as pd
ids = ['invoice_1','invoice_2','invoice_3','invoice_4','invoice_5','invoice_6']
A= [0,0,1,0,1,1]
B= [0,1,1,0,1,1]
C= [1,1,1,0,1,0]
D= [1,0,0,1,1,0]
df=pd.DataFrame.from_dict({'A':A, 'B':B, 'C':C, 'D':D})
df.index=ids
Actually I want to get Table 2 from Table 1. AA=3 because product A is included in 3 invoices (rows) totally. AB=4 because A and B are included in 4 invoices (rows) together. Note: Even if the AA, BB, CC , DD cells are not full, it does not matter. Binary products (like AB, DC etc.) are important to me.
Table 1
A B C D
invoice_1 0 0 1 1
invoice_2 0 1 1 0
invoice_3 1 1 1 0
invoice_4 0 0 0 1
invoice_5 1 1 1 1
invoice_6 1 1 0 0
invoice_7 1 1 0 0
Table 2
A B C D
A 4 4 2 1
B 4 4 3 1
C 2 3 4 2
D 1 1 2 3
The OP made two mistakes here. The first one is the input to generate the intended Table 1 should be:
import pandas as pd
ids = ['invoice_1', 'invoice_2', 'invoice_3', 'invoice_4', 'invoice_5', 'invoice_6', 'invoice_7']
A = [0, 0, 1, 0, 1, 1, 1]
B = [0, 1, 1, 0, 1, 1, 1]
C = [1, 1, 1, 0, 1, 0, 0]
D = [1, 0, 0, 1, 1, 0, 0]
df = pd.DataFrame(data={'A': A, 'B': B, 'C': C, 'D': D}, index=ids)
Table 1
A B C D
invoice_1 0 0 1 1
invoice_2 0 1 1 0
invoice_3 1 1 1 0
invoice_4 0 0 0 1
invoice_5 1 1 1 1
invoice_6 1 1 0 0
invoice_7 1 1 0 0
Table 2 is actually the dot product of 2 matrices, i.e. matrix df
and its transpose. One can use numpy.dot to produce it.
import numpy as np
pd.DataFrame(data=np.dot(df.T, df), index=df.columns, columns=df.columns)
Table 2
A B C D
A 4 4 2 1
B 4 5 3 1
C 2 3 4 2
D 1 1 2 3
The second mistake I pointed out is BB where it should be 5 instead of 4.