pythonpandas-groupbysparse-matrixadjacency-matrixsku

Creating adjacency matrix from sparse SKU data in Python


I have ecommerce data with about 6000 SKUs and 250,000 obs. Simple version below but a lot more sparse. There is only one SKU per line as each line is a transaction.

What I have:

|Index| ID  | SKU1| SKU2 | SKU3| 
|:----|:----|:----|:-----|:----|
| 1   | 55  |  1  |  0   |  0  |
| 2   | 55  |  0  |  1   |  0  |
| 3   | 55  |  0  |  0   |  1  |
| 4   | 66  |  0  |  1   |  0  |
| 5   | 66  |  1  |  0   |  0  |
| 6   | 77  |  0  |  1   |  0  |

I want to create a weighted undirected adjacency matrix so that I can do some graph analysis on the market baskets. It would look like the below, where SKU2 and SKU1 were bought together in baskets 55 and 66 and therefore have a total weight of 2.

What I want:

|Index| SKU1| SKU2| SKU3 | SKU4| 
|:----|:----|:----|:-----|:----|
| SKU1|  0  |  2  |  1   |  0  |
| SKU2|  2  |  0  |  0   |  0  |
| SKU3|  1  |  0  |  0   |  0  |
| SKU4|  0  |  0  |  0   |  0  |

I have tried a for loop iterating through the original DF but it crashes immediately.

Ideally I would collapse the first dataframe by the ID column but without aggregating, as there are no duplicate transactions for the same item and same ID. However, when I try to collapse using df.groupby(['ID']).count() I get the following. When I remove .count() there is no output. I'm sure there is another way to do this but can't seem to find it in the documentation.

What I tried: df.groupby(['ID']).count()

| ID  | SKU1| SKU2 | SKU3| 
|:----|:----|:---- |:----|
| 55  |  3  |  3   |  3  |
| 66  |  2  |  2   |  2  |
| 77  |  1  |  1   |  1  |

Anyone know how I can generate the sparse matrix without immediately crashing my computer?


Solution

  • Count also counts zeros. Aggregate by sum instead and then convert to 0s and 1s.

    agg = df.groupby('ID').agg('sum')
    agg = (agg > 0).astype(int)
    
        SKU1    SKU2    SKU3
    ID          
    55  1       1       1
    66  1       1       0
    77  0       1       0
    

    Turn it into a occurrence table and fill the diagonal with 0s for whatever reason.

    occurrence = np.dot(agg.T, agg)
    np.fill_diagonal(occurrence, 0)
    

    Turn it back into a dataframe

    pd.DataFrame(occurrence, columns=df.columns[1:], index=df.columns[1:])
    
            SKU1    SKU2    SKU3
    SKU1    0       2       1
    SKU2    2       0       1
    SKU3    1       1       0