I have ecommerce data with about 6,000 SKUs and 250,000 observations. A simplified version is below; the real data is much sparser. There is only one SKU per line, as each line is a transaction.
What I have:
|Index| ID | SKU1| SKU2 | SKU3|
|:----|:----|:----|:-----|:----|
| 1 | 55 | 1 | 0 | 0 |
| 2 | 55 | 0 | 1 | 0 |
| 3 | 55 | 0 | 0 | 1 |
| 4 | 66 | 0 | 1 | 0 |
| 5 | 66 | 1 | 0 | 0 |
| 6 | 77 | 0 | 1 | 0 |
I want to create a weighted, undirected adjacency matrix so that I can do some graph analysis on the market baskets. It would look like the table below, where SKU1 and SKU2 were bought together in baskets 55 and 66 and therefore have a weight of 2.
What I want:
|Index| SKU1| SKU2| SKU3 | SKU4|
|:----|:----|:----|:-----|:----|
| SKU1| 0 | 2 | 1 | 0 |
| SKU2| 2 | 0 | 1 | 0 |
| SKU3| 1 | 1 | 0 | 0 |
| SKU4| 0 | 0 | 0 | 0 |
I have tried a for loop iterating through the original DataFrame, but it crashes immediately.
Ideally I would collapse the first DataFrame by the ID column without aggregating, since there are no duplicate rows for the same item and the same ID. However, when I try to collapse using df.groupby(['ID']).count()
I get the table below. When I remove .count() there is no output at all (groupby on its own just returns a GroupBy object). I'm sure there is another way to do this but can't seem to find it in the documentation.
What I tried: df.groupby(['ID']).count()
| ID | SKU1| SKU2 | SKU3|
|:----|:----|:---- |:----|
| 55 | 3 | 3 | 3 |
| 66 | 2 | 2 | 2 |
| 77 | 1 | 1 | 1 |
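For context, pandas' `count()` tallies non-NA cells per group, not nonzero ones, which is why every column simply shows the basket's row count. A small sketch on the sample data above:

```python
import pandas as pd

# Sample data from the question: one row per transaction line
df = pd.DataFrame({
    'ID':   [55, 55, 55, 66, 66, 77],
    'SKU1': [1, 0, 0, 0, 1, 0],
    'SKU2': [0, 1, 0, 1, 0, 1],
    'SKU3': [0, 0, 1, 0, 0, 0],
})

# count() counts non-NA cells in each group, so zeros are counted too
print(df.groupby(['ID']).count())
```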
Anyone know how I can generate the sparse matrix without immediately crashing my computer?
.count() counts all rows, including zeros. Aggregate with sum instead, then binarize to 0s and 1s.
import numpy as np
import pandas as pd

agg = df.groupby('ID').agg('sum')
agg = (agg > 0).astype(int)
SKU1 SKU2 SKU3
ID
55 1 1 1
66 1 1 0
77 0 1 0
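Since the indicator columns are already 0/1, the same binary basket matrix can also be built in one step with `groupby(...).max()` — a small sketch on the sample data:

```python
import pandas as pd

# Sample data from the question: one row per transaction line
df = pd.DataFrame({
    'ID':   [55, 55, 55, 66, 66, 77],
    'SKU1': [1, 0, 0, 0, 1, 0],
    'SKU2': [0, 1, 0, 1, 0, 1],
    'SKU3': [0, 0, 1, 0, 0, 0],
})

# max() over each basket keeps a 1 if the SKU appears at least once
agg = df.groupby('ID').max()
print(agg)
```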
Turn it into a co-occurrence table, and fill the diagonal with 0s since a SKU's co-occurrence with itself isn't a meaningful edge in the graph.
occurrence = np.dot(agg.T, agg)
np.fill_diagonal(occurrence, 0)
Turn it back into a DataFrame:
pd.DataFrame(occurrence, columns=df.columns[1:], index=df.columns[1:])
SKU1 SKU2 SKU3
SKU1 0 2 1
SKU2 2 0 1
SKU3 1 1 0
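If the dense 6,000×6,000 dot product is still too heavy, the same computation can be done with SciPy sparse matrices. A sketch under the assumption that `agg` is the binarized baskets×SKUs frame built above (hard-coded here to keep the example self-contained):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Binarized basket matrix (baskets x SKUs), as produced by the groupby step
agg = pd.DataFrame(
    [[1, 1, 1], [1, 1, 0], [0, 1, 0]],
    index=pd.Index([55, 66, 77], name='ID'),
    columns=['SKU1', 'SKU2', 'SKU3'],
)

baskets = sparse.csr_matrix(agg.values)   # sparse baskets x SKUs
cooc = (baskets.T @ baskets).toarray()    # SKUs x SKUs co-occurrence counts
np.fill_diagonal(cooc, 0)                 # drop self-co-occurrence

adj = pd.DataFrame(cooc, index=agg.columns, columns=agg.columns)
print(adj)
```

Keeping the basket matrix sparse means the intermediate product never materializes 250,000 dense rows, only the (much smaller) SKU×SKU result.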