My dataset is an adjacency matrix, comparable to customer purchase data. A toy example:
import pandas as pd

p = {'A': [0,1,0,1], 'B': [1,1,1,1], 'C': [0,0,1,1], 'D': [1,1,1,0]}
df = pd.DataFrame(data=p)
df
Now I am interested in the frequent itemsets, so I ran apriori frequent itemset mining (FIM):
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
frequent_itemsets
Now we see that itemset (D, B) occurs in 75% of the rows. But I am actually interested in which rows this itemset occurs in, since the index carries information (which customer bought these items).
In short, I am curious how I can filter my dataset to see which rows correspond to a specific itemset. Is there such a feature in this package/library, so that I could find that itemset (D, B) occurs in rows 0, 1 and 2?
It doesn't appear that there's a direct way to do this via apriori. However, one way would be as follows:
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
# set of column names whose value is 1, per row
# (set() splits the concatenated string into characters, so this relies on single-character column names)
cols = df.dot(df.columns).map(set).values.tolist()
# for each frequent itemset, collect the positions of the rows whose set of items contains it
set_itemsets = map(set, frequent_itemsets.itemsets.values.tolist())
frequent_itemsets['indices'] = [[ix for ix, j in enumerate(cols) if i.issubset(j)]
                                for i in set_itemsets]
print(frequent_itemsets)
    support   itemsets       indices
 0     0.50        (A)        [1, 3]
 1     1.00        (B)  [0, 1, 2, 3]
 2     0.50        (C)        [2, 3]
 3     0.75        (D)     [0, 1, 2]
 4     0.50     (A, B)        [1, 3]
 5     0.25     (A, C)           [3]
 6     0.25     (A, D)           [1]
 7     0.50     (C, B)        [2, 3]
 8     0.75     (B, D)     [0, 1, 2]
 9     0.25     (C, D)           [2]
10     0.25  (A, B, C)           [3]
11     0.25  (A, B, D)           [1]
12     0.25  (C, B, D)           [2]
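If you only need the rows for one specific itemset, or your column names are longer than one character, a plain pandas boolean mask may be simpler. The following is a minimal sketch on the toy data from the question. The helper name `rows_containing` is made up for illustration and is not part of mlxtend or pandas.

```python
import pandas as pd

# Toy data from the question: rows are customers, columns are items.
df = pd.DataFrame({'A': [0, 1, 0, 1], 'B': [1, 1, 1, 1],
                   'C': [0, 0, 1, 1], 'D': [1, 1, 1, 0]})


def rows_containing(frame, itemset):
    """Return the index labels of the rows where every item in `itemset` is 1.

    `rows_containing` is a hypothetical helper, not a library function.
    """
    # Select only the itemset's columns and require all of them to be truthy.
    mask = frame[list(itemset)].astype(bool).all(axis=1)
    return frame.index[mask.to_numpy()].tolist()


print(rows_containing(df, {'B', 'D'}))  # [0, 1, 2]
```

Because this selects columns by name rather than splitting a concatenated string, it does not depend on single-character column names, and it returns index labels rather than row positions. Assuming `frequent_itemsets` is the DataFrame returned by apriori above, something like `frequent_itemsets['itemsets'].apply(lambda s: rows_containing(df, s))` should produce the same kind of `indices` column.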