python apriori mlxtend

Find corresponding rows with frequent itemsets


My dataset is a binary adjacency matrix, comparable to customer purchase data. A toy example:

import pandas as pd

p = {'A': [0,1,0,1], 'B': [1,1,1,1], 'C': [0,0,1,1], 'D': [1,1,1,0]}
df = pd.DataFrame(data=p)
df

I am interested in the frequent itemsets, so I ran the apriori algorithm from mlxtend:

from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
frequent_itemsets

The output shows that itemset (D, B) occurs in 75% of the rows. But I am actually interested in which rows this itemset occurs in, since the index carries information (which customer bought these items).

In short, how can I filter my dataset to see which rows correspond to a specific itemset? Is there such a feature in this library, so that I could find that itemset (D, B) occurs in rows 0, 1 and 2?


Solution

  • It doesn't appear that there's a direct way to do this via apriori. However, one way would be as follows:

    from mlxtend.frequent_patterns import apriori
    
    frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
    # set of column names whose value is 1, per row
    # (the dot product concatenates the names, so this relies on single-character column names)
    cols = df.dot(df.columns).map(set).values.tolist()
    # for each itemset, collect the positions of the rows whose column set contains it
    set_itemsets = map(set, frequent_itemsets.itemsets.values.tolist())
    frequent_itemsets['indices'] = [[ix for ix, j in enumerate(cols) if i.issubset(j)]
                                     for i in set_itemsets]
    

    print(frequent_itemsets)
    
        support   itemsets       indices
    0      0.50        (A)        [1, 3]
    1      1.00        (B)  [0, 1, 2, 3]
    2      0.50        (C)        [2, 3]
    3      0.75        (D)     [0, 1, 2]
    4      0.50     (A, B)        [1, 3]
    5      0.25     (A, C)           [3]
    6      0.25     (A, D)           [1]
    7      0.50     (C, B)        [2, 3]
    8      0.75     (B, D)     [0, 1, 2]
    9      0.25     (C, D)           [2]
    10     0.25  (A, B, C)           [3]
    11     0.25  (A, B, D)           [1]
    12     0.25  (C, B, D)           [2]
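
The approach above splits concatenated column names into characters, so it may not carry over to datasets whose columns have longer names. As a hedged alternative, here is a minimal sketch that selects the itemset's columns and keeps the rows where all of them are set. It uses only pandas, and `rows_containing` is a hypothetical helper name, not part of mlxtend:

```python
import pandas as pd

def rows_containing(df, itemset):
    """Return the index labels of rows in which every item of `itemset` is set."""
    cols = list(itemset)                       # accepts a set, frozenset, list or tuple
    mask = df[cols].astype(bool).all(axis=1)   # True where all selected columns are 1
    return df.index[mask].tolist()

# the toy dataset from the question
df = pd.DataFrame({'A': [0, 1, 0, 1], 'B': [1, 1, 1, 1],
                   'C': [0, 0, 1, 1], 'D': [1, 1, 1, 0]})

print(rows_containing(df, {'B', 'D'}))  # [0, 1, 2]
```

Assuming `frequent_itemsets` was built with `use_colnames=True` as above, the same helper could be applied to every itemset with `frequent_itemsets['itemsets'].apply(lambda s: rows_containing(df, s))`. Unlike `enumerate`, this returns the actual index labels, which is useful when the index holds customer identifiers.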