javaapriori

Candidate set generation in Apriori algorithm


I am trying to implement Apriori algorithm in Java, and have problems with generating Candidate itemsets. To create candidates for k-itemset I use all combinations of k-1 and 1-itemsets. For example, for Frequent 1-itemset: bread:9, milk:9, coffee:9, sugar:10.

Candidate 2-itemsets generated should be: bread milk, bread coffee, bread sugar, milk coffee, milk sugar, coffee sugar.

But my code returns: bread coffee, bread milk, bread sugar, coffee bread, coffee milk, coffee sugar, milk bread, milk coffee, milk sugar, sugar bread, sugar coffee, sugar milk (all permutations; returns both bread milk and milk bread, however, these two are the same thing).

My code:

public static ArrayList<ArrayList<String>> getCandidates(Map<String, Long> one_itemset_1, Map<String, Long> n_itemset_1){
    ArrayList<ArrayList<String>> tuples = new ArrayList<ArrayList<String>>();

    
    Map<String, Long> one_itemset = sortbykey(one_itemset_1);
    Map<String, Long> n_itemset = sortbykey(n_itemset_1);
    
    for(String item : one_itemset.keySet()) {
        
        for(String item2 : n_itemset.keySet()) {
            if(!item.equals(item2) && !item2.contains(item)) {
                ArrayList<String> singleList = new ArrayList<String>();
                singleList.add(item);
                String item2_sep[] = item2.split(" ");
                for(int i = 0; i < item2_sep.length; i++)
                    singleList.add(item2_sep[i]);
                //singleList.add(item2);
                tuples.add(singleList);
            }
            //index2++;
        }
        //index2 = 0;
        //index1++;
    }
    
    return tuples;
}

Is there a way to modify this method to get rid of repetitive itemsets?


Solution

  • The Apriori algorithm does not require you to do a join between Lk and all 1-itemsets to get Ck+1. This approach will have a much higher runtime. Rather we join Lk with itself under the condition that the first (k-1) elements are the same in both the k-itemsets: see here

    As for having duplicate items in your code, you could just define an ordering in the 1-items (I1 < I2 < I3 ... In) and then only consider the items in ascending (or descending) order. This will keep the itemsets sorted -> (I1, I3, I5) is a valid itemset but (I1, I5, I3) is not.

    This can be easily done with two for loops for your example:

    for (int i=0; i<k-itemsets.size()-1; i++){
        for(int j=i+1; j<k-itemsets.size(); j++){
            // compare k-itemsets here and join in ascending order to produce k+1-itemset
        }
    }
    

    I would also advise you to go through other sources if you do not understand the algorithm at first. For e.g., the SPMF library has the implementation of most popular algorithms related to pattern mining.