Tags: python, apriori

Parallel-processing efficient_apriori code in Python


I have 12 million records from an e-shop. I would like to compute association rules using the efficient_apriori package. The problem is that 12 million observations are too many, so the computation takes too much time. Is there a way to speed up the algorithm? I am thinking about parallel processing, or compiling the Python code to C. I tried PyPy, but PyPy does not support the pandas package. Thank you for any help or ideas.

If you want to see my code:

    import pandas as pd
    from efficient_apriori import apriori

    orders = pd.read_csv("orders.csv", sep=";")

    # One transaction (a tuple of item names) per customer
    customer = orders.groupby("id_customer")["name"].agg(tuple).tolist()

    itemsets, rules = apriori(
        customer, min_support=100 / len(customer), min_confidence=0
    )


Solution

  • You can use this approach to run the task in parallel (see the note on merging the per-chunk results after the code):

    from multiprocessing import Pool

    # Thresholds used by every worker; 100 / len(customer) matches the
    # sequential call above (efficient_apriori takes support as a fraction).
    MIN_SUPPORT = 100 / len(customer)
    MIN_CONFIDENCE = 0

    length_of_input_file = len(customer)
    total_offset_count = 4  # number of parallel processes to run
    offset = length_of_input_file // total_offset_count

    # Python slices exclude the end index, so no "-1" is needed; the last
    # chunk takes the remainder when the length is not divisible by four.
    dataNew1 = customer[0:offset]
    dataNew2 = customer[offset:2 * offset]
    dataNew3 = customer[2 * offset:3 * offset]
    dataNew4 = customer[3 * offset:]

    def calculate_frequent_itemset(fractional_data):
        """Compute frequent itemsets and rules for one chunk of the data."""
        itemsets, rules = apriori(
            fractional_data, min_support=MIN_SUPPORT, min_confidence=MIN_CONFIDENCE
        )
        return itemsets, rules

    if __name__ == "__main__":  # guard required by multiprocessing on Windows/macOS
        p = Pool()
        frequent_itemsets = p.map(
            calculate_frequent_itemset, (dataNew1, dataNew2, dataNew3, dataNew4)
        )
        p.close()
        p.join()

        itemsets1, rules1 = frequent_itemsets[0]
        itemsets2, rules2 = frequent_itemsets[1]
        itemsets3, rules3 = frequent_itemsets[2]
        itemsets4, rules4 = frequent_itemsets[3]
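
One caveat: splitting the transactions gives four local result sets, not the global one, since an itemset can fall below the threshold in one chunk and above it in another. The per-chunk counts have to be combined and re-checked against a global minimum, which is the idea behind partition-based Apriori. Below is a minimal merging sketch, assuming the itemsets dictionaries returned by efficient_apriori map itemset size to a dict of itemset → count; merge_itemsets and min_count are names introduced here for illustration:

    from collections import defaultdict

    def merge_itemsets(results, min_count):
        """Sum per-chunk counts and keep itemsets meeting the global minimum.

        `results` is the list returned by p.map above. Counts are only
        partial for itemsets that some chunks never reported, so for exact
        numbers the surviving candidates would be re-counted over the full
        data (the second pass of the partition/SON approach).
        """
        combined = defaultdict(lambda: defaultdict(int))
        for itemsets, _rules in results:
            for size, counts in itemsets.items():
                for itemset, count in counts.items():
                    combined[size][itemset] += count
        return {
            size: {s: c for s, c in counts.items() if c >= min_count}
            for size, counts in combined.items()
        }

    # Run inside the same __main__ guard as the Pool code above;
    # min_count=100 mirrors the question's min_support of 100/len(customer).
    global_itemsets = merge_itemsets(frequent_itemsets, min_count=100)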