Tags: python, apriori, market-basket-analysis

Python: Optimise the function to find frequent itemsets of size k given candidate itemsets


I have written a function that finds the frequency of itemsets of size k given the candidate itemsets. The dataset contains more than 16,000 transactions. Can someone please help me optimize this function? In its current form it takes about 45 minutes to execute with minSupport = 1.
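
The structure of the counting step is roughly as follows (a simplified, hypothetical sketch; the exact code is omitted here):

    import pandas as pd

    # Simplified sketch of a per-candidate, per-transaction counting step
    # (not the exact original code).
    def selectLk_naive(dataSet, Ck, minSupport):
        counts = {}
        for items in Ck:                          # Ck: iterable of candidate itemsets (tuples)
            counts[items] = 0
            for _, row in dataSet.iterrows():     # one full pass over the transactions per candidate
                transaction = set(row.values)
                if all(item in transaction for item in items):
                    counts[items] += 1
        return {k: v for k, v in counts.items() if v >= minSupport}

    dataset = pd.DataFrame([[100, 160, 100, 160], [170, 180, 190, 200], [100, 160, 190, 200]])
    print(selectLk_naive(dataset, {(100, 160), (170, 180), (190, 200)}, 2))
    # support 2 for (100, 160) and (190, 200)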

Sample dataset (not reproduced here)


Solution

  • Algorithm 0 (See other algorithms below)

    I implemented a faster version of your algorithm using Numba. Numba is a JIT compiler that translates Python functions into highly optimized machine code (via LLVM). For many numerical algorithms Numba achieves a speedup of 50-200x.

    To use Numba you have to install it through pip install numba. Note that, at the time of writing, Numba only supports Python <= 3.8; support for 3.9 has not been released yet!
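
    As a quick sanity check that Numba is installed and working, you can time a trivial jitted function like the sketch below (my own toy example, not part of the solution). Note that the first call includes compilation time, so time the second call.

    import numba, numpy as np, time

    @numba.njit(cache = True)
    def count_matches(a, x):
        # Count occurrences of x in a 1D array with a plain loop (fast once jitted).
        n = 0
        for e in a:
            if e == x:
                n += 1
        return n

    a = np.random.randint(0, 100, 10_000_000)
    count_matches(a, 7)                     # first call compiles the function
    t = time.time()
    print(count_matches(a, 7), 'matches in', round(time.time() - t, 3), 'sec')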

    I have rewritten your code a bit to satisfy Numba's compilation requirements; it should behave identically to yours, but please run some tests of your own.

    The Numba-optimized code should give you a very good speedup!

    I also created some short artificial example input data for testing.

    Try it online!

    import numba, numpy as np, pandas as pd
    
    @numba.njit(cache = True)
    def selectLkNm(dataSet,Ck,minSupport):
        # Count, for every candidate itemset, how many transactions contain all of its items.
        dict_data = {}
        transactions = dataSet.shape[0]
        for items in Ck:
            count = 0
            while count < transactions:
                if items not in dict_data:
                    dict_data[items] = 0
                for item in items:
                    # Linear scan of the current transaction for this item.
                    for e in dataSet[count, :]:
                        if item == e:
                            break
                    else:
                        # Item not found in this transaction: candidate is not supported here.
                        break
                else:
                    # Every item was found: increment this candidate's support count.
                    dict_data[items] += 1
                count += 1
        # Keep only candidates whose support reaches minSupport.
        Lk = {}
        for k, v in dict_data.items():
            if v >= minSupport:
                Lk[k] = v
        return Lk

    def selectLk(dataSet, Ck, minSupport):
        # Numba cannot take a plain Python set of tuples, so copy Ck into a numba typed list.
        tCk = numba.typed.List()
        for e in Ck:
            tCk.append(e)
        return selectLkNm(dataSet.values, tCk, minSupport)
    
    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk = selectLk(dataset, C1, 2)
    print(Lk)
    

    Output:

    {(100, 160): 2, (190, 200): 2}
    
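    If you want to check the speedup on data closer to your real size, a rough timing harness like the one below can be appended after the code above (the 16000 x 20 shape and the value range are my assumptions; adjust them to match your dataset). The first call is a warm-up that includes Numba's compilation time, so the second call is the one being measured.

    import time, numpy as np, pandas as pd

    np.random.seed(0)
    big = pd.DataFrame(np.random.randint(100, 200, size = (16000, 20)))   # assumed shape and value range
    cands = {(100, 160), (170, 180), (190, 200)}

    selectLk(big, cands, 1)                 # warm-up call: triggers JIT compilation
    t = time.time()
    Lk = selectLk(big, cands, 1)
    print('selectLk took', round(time.time() - t, 3), 'sec, found', len(Lk), 'frequent itemsets')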

  • Algorithm 1 (See other algorithms below)

    I improved Algorithm 0 (above) by sorting your data; this gives a good speedup if Ck contains many candidates or each tuple inside Ck is quite long.

    Try it online!

    import numba, numpy as np, pandas as pd
    
    @numba.njit(cache = True)
    def selectLkNm(dataSet,Ck,minSupport):
        assert dataSet.ndim == 2
        # Sort each transaction once so that items can be located by binary search.
        dataSet2 = np.empty_like(dataSet)
        for i in range(dataSet.shape[0]):
            dataSet2[i] = np.sort(dataSet[i])
        dataSet = dataSet2
        dict_data = {}
        transactions = dataSet.shape[0]
        for items in Ck:
            count = 0
            while count < transactions:
                if items not in dict_data:
                    dict_data[items] = 0
                for item in items:
                    # Binary search in the sorted transaction: O(log N) per item.
                    ix = np.searchsorted(dataSet[count, :], item)
                    if not (ix < dataSet.shape[1] and dataSet[count, ix] == item):
                        break
                else:
                    # Every item was found: increment this candidate's support count.
                    dict_data[items] += 1
                count += 1
        Lk = {}
        for k, v in dict_data.items():
            if v >= minSupport:
                Lk[k] = v
        return Lk
        
    def selectLk(dataSet, Ck, minSupport):
        tCk = numba.typed.List()
        for e in Ck:
            tCk.append(e)
        return selectLkNm(dataSet.values, tCk, minSupport)
    
    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk = selectLk(dataset, C1, 2)
    print(Lk)
    

    Output:

    {(100, 160): 2, (190, 200): 2}
    

  • Algorithm 2 (See other algorithms below)

    If you're not allowed to use Numba, then I suggest the following improvements to your algorithm. I pre-sort your dataset so that searching for each item takes O(log N) time instead of O(N), which is much faster.

    I see that your code uses a pandas DataFrame, which means you have pandas installed, and pandas depends on NumPy, so NumPy is definitely available; therefore I use it here.
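
    For illustration, the membership test this relies on looks like the small standalone example below: once a row is sorted, np.searchsorted returns the insertion index in O(log N), and the item is present exactly when the value at that index equals the item.

    import numpy as np

    row = np.sort(np.array([190, 100, 160, 200]))       # sorted transaction: [100, 160, 190, 200]
    for item in (160, 170):
        ix = np.searchsorted(row, item)                  # O(log N) binary search
        present = ix < len(row) and row[ix] == item
        print(item, '->', present)                       # 160 -> True, 170 -> False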

    Try it online!

    import numpy as np, pandas as pd, collections
    
    def selectLk(dataSet,Ck,minSupport):
        # Sort each transaction once so that items can be located by binary search.
        dataSet = np.sort(dataSet.values, axis = 1)
        dict_data = collections.defaultdict(int)
        transactions = dataSet.shape[0]
        for items in Ck:
            count = 0
            while count < transactions:
                for item in items:
                    # Binary search in the sorted transaction: O(log N) per item.
                    ix = np.searchsorted(dataSet[count, :], item)
                    if not (ix < dataSet.shape[1] and dataSet[count, ix] == item):
                        break
                else:
                    # Every item was found: increment this candidate's support count.
                    dict_data[items] += 1
                count += 1
        # Keep only candidates whose support reaches minSupport.
        Lk = {k : v for k, v in dict_data.items() if v >= minSupport}
        return Lk
        
    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk = selectLk(dataset, C1, 2)
    print(Lk)
    

    Output:

    {(100, 160): 2, (190, 200): 2}
    

  • Algorithm 3

    I just realized that the sorting step of Algorithm 2 may not be the bottleneck; the while loop over transactions is probably the bottleneck instead.

    So to improve the situation I decided to implement a faster algorithm that uses a 2D version of searchsorted (there is no built-in 2D version, so it has to be implemented separately). It has no long pure-Python loops; most of the time is spent inside NumPy functions.

    Please try this Algorithm 3 to see whether it is faster; it should only be faster if the inner while loop, and not the sorting, was the bottleneck.
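
    The 2D searchsorted relies on the usual row-offset trick: shift each row (and its query values) by a per-row offset so that a single 1D searchsorted over the flattened array answers all rows at once. A minimal standalone illustration of the idea, with simplified hard-coded offsets rather than the exact helper used below:

    import numpy as np

    a = np.array([[100, 160, 190, 200],      # each row already sorted
                  [100, 120, 150, 180]])
    b = np.array([160, 150])                 # one query value per row

    off = np.arange(a.shape[0]) * 1000       # simplified per-row offsets (large enough here)
    flat = (a + off[:, None]).ravel()        # non-overlapping bands, globally sorted
    ix = np.searchsorted(flat, b + off) - np.arange(a.shape[0]) * a.shape[1]
    print(ix)                                # [1 2] -> column of 160 in row 0, of 150 in row 1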

    Try it online!

    import numpy as np, pandas as pd, collections
    
    def selectLk(dataSet, Ck, minSupport):
        def searchsorted2d(a, bs):
            # Row-wise searchsorted via the offset trick: shift each row (and its queries)
            # by a per-row offset so one 1D searchsorted over the flattened array answers
            # all rows at once.
            s = np.r_[0, (np.maximum(a.max(1) - a.min(1) + 1, bs.ravel().max(0)) + 1).cumsum()[:-1]]
            a_scaled = (a + s[:, None]).ravel()
            def sub(b):
                # Shift the queries by the same offsets and map back to per-row indices.
                b_scaled = b + s
                return np.searchsorted(a_scaled, b_scaled) - np.arange(len(s)) * a.shape[1]
            return sub

        assert dataSet.values.ndim == 2, dataSet.values.ndim
        dataSet = np.sort(dataSet.values, axis = 1)
        dict_data = collections.defaultdict(int)
        transactions = dataSet.shape[0]
        Ck = np.array(list(Ck))
        assert Ck.ndim == 2, Ck.ndim
        ss = searchsorted2d(dataSet, Ck)
        for items in Ck:
            cnts = np.zeros((dataSet.shape[0],), dtype = np.int64)
            for item in items:
                bs = item.repeat(dataSet.shape[0])                 # one copy of the item per transaction
                ixs = np.minimum(ss(bs), dataSet.shape[1] - 1)     # clamp insertion points to valid columns
                cnts[...] += (dataSet[(np.arange(dataSet.shape[0]), ixs)] == bs).astype(np.uint8)
            # A transaction supports the candidate iff every one of its items was found there.
            dict_data[tuple(items)] += int((cnts == len(items)).sum())
        return {k : v for k, v in dict_data.items() if v >= minSupport}
        
    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk = selectLk(dataset, C1, 2)
    print(Lk)
    

    Output:

    {(100, 160): 2, (190, 200): 2}
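
    If you want to compare Algorithm 2 and Algorithm 3 on something closer to your real workload, a rough benchmark along the lines below can be used. The 16000 x 20 shape, the value range and the number of candidates are my assumptions, and selectLk_algo2 / selectLk_algo3 are hypothetical names: both answers above define a function called selectLk, so rename them first.

    import time, numpy as np, pandas as pd

    np.random.seed(0)
    big = pd.DataFrame(np.random.randint(100, 1000, size = (16000, 20)))        # assumed shape and value range
    values = np.arange(100, 1000)
    cands = {tuple(sorted(np.random.choice(values, 2, replace = False))) for _ in range(50)}

    # selectLk_algo2 / selectLk_algo3 = the selectLk functions of Algorithms 2 and 3, renamed.
    for name, fn in [('Algorithm 2', selectLk_algo2), ('Algorithm 3', selectLk_algo3)]:
        t = time.time()
        res = fn(big, cands, 1)
        print(name, round(time.time() - t, 3), 'sec,', len(res), 'frequent itemsets')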