python data-mining apriori

Count frequency of itemsets in the given data frame


I have the following data frame:

data = pd.read_csv('sample.csv', sep=',')


I need to count the frequency of each itemset in a given set. For example:

itemsets = {(143, 157), (143, 166), (175, 178), (175, 190)}

I want to count how often each tuple occurs in the data frame (I'm trying to implement the Apriori algorithm). I'm having particular trouble addressing each tuple individually and searching the data frame for the whole tuple rather than for individual entries.

Update-1

For example, the data frame looks like this:

39, 120, 124, 205, 401, 581, 704, 814, 825, 834
35, 39,  205, 712, 733, 759, 854, 950
39, 422, 449, 704, 825, 857, 895, 937, 954, 964

Update-2

The function should increment the count for a tuple only if all the values in that tuple are present in a particular row. For example, if I search for (39, 205), it should return a frequency of 2, because two of the rows (the first and the second) include both 39 and 205.


Solution

  • This function returns a dictionary that maps each tuple to the number of rows in the data frame that contain all of its values.

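    from collections import defaultdict

    def count(df, sequence):
        # Maps each tuple to the number of rows that contain all of its items.
        dict_data = defaultdict(int)
        shape = df.shape[0]
        for items in sequence:
            for row in range(shape):
                # all(...) is True only if every item occurs in this row; True adds 1.
                dict_data[items] += all(item in df.iloc[row, :].values for item in items)
        return dict_data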
    

    Pass the data frame and the set of tuples to count(), and it returns the number of rows in which each tuple occurs. For example, with the three sample rows and itemsets set to {(39, 205)}:

    >>> count(data, itemsets)
    defaultdict(<class 'int'>, {(39, 205): 2})
    

    You can convert the defaultdict to a plain dictionary with dict(), i.e.

    >>> dict(count(data, itemsets))
    {(39, 205): 2}
    

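    Both hold the same counts; they differ only for a tuple that was never counted, where the defaultdict returns 0 and the plain dictionary raises a KeyError.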
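
The function above rescans every row for every item of every tuple, which may get slow on larger data. As a rough sketch of an alternative (not part of the original answer), each row can be converted to a set once and each tuple then tested with a subset check. The name count_itemsets and the sample frame built here are assumptions for illustration; the rows are the three from the question, and shorter rows are assumed to be padded with NaN, as they would be after loading ragged data.

```python
import pandas as pd


def count_itemsets(df, itemsets):
    """Count, for each itemset, the rows of df that contain all of its items."""
    # Convert each row to a set once, dropping the NaN padding of short rows.
    row_sets = [set(row.dropna()) for _, row in df.iterrows()]
    # issubset is True only when every item of the tuple occurs in the row.
    return {items: sum(set(items).issubset(row) for row in row_sets)
            for items in itemsets}


# The three sample rows from the question (hypothetical stand-in for sample.csv).
rows = [
    [39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
    [35, 39, 205, 712, 733, 759, 854, 950],
    [39, 422, 449, 704, 825, 857, 895, 937, 954, 964],
]
sample = pd.DataFrame(rows)  # the shorter second row is padded with NaN

result = count_itemsets(sample, {(39, 205), (39, 704), (143, 157)})
print(result)
```

On these rows, (39, 205) is counted twice, matching the expected frequency in the question, and a tuple that appears in no row, such as (143, 157), gets a count of 0.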