Tags: python, list, dictionary, dataframe, symmetric-difference

How to efficiently find identical indices in multiple dataframes


I have a process that collects reports generated throughout the week and consolidates the collection to eliminate identical reports.

I've written a function that identifies identical reports by finding those with identical indices, keeps one copy of each, and discards the rest. It works fine for 5,000-10,000 reports, but it starts to take a serious amount of time to process, say, 50,000+ reports, which will only become more common as time goes on.

It would be nice if I could eliminate the duplicate reports pre-emptively and avoid this step altogether, but the process of generating the reports doesn't allow for that. So, I want to find a way to make this (or a similar) function more efficient.

The code is below:

def report_diff_index(self,dnc_data,folders):
    master_report_dict, master_val_dict = self.report_orderer(folders)
    sorts = self.report_sorter(dnc_data,master_report_dict)
    keys = list(sorts)
    consolidated_sorts = keys.copy()  # copy, so removals don't mutate keys mid-iteration
    print('Original Report Size: ', len(consolidated_sorts))
    for k in keys:
        if k in consolidated_sorts:
            for j in keys[keys.index(k)+1:]:
                if j in consolidated_sorts:
                    if len(set(sorts[k].index).symmetric_difference(sorts[j].index)) == 0:
                        consolidated_sorts.remove(j)
    print('Consolidated Report Size: ', len(consolidated_sorts))
    consolidated_report = {}
    consolidated_val = {}
    for s in consolidated_sorts:
        consolidated_report[s] = master_report_dict[s]
        consolidated_val[s] = master_val_dict[s]
    return consolidated_report, consolidated_val
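The core test in the inner loop asks whether two reports have the same set of index labels, regardless of order. A minimal sketch on made-up toy data (the frame names and values here are invented for illustration) shows that the `symmetric_difference` check is equivalent to a direct set equality, which reads more clearly:

```python
import pandas as pd

# Two toy "reports" with the same index labels in different order
a = pd.DataFrame({"v": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"v": [9, 8]}, index=["r2", "r1"])

# The function's test: empty symmetric difference means identical index sets
same = len(set(a.index).symmetric_difference(b.index)) == 0

# An equivalent, simpler formulation
assert same == (set(a.index) == set(b.index))
```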

Solution

  • I do not know if I understand your problem correctly, and even if I do, I do not know whether this is faster. But couldn't you build a dict keyed on the unique report index (e.g. as a frozenset), with the report key as the value? That feels like a quicker way to build the unique list, though I may be off:

    def report_diff_index(self,dnc_data,folders):
        master_report_dict, master_val_dict = self.report_orderer(folders)
        sorts = self.report_sorter(dnc_data,master_report_dict)
        print('Original Report Size: ', len(sorts))
        unique_reports = dict()
        for report_key, report in sorts.items():
            key = frozenset(report.index)
            # Alt 1. Replace with the new (identical) report:
            # unique_reports[key] = report_key
            # Alt 2. Keep the first report seen:
            if key not in unique_reports:
                unique_reports[key] = report_key
        consolidated_sorts = unique_reports.values()
        print('Consolidated Report Size: ', len(consolidated_sorts))
        consolidated_report = {}
        consolidated_val = {}
        for s in consolidated_sorts:
            consolidated_report[s] = master_report_dict[s]
            consolidated_val[s] = master_val_dict[s]
        return consolidated_report, consolidated_val
    

    As you can see, there are two options for the dict update, depending on whether you want to keep the first report found or whether it does not matter (use one and comment out the other).

    Insertion into a dict is O(1) on average, so I would expect this to be rather quick.
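    To see the frozenset-keyed dedup in isolation, here is a runnable sketch on toy data (the report names and values are made up for the example; in the real function, `sorts` comes from `self.report_sorter`):

    ```python
    import pandas as pd

    # Toy stand-in for sorts: report key -> DataFrame
    sorts = {
        "rep_a": pd.DataFrame({"v": [1, 2]}, index=["x", "y"]),
        "rep_b": pd.DataFrame({"v": [3, 4]}, index=["y", "x"]),  # same index set as rep_a
        "rep_c": pd.DataFrame({"v": [5]}, index=["z"]),
    }

    unique_reports = {}
    for report_key, report in sorts.items():
        key = frozenset(report.index)  # order-insensitive, hashable index fingerprint
        if key not in unique_reports:  # keep the first report seen
            unique_reports[key] = report_key

    # One surviving report key per distinct index set
    print(sorted(unique_reports.values()))
    ```

    Each report is visited once and each dict lookup is O(1) on average, so the whole pass is roughly linear in the number of reports, versus the quadratic pairwise comparison in the original.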