Tags: python, performance, list, deduplication

Improving Run Time for deduping lists based on only certain columns in Python


I have two CSV files. I'm trying to remove all rows from one file where certain columns match a row in the other. I thought I'd use lists in Python to do this. I thought it'd be fast, but it's running way too slowly.

I only want to compare the first 3 columns, since the last 2 are unreliable. However, I still want to export the last 2 columns.

Example:

A = [
(Jack, Smith, New York, USA, 100),
(Jim, Doe, Cleveland, UK, 200),
(Frank, Johnson, Chicago, USA, 300)
]

B = [
(Jack, Smith, New York, United States, blank),
(Jerry, Smith, Cleveland, USA, blank),
(Frank, Johnson, Chicago, America, blank)
]

Matched List = [
(Jack, Smith, New York, USA, 100),
(Frank, Johnson, Chicago, USA, 300)
]

Desired List = [
(Jim, Doe, Cleveland, UK, 200)
]

So I wrote two nested for loops to compare the two lists and remove matched items. However, list A has ~50,000 rows and list B has ~600,000 rows, and this takes about 3.5 hours. I eventually need to run it on sets of 300,000 and 4,000,000 rows; at that rate it would run for days.

Here are the two for loops (in my real data I'm comparing columns 0, 7, 9, and 10):

for Acquisition_row in Acquisition_list[:]:   # iterate over a copy so removal is safe
    for Leads_row in Leads_list:
        # only the reliable columns take part in the match
        if (Acquisition_row[0] == Leads_row[0]) and (Acquisition_row[7] == Leads_row[7]) and (Acquisition_row[9] == Leads_row[9]) and (Acquisition_row[10] == Leads_row[10]):
            try:
                Acquisition_list.remove(Acquisition_row)
                Leads_list.append(Acquisition_row)
            except ValueError:
                print("Error!")
            break   # this row is handled; move on to the next Acquisition_row

Is there any way to speed this up? Is there a better approach? Should I use a different programming language? Maybe upload these to a temp table in a SQL database and do it in SQL?

Thanks!


Solution

  • @kindall is correct in suggesting set() or dict to keep track of what you've seen so far.

    def getKey(row):
        # only the columns you trust take part in the match
        return (row[0], row[7], row[9], row[10])

    # create a set of all the keys you care about -- one pass over the big list
    lead_keys = {getKey(r) for r in Leads_list}

    # cache the length: we iterate over a reversed copy and convert each
    # reversed index back into an index into the original list, so pops
    # never disturb positions we haven't visited yet
    len_ac_list = len(Acquisition_list)

    for i, ac_row in enumerate(Acquisition_list[::-1]):
        ac_key = getKey(ac_row)
        if ac_key in lead_keys:   # this look-up is O(1)
            index = len_ac_list - i - 1
            Acquisition_list.pop(index)
            Leads_list.append(ac_row)
            # maybe: lead_keys.add(ac_key)

    The benefits are: you only iterate over Leads_list once, when creating your set of keys (I chose Leads_list for this because it's the larger list, so it saves you more time); and each membership check for an Acquisition_list row takes constant time, O(1), rather than the O(n) scan of Leads_list you were doing before (where n is len(Leads_list)).

    In your original setup, you were doing, at worst, (n*m) or (300,000 * 4,000,000) comparisons, which is... a ton. With the set, you only do (n+m) or (300,000 + 4,000,000), which is... way fewer. Roughly 300,000 times fewer. That's the difference between 1.2 trillion things and about 0.000004 trillion (roughly 4 million) things. Two short sketches follow: one that just checks that arithmetic, and one that applies the same set-based idea directly to the CSV files.
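
    To make those numbers concrete, here's a quick back-of-the-envelope calculation (pure arithmetic, using the row counts from the question):

    n, m = 300_000, 4_000_000   # Acquisition rows, Leads rows

    nested_loop_ops = n * m     # every Acquisition row compared against every Leads row
    set_based_ops = n + m       # one pass to build the key set, one pass to probe it

    print(f"{nested_loop_ops:,}")                                      # 1,200,000,000,000
    print(f"{set_based_ops:,}")                                        # 4,300,000
    print(f"~{nested_loop_ops // set_based_ops:,}x fewer operations")  # ~279,069x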
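
    And here is a minimal end-to-end sketch of the same approach applied straight to the CSV files. The file names (leads.csv, acquisitions.csv, acquisitions_deduped.csv) are placeholders of my own, and it assumes the files have no header row; the column indices match the 0, 7, 9, 10 from your loop, so adjust them to your data.

    import csv

    def getKey(row):
        # same key as above: only the reliable columns
        return (row[0], row[7], row[9], row[10])

    # one pass over the big file to collect the keys to exclude
    with open("leads.csv", newline="") as f:
        lead_keys = {getKey(row) for row in csv.reader(f)}

    # one pass over the small file, writing out only rows whose key never appeared
    with open("acquisitions.csv", newline="") as f_in, \
         open("acquisitions_deduped.csv", "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            if getKey(row) not in lead_keys:
                writer.writerow(row)   # the unreliable last columns are still exported

    Even at 4,000,000 rows, a set of short string tuples should fit in memory on a typical machine, so there's probably no need to round-trip through a SQL temp table just for this.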