I have two CSV files, and I'm trying to remove all rows where certain columns match. I thought I'd use lists in Python to do this, expecting it to be fast, but it's running way too slow.
I only want to compare the first 3 columns, since the last 2 are unreliable; however, I still want to export the last 2 columns.
Example:
A = [
    ("Jack", "Smith", "New York", "USA", 100),
    ("Jim", "Doe", "Cleveland", "UK", 200),
    ("Frank", "Johnson", "Chicago", "USA", 300)
]
B = [
    ("Jack", "Smith", "New York", "United States", None),
    ("Jerry", "Smith", "Cleveland", "USA", None),
    ("Frank", "Johnson", "Chicago", "America", None)
]
Matched = [
    ("Jack", "Smith", "New York", "USA", 100),
    ("Frank", "Johnson", "Chicago", "USA", 300)
]
Desired = [
    ("Jim", "Doe", "Cleveland", "UK", 200)
]
So I wrote two nested for loops to compare the two lists and remove matched items. However, my list A is ~50,000 rows and list B is ~600,000 rows, and this is taking 3.5 hours. I need to run it on a set of 300,000 and 4,000,000 rows; after seeing how long this takes, it would run for days.
Here are the two for loops. (In my real data I'm comparing columns 0, 7, 9, and 10.)
for Acquisition_row in Acquisition_list[:]:
    for Leads_row in Leads_list:
        if (Acquisition_row[0] == Leads_row[0]) and (Acquisition_row[7] == Leads_row[7]) and (Acquisition_row[9] == Leads_row[9]) and (Acquisition_row[10] == Leads_row[10]):
            try:
                Acquisition_list.remove(Acquisition_row)
                Leads_list.append(Acquisition_row)
            except:
                print("Error!")
Is there any way to speed this up? Is there a better approach? Should I use a different programming language? Maybe upload these to temp tables in a SQL database and do it in SQL?
Thanks!
@kindall is correct in suggesting set() or dict to keep track of what you've seen so far.
def getKey(row):
    return (row[0], row[7], row[9], row[10])

# create a set of all the keys you care about
lead_keys = {getKey(r) for r in Leads_list}

# save this off due to the reverse-indexing gyration below
len_ac_list = len(Acquisition_list)
for i, ac_row in enumerate(Acquisition_list[::-1]):
    ac_key = getKey(ac_row)
    if ac_key in lead_keys:  ## this look-up is O(1)
        index = len_ac_list - i - 1  # map the reversed index back to the original position
        Acquisition_list.pop(index)
        Leads_list.append(ac_row)
        ## maybe: lead_keys.add(ac_key)
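If you don't actually need to mutate Acquisition_list in place, you can skip the reverse-indexing gyration entirely and just rebuild the lists. A minimal sketch of that variant, reusing the same getKey and lead_keys from above:

# two straight passes instead of pop-by-index
matched = [r for r in Acquisition_list if getKey(r) in lead_keys]        # rows that exist in Leads_list
Acquisition_list = [r for r in Acquisition_list if getKey(r) not in lead_keys]  # rows to keep
Leads_list.extend(matched)

This trades a little extra memory for much simpler code, and each pass is still a single O(n) loop with O(1) set look-ups.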
The benefits are: you only iterate over Leads_list once, when creating your set of keys (I chose Leads_list for this because it's the larger list, so it saves you more time); and the look-up for each Acquisition_list row takes constant time, O(1), rather than O(n), where n is len(Leads_list).
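To see it end to end, here's a self-contained run against the toy rows from the question. Note the toy data only has five columns, so the key function here uses the first three instead of 0, 7, 9, and 10:

def get_key(row):
    return (row[0], row[1], row[2])  # first three columns only, per the example

A = [("Jack", "Smith", "New York", "USA", 100),
     ("Jim", "Doe", "Cleveland", "UK", 200),
     ("Frank", "Johnson", "Chicago", "USA", 300)]
B = [("Jack", "Smith", "New York", "United States", None),
     ("Jerry", "Smith", "Cleveland", "USA", None),
     ("Frank", "Johnson", "Chicago", "America", None)]

b_keys = {get_key(r) for r in B}                      # one pass over the bigger list
matched = [r for r in A if get_key(r) in b_keys]      # the Jack and Frank rows
desired = [r for r in A if get_key(r) not in b_keys]  # just the Jim row

print(matched)  # [('Jack', 'Smith', 'New York', 'USA', 100), ('Frank', 'Johnson', 'Chicago', 'USA', 300)]
print(desired)  # [('Jim', 'Doe', 'Cleveland', 'UK', 200)]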
In your original setup, you were doing, at worst, (n*m) or (300,000 * 4,000,000) operations, which is... a ton. With sets, you will only do (n+m) or (300,000 + 4,000,000), which is... way fewer. Like nearly 300,000 times fewer. That's the difference between 1.2 trillion things and 0.0000043 trillion (4.3 million) things.
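If you want to sanity-check that arithmetic yourself:

n, m = 300_000, 4_000_000
print(n * m)              # 1,200,000,000,000 comparisons with the nested loops
print(n + m)              # 4,300,000 operations with a set
print((n * m) // (n + m)) # 279069, i.e. roughly 300,000x fewer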