We have a file with about 46 million records in CSV format. Each record has about 18 fields, and one of them is a 64-byte ID. We have another file with about 167K unique IDs in it. The records corresponding to those IDs need to be yanked out. So we have written a Python program that reads the 167K IDs into an array and processes the 46-million-record file, checking whether the ID in each of those records exists in the array. Here is a snippet of the code:
import csv
...
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1
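(The snippet references names defined in the elided setup; a hypothetical reconstruction of that setup is shown below. The file paths, the one-ID-per-line format of the ID file, and the value of CUSTOMER_ID are assumptions for illustration, not taken from the original program.)

# Hypothetical setup -- paths, ID-file format, and CUSTOMER_ID index are placeholders.
CUSTOMER_ID = 0  # assumed position of the 64-byte ID within each record

# Read the ~167K IDs, assumed one per line, into a list.
with open('ids.txt') as idFile:
    idArray = [line.strip() for line in idFile]

lineCounts = {'orig': 0, 'mod': 0}
inputFile = open('records.csv', newline='')
outputFile = open('filtered.csv', 'w', newline='')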
We tested the program on a small set of data; here are the processing times:
lines: 117929 process time: 236.388447046 sec
lines: 145390 process time: 277.075321913 sec
We started running the program on the 46-million-record file (which is about 13 GB in size) last night around 3:00 AM EST; it is now around 10 AM EST and it is still processing!
Question: is there a better alternative to this check?
if fieldAry[CUSTOMER_ID] not in idArray:
Thank you!
Update: This is being processed on an EC2 instance with an EBS volume attached.
You should use a set instead of a list; before the for loop, do:
idArray = set(idArray)
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1
And see the unbelievable speed-up; you're spending days' worth of unnecessary processing time just because you chose the wrong data structure.
The in operator on a set has O(1) average time complexity, whereas on a list it is O(n). This might sound like "not a big deal", but it is actually the bottleneck in your script, even though a set has somewhat higher constants hidden in that O. Your code is spending something like 30,000 times longer on that single in operation than is necessary: if the optimal version would need 30 seconds, you are now spending 10 days on that single line alone.
See the following test: I generate 1 million IDs and set 10,000 of them aside into another list, to_remove. I then run a for loop like yours, doing the in check for each record:
import random
import timeit
all_ids = [random.randint(1, 2**63) for i in range(1000000)]
to_remove = all_ids[:10000]
random.shuffle(to_remove)
random.shuffle(all_ids)
def test_set():
    to_remove_set = set(to_remove)
    for i in all_ids:
        if i in to_remove_set:
            pass

def test_list():
    for i in all_ids:
        if i in to_remove:
            pass
print('starting')
print('testing list', timeit.timeit(test_list, number=1))
print('testing set', timeit.timeit(test_set, number=1))
And the results:
testing list 227.91903045598883
testing set 0.14897623099386692
The set version took 149 milliseconds; the list version required 228 seconds. Now, these were small numbers: in your case you have 50 million input records against my 1 million, so you need to multiply the "testing set" time by 50; with your data set it would take only about 7.5 seconds.
The list time, on the other hand, needs to be multiplied by 50 * 17 - not only are there 50 times more input records, but 17 times more IDs to match against. Thus we get 227 * 50 * 17 = 192,950 seconds.
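(For reference, here is that extrapolation spelled out as arithmetic; the factors 50 and 17 are the rounded ratios used above, not exact figures.)

# Measured on 1 million records vs. 10,000 IDs (numbers from the test above).
set_time_s = 0.149
list_time_s = 227.0

record_factor = 50   # ~50 million records vs. 1 million
id_factor = 17       # ~167K IDs vs. 10K

print('set version: ', set_time_s * record_factor, 'seconds')               # ~7.5 seconds
print('list version:', list_time_s * record_factor * id_factor, 'seconds')  # 192,950 s, ~2.2 days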
So your algorithm spends 2.2 days doing something that, with the correct data structure, can be done in 7.5 seconds. Of course this does not mean that you can scan the whole 13 GB file in 7.5 seconds, but the whole run probably takes no more than the ~2 days of reading and writing either. So we changed from roughly:
2 days                              2.2 days
|reading and writing the files||------- doing "id in list" -------|

to something like

2 days                              7.5 seconds
|reading and writing the files|| doing "id in set"
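Putting it all together, a minimal end-to-end sketch could look like the following. The file paths, the one-ID-per-line format of the ID file, and the CUSTOMER_ID index are assumptions; adapt them to your actual layout.

import csv

CUSTOMER_ID = 0  # assumed index of the ID field; adjust to the real record layout

# Load the ~167K IDs into a set for O(1) membership tests.
with open('ids.txt') as idFile:
    idSet = {line.strip() for line in idFile}

lineCounts = {'orig': 0, 'mod': 0}

with open('records.csv', newline='') as inputFile, \
     open('filtered.csv', 'w', newline='') as outputFile:
    csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
    csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
    for fieldAry in csvReadHandler:
        lineCounts['orig'] += 1
        if fieldAry[CUSTOMER_ID] not in idSet:
            csvWriteHandler.writerow(fieldAry)
            lineCounts['mod'] += 1

print(lineCounts)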