Tags: python, arrays, csv, lookup, latency

Efficiently processing a ~50 million record file in Python


We have a file with about 46 million records in CSV format. Each record has about 18 fields, and one of them is a 64-byte ID. We have another file with about 167K unique IDs in it. The records corresponding to those IDs need to be yanked out. So we have written a Python program that reads the 167K IDs into an array and processes the 46-million-record file, checking whether the ID in each of those records exists in the array. Here is a snippet of the code:

import csv
...
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1

We tested the program on a small set of data; here are the processing times:

lines: 117929 process time: 236.388447046 sec
lines: 145390 process time: 277.075321913 sec

We started running the program on the 46-million-record file (which is about 13 GB in size) last night around 3:00 AM EST; it is now around 10 AM EST and it is still processing!
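
A rough back-of-the-envelope extrapolation from the timings above (assuming the per-record cost stays roughly constant) suggests the full run could take on the order of a day:

seconds_per_line = 236.388 / 117929            # roughly 2 ms per record in the small test
estimated_total = seconds_per_line * 46000000  # scale up to the full file
print(estimated_total / 3600)                  # roughly 26 hours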

Questions:

  1. Is there a better way to process these records to improve on processing times?
  2. Is python the right choice? Would awk or some other tool help?
  3. I am guessing that the 64-byte ID look-up against the 167K-element array in the following statement is the culprit:
    if fieldAry[CUSTOMER_ID] not in idArray:

Is there a better alternative?

Thank you!

Update: This is being processed on an EC2 instance with an attached EBS volume.


Solution

  • You must use a set instead of a list; before the for loop, do:

    idArray = set(idArray)   # convert the ID list to a set for O(1) membership tests
    
    csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
    csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
    for fieldAry in csvReadHandler:
        lineCounts['orig'] += 1
        if fieldAry[CUSTOMER_ID] not in idArray:
            csvWriteHandler.writerow(fieldAry)
            lineCounts['mod'] += 1
    

    And see the unbelievable speed-up; you're spending days' worth of unnecessary processing time just because you chose the wrong data structure.
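
    Incidentally, here is a minimal sketch of building the lookup set directly while reading the ID file, instead of converting a list afterwards (the loading step is elided with ... in the question's snippet; the file name ids.txt and the one-ID-per-line layout are assumptions):

    # Build the 167K-ID lookup set up front. The file name and the
    # one-ID-per-line format are assumptions -- adapt to the real layout.
    with open('ids.txt') as idFile:
        idArray = set(line.strip() for line in idFile)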


    The in operator has O(1) average time complexity with a set, whereas with a list it is O(n). This might sound like "not a big deal", but it is actually the bottleneck in your script, even though a set has somewhat higher constant factors hidden in that O. Your code is spending roughly 30,000 times more time on that single in operation than is necessary: if the optimal version needed 30 seconds, you now spend 10 days on that single line alone.

    See the following test: I generate 1 million IDs and set 10,000 of them aside in another list, to_remove. I then run a for loop like yours, doing the in operation for each record:

    import random
    import timeit

    # 1 million random IDs; the first 10,000 form the removal list
    all_ids = [random.randint(1, 2**63) for i in range(1000000)]
    to_remove = all_ids[:10000]
    random.shuffle(to_remove)
    random.shuffle(all_ids)


    def test_set():
        # membership test against a set: O(1) average per lookup
        to_remove_set = set(to_remove)
        for i in all_ids:
            if i in to_remove_set:
                pass

    def test_list():
        # membership test against a list: O(n) per lookup
        for i in all_ids:
            if i in to_remove:
                pass


    print('starting')
    print('testing list', timeit.timeit(test_list, number=1))
    print('testing set', timeit.timeit(test_set, number=1))
    

    And the results:

    testing list 227.91903045598883
    testing set 0.14897623099386692
    

    It took 149 milliseconds for the set version; the list version required 228 seconds. Now, these were small numbers: in your case you have 50 million input records against my 1 million, so you need to multiply the set-version time by 50; with your data set it would take about 7.5 seconds.

    For the list version, on the other hand, you need to multiply that time by 50 * 17 - not only are there 50 times more input records, but 17 times more IDs to match against. Thus we get 227 * 50 * 17 = 192,950 seconds.
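
    Spelling out the same back-of-the-envelope arithmetic (pure scaling of the measured times; file I/O is not accounted for):

    set_time = 0.149 * 50        # ~7.5 seconds: 50x more input records
    list_time = 227 * 50 * 17    # 192,950 seconds: 50x more records, 17x more IDs to match
    print(set_time)              # 7.45
    print(list_time / 86400.0)   # ~2.23 days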

    So your algorithm spends 2.2 days doing something that, with the correct data structure, can be done in about 7.5 seconds. Of course this does not mean you can scan the whole 13 GB file in 7.5 seconds, but it probably isn't more than 2.2 days either. So we changed from:

                 2 days                           2.2 days 
     |reading and writing the files||------- doing id in list ------|
    

    to something like

                 2 days            7.5 seconds (doing id in set)
     |reading and writing the files||