
Code is running slowly: performance issue in Python


I have a file with four comma-separated columns. I need only the first column, so I read the file, split each line on commas, and store the first field in a list called first_file_list.

I have another file with six comma-separated columns. For each row of this second file, I need to read the first column and check whether that string exists in first_file_list. If it does, I copy that line to a new file.

My first file has approx. 6 million records and the second has approx. 4.5 million. To check the performance of my code, I put only 100k records in the second file instead of 4.5 million, and processing those 100k records takes approx. 2.5 hours.

Following is my logic for this:

first_file_list = []

# Raw strings keep "\f" from being read as a form-feed escape
with open(r"c:\first_file.csv") as first_f:
    next(first_f)  # Ignoring first row as it is header and I don't need that
    temp = first_f.readlines()
    for x in temp:
        first_file_list.append(x.split(',')[0])
first_f.close()

with open(r"c:\second_file.csv") as second_f:
    next(second_f)
    second_file_co = second_f.readlines()
second_f.close()

out_file = open(r"c:\output_file.csv", "a")
for x in second_file_co:
    if x.split(',')[0] in first_file_list:
        out_file.write(x)
out_file.close()

Can you please help me understand what I am doing wrong that makes my code take this long to compare 100k records? Or can you suggest a better way to do this in Python?


Solution

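  • Use a set for fast membership checking. Checking membership in a list scans it element by element, so each lookup from the second file may compare against up to 6 million entries, while a set lookup takes roughly constant time on average. Also, there's no need to copy the contents of the entire file to memory. You can just iterate over the remaining contents of the file.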

    first_entries = set()
    # Raw strings keep "\f" from being read as a form-feed escape
    with open(r"c:\first_file.csv") as first_f:
        next(first_f)  # skip the header row
        for line in first_f:
            first_entries.add(line.split(',')[0])
    
    with open(r"c:\second_file.csv") as second_f:
        with open(r"c:\output_file.csv", "a") as out_file:
            next(second_f)  # skip the header row
            for line in second_f:
                if line.split(',')[0] in first_entries:
                    out_file.write(line)
    

    Additionally, I noticed you called .close() on file objects that were opened with the with statement. A with block (context manager) does all the cleanup when you exit it, so it handles the .close() for you and the explicit calls are redundant.
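
To try the set-based approach without touching the real files, here is a minimal sketch that applies the same logic to small in-memory data. The function name copy_matching_lines and the sample rows are made up for illustration, and it assumes, as the code above does, that both inputs start with a header row and that fields never contain quoted commas.

```python
import io


def copy_matching_lines(first_lines, second_lines, out):
    """Copy lines of second_lines whose first column also appears in first_lines.

    Hypothetical helper: both inputs are iterables of text lines that begin
    with a header row; out is any object with a write() method.
    Returns the number of lines written.
    """
    first_iter = iter(first_lines)
    next(first_iter, None)  # skip the header row, if any
    # Build the set once; later lookups take roughly constant time on average.
    first_entries = {line.split(',', 1)[0] for line in first_iter}

    second_iter = iter(second_lines)
    next(second_iter, None)  # skip the header row, if any
    written = 0
    for line in second_iter:
        if line.split(',', 1)[0] in first_entries:
            out.write(line)
            written += 1
    return written


# Tiny in-memory stand-ins for the two CSV files (illustrative data only).
first = io.StringIO("id,a,b,c\nk1,1,2,3\nk2,4,5,6\n")
second = io.StringIO("id,p,q,r,s,t\nk2,a,b,c,d,e\nk9,a,b,c,d,e\nk1,a,b,c,d,e\n")
result = io.StringIO()

count = copy_matching_lines(first, second, result)
print(count)
print(result.getvalue(), end="")
```

Open file objects can be passed in place of the StringIO objects, since the function only iterates over lines and calls write(). If the data may contain quoted fields with embedded commas, the standard csv module would likely be a safer way to extract the first column than split(',').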