python memory compare extract difference

Compare strings from a very large text file (over 100 GB) with a small text file (about 30 lines) and print all the strings contained in both files


I have two text files. One contains a very long list of strings (100 GB), the other contains about 30 strings. I need to find which lines in the second file are also in the first file and write them to a third text file. Manually searching for each line is a pain, so I wanted to write a script to do it automatically. I chose Python because it is the only language I know even a little.

Essentially I tried copying this answer since I'm too inexperienced to write my own code: Compare 2 files in Python and extract differences as a strings

smallfile = 'smalllist.txt'
bigfile = 'biglist.txt'



def file_2_list(file):
    with open(file) as file:
        lines = file.readlines()
        lines = [line.rstrip() for line in lines]
        return lines


def diff_lists(lst1, lst2):
    differences = []
    both = []
    for element in lst1:
        if element not in lst2:
            differences.append(element)
        else:
            both.append(element)
    return(differences, both)


listbig = file_2_list(bigfile)
listsmall = file_2_list(smallfile)

diff, both = diff_lists(listbig, listsmall)

print(both)

I wanted it to print the lines that are in both lists. However, it gave me a "memory error". I'm already using a 64-bit version of Python, so shouldn't the memory limit be a non-issue? (I have 16 GB of RAM.)

So how can you avoid this “memory error”? Or maybe there is a better way to accomplish this task?


Solution

  • The file.readlines method reads the entire file into memory as a list of strings. A 100 GB file cannot fit in 16 GB of RAM, which is why you get a memory error even on 64-bit Python.

    Instead, read the lines of the smaller file into a set, then iterate over the file object of the larger file, which yields one line at a time, and keep the lines that are in the set:

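    def common_lines(small_file, big_file):
        # Strip line endings so a last line without a trailing newline still matches.
        small_lines = {line.rstrip('\n') for line in small_file}
        # Iterating over the file object reads one line at a time, not the whole file.
        stripped = (line.rstrip('\n') for line in big_file)
        return [line for line in stripped if line in small_lines]
    
    with open(smallfile) as file1, open(bigfile) as file2:
        both = common_lines(file1, file2)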
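
Since the question also asks for the result to be written to a third file, here is a minimal sketch of that variant. The function name write_common_lines, the output path argument and the UTF-8 encoding are assumptions for illustration, not part of the original answer. It reports each matching line once, in the order of the small file, and stops reading the big file as soon as every line from the small file has been found.

```python
def write_common_lines(small_path, big_path, out_path):
    """Write the lines of small_path that also occur in big_path to out_path."""
    # The small file (about 30 lines) fits in memory without trouble.
    # Assumption: both files are UTF-8 text.
    with open(small_path, encoding="utf-8") as small:
        small_lines = [line.rstrip("\n") for line in small]
    wanted = set(small_lines)
    wanted.discard("")  # ignore blank lines

    found = set()
    # Iterating over the file object reads one line at a time.
    # errors="replace" keeps a stray undecodable byte from aborting a long run.
    with open(big_path, encoding="utf-8", errors="replace") as big:
        for line in big:
            line = line.rstrip("\n")
            if line in wanted:
                found.add(line)
                if len(found) == len(wanted):
                    break  # every wanted line has been seen

    # Keep the order of the small file, one entry per distinct line.
    common = [line for line in dict.fromkeys(small_lines) if line in found]
    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(line + "\n" for line in common)
    return common
```

With the variables from the question, this would be called as write_common_lines(smallfile, bigfile, 'common.txt'), where 'common.txt' is a placeholder output name. Memory use stays roughly constant because only the small file and the current line of the big file are held at any time, although a full pass over 100 GB will still take a while if some lines are never found.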