I have two text files. One contains a very long list of strings (100 GB), the other contains about 30 strings. I need to find which lines in the second file are also in the first file and write them to a third text file. Manually searching for each line is a pain, so I wanted to write a script to do it automatically. For this I chose Python, because it is the only language I know even a little.
Essentially I tried copying this answer since I'm too inexperienced to write my own code: Compare 2 files in Python and extract differences as a strings
smallfile = 'smalllist.txt'
bigfile = 'biglist.txt'

def file_2_list(file):
    with open(file) as file:
        lines = file.readlines()
        lines = [line.rstrip() for line in lines]
        return lines

def diff_lists(lst1, lst2):
    differences = []
    both = []
    for element in lst1:
        if element not in lst2:
            differences.append(element)
        else:
            both.append(element)
    return differences, both

listbig = file_2_list(bigfile)
listsmall = file_2_list(smallfile)
diff, both = diff_lists(listbig, listsmall)
print(both)
I wanted it to print the lines that are in both lists. However, it gave me a MemoryError. But I'm already using a 64-bit version of Python, so shouldn't the memory limit be a non-issue? (I have 16 GB RAM.)
So how can I avoid this MemoryError? Or is there a better way to accomplish this task?
The file.readlines method reads the entire file into memory, which you should avoid when the file is that large.
You can instead read the lines of the smaller file into a set, and then iterate over the lines of the larger file to find the common lines by testing if a line is in the set:
def common_lines(small_file, big_file):
    # Strip trailing newlines so the last line of either file
    # (which may not end with '\n') still compares equal
    small_lines = {line.rstrip('\n') for line in small_file}
    return [line for line in big_file if line.rstrip('\n') in small_lines]

with open(smallfile) as file1, open(bigfile) as file2:
    both = common_lines(file1, file2)
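Since the question also asks to write the matches to a third file, here is a sketch of the same idea as a streaming pass: iterating over the file object yields one line at a time, so memory use stays constant no matter how large the big file is. The function name and file paths below are illustrative, not part of the original code:

```python
def write_common_lines(small_path, big_path, out_path):
    """Write every line of big_path that also occurs in small_path to out_path."""
    with open(small_path) as f:
        # ~30 lines: trivially fits in memory as a set, giving O(1) lookups
        small_lines = {line.rstrip('\n') for line in f}
    with open(big_path) as big, open(out_path, 'w') as out:
        # Iterating the file object reads one line at a time,
        # so the 100 GB file is never loaded into memory at once
        for line in big:
            if line.rstrip('\n') in small_lines:
                out.write(line)
```

This keeps only the small file's contents in memory; the big file is read and the output file written line by line.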