This small script reads a file, tries to match each line with a regex, and appends matching lines to another file:
regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")
with open("dbtropes-v2.nt", "a") as output, open("dbtropes.nt", "rb") as input:
    for line in input.readlines():
        if re.findall(regex,line):
            output.write(line)
input.close()
output.close()
However, the script abruptly stops after about 5 minutes. The terminal says "Process stopped", and the output file stays blank.
The input file can be downloaded here: http://dbtropes.org/static/dbtropes.zip It's a 4.3 GB N-Triples file.
Is there something wrong with my code? Is it something else? Any hint would be appreciated on this one!
It stopped because it ran out of memory. input.readlines() reads the entire file into memory before returning a list of the lines.
Instead, use input as an iterator. This only reads a few lines at a time, and returns them immediately.
Don't do this:
for line in input.readlines():
Do do this:
for line in input:
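To see why the second form is cheaper, here is a small sketch showing that a file object is its own iterator and hands back one line per step instead of building a list first. The sample file and its contents are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical three-line sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".nt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    path = tmp.name

with open(path) as f:
    # A file object is its own iterator: iter(f) returns f itself.
    file_is_own_iterator = iter(f) is f
    # Each step yields the next line; no list of all lines is ever built.
    first_line = next(f)
    # Iteration continues from where the previous step stopped.
    remaining = [line for line in f]

os.remove(path)
```

By contrast, readlines() has to finish reading the whole file before the loop body runs even once.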
Taking everyone's advice into account, your program becomes:
import re

regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")
with open("dbtropes.nt", "rb") as input:
    with open("dbtropes-v2.nt", "a") as output:
        for line in input:
            if regex.search(line):
                output.write(line)
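One caveat: the program above reads in binary mode ("rb") but writes in text mode ("a") and uses a str pattern. That combination appears to target Python 2; on Python 3 a str pattern cannot be applied to bytes and raises TypeError. Below is a minimal sketch of a Python 3 variant that keeps everything as bytes. The function name filter_lines and the constant FILM_TYPE are hypothetical names introduced here, not part of the original script.

```python
import re

# Same pattern as in the question, written as a bytes pattern because
# both files are opened in binary mode below.
FILM_TYPE = re.compile(
    rb"<http://dbtropes.org/resource/Film/.*?> "
    rb"<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    rb"<http://dbtropes.org/resource/Main/.*?> \."
)


def filter_lines(src_path, dst_path, pattern=FILM_TYPE):
    """Append each line of src_path that matches pattern to dst_path.

    Returns the number of lines written. (Hypothetical helper.)
    """
    matched = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        # Iterating the file object reads one line at a time,
        # so memory use stays small even for a multi-gigabyte input.
        for line in src:
            if pattern.search(line):
                dst.write(line)
                matched += 1
    return matched
```

Usage would look like filter_lines("dbtropes.nt", "dbtropes-v2.nt"). Since the output is opened in append mode, running it twice writes the matching lines twice.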