python n-triples

Why does this Python script that writes to a file stop abruptly?


This small script reads a file, tries to match each line with a regex, and appends matching lines to another file:

regex = re.compile(r"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")

with open("dbtropes-v2.nt", "a") as output, open("dbtropes.nt", "rb") as input:
    for line in input.readlines():
        if re.findall(regex,line):
            output.write(line)

input.close()
output.close()

However, the script abruptly stops after about 5 minutes. The terminal says "Process stopped", and the output file stays blank.

The input file can be downloaded here: http://dbtropes.org/static/dbtropes.zip. It's a 4.3 GB N-Triples file.

Is there something wrong with my code? Is it something else? Any hint would be appreciated on this one!


Solution

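  • It stopped because it ran out of memory. input.readlines() reads the entire file into memory and builds a list of every line before the loop even starts, which is not practical for a 4.3 GB file.

    Instead, iterate over the file object itself. It reads the file in buffered chunks and yields one line at a time, so memory use stays small no matter how large the file is.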

    Don't do this:

    for line in input.readlines():
    

    Do this instead:

    for line in input:
    

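    Taking everyone's advice into account (calling search() on the compiled regex, and dropping the explicit close() calls, which the with statement makes redundant), your program becomes: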

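    import re

    # Bytes pattern: both files are opened in binary mode, so each line is bytes
    # (a str pattern would raise TypeError on Python 3).
    regex = re.compile(br"<http://dbtropes.org/resource/Film/.*?> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbtropes.org/resource/Main/.*?> \.")

    with open("dbtropes.nt", "rb") as input:
        with open("dbtropes-v2.nt", "ab") as output:
            # Iterate over the file object so only one line is held in memory at a time.
            for line in input:
                if regex.search(line):
                    output.write(line)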
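
If you want to reuse or test the filtering step, the same streaming loop can be wrapped in a function. This is only a minimal sketch: the names copy_matching_lines and TYPE_TRIPLE are made up for illustration, and it assumes both files are opened in binary mode with a bytes pattern so it behaves the same on Python 2.7 and Python 3.

```python
import re


def copy_matching_lines(src_path, dst_path, pattern):
    """Append the lines of src_path that match pattern to dst_path.

    Returns the number of lines written. (Hypothetical helper, not part
    of the original script.)
    """
    matched = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never loaded into memory.
        for line in src:
            if pattern.search(line):
                dst.write(line)
                matched += 1
    return matched


# Bytes pattern, because the files are read and written in binary mode.
TYPE_TRIPLE = re.compile(
    br"<http://dbtropes.org/resource/Film/.*?> "
    br"<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    br"<http://dbtropes.org/resource/Main/.*?> \."
)
```

With the file names from the question, the call would be copy_matching_lines("dbtropes.nt", "dbtropes-v2.nt", TYPE_TRIPLE). Returning the count gives a quick sanity check that the regex matched anything at all, which helps when the output file unexpectedly stays empty.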