[SOLVED] Parsing large NTriples File Python

I am trying to parse a rather large N-Triples file using the code from "Parse large RDF in Python".

I installed Raptor and the Redland bindings for Python.

import RDF

# Valid parser names include "ntriples", "turtle", "rdfxml", ...
parser = RDF.Parser(name="ntriples")
model = RDF.Model()
parser.parse_into_model(model, "file:./mybigfile.nt")
for triple in model:
    print("%s %s %s" % (triple.subject, triple.predicate, triple.object))

However, the program hangs. It does not start printing right away, so I suspect it is trying to load the entire file into memory.

Does anybody know how to resolve this?

Solution

It's slow because you are parsing into the default in-memory store (what RDF.Model() gives you with no arguments), which has no indexing, so it gets slower and slower as more triples are added. The N-Triples parser itself streams from the file; it never pulls the whole file into memory.
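To make the "slower and slower" effect concrete, here is a toy model in plain Python. It is not Redland's actual implementation: it only assumes that an unindexed store has to scan what it already holds on each insert, while a hash-indexed one does a hash lookup. ListStore, HashStore and time_load are hypothetical names made up for this illustration.

```python
import time


class ListStore(object):
    """Toy unindexed store: every insert scans all existing triples."""

    def __init__(self):
        self.triples = []

    def add(self, triple):
        if triple not in self.triples:  # linear scan, O(n) per insert
            self.triples.append(triple)


class HashStore(object):
    """Toy hash-indexed store: the duplicate check is a hash lookup."""

    def __init__(self):
        self.triples = set()

    def add(self, triple):
        self.triples.add(triple)  # roughly O(1) per insert


def time_load(store, n):
    """Insert n distinct triples and return the elapsed seconds."""
    start = time.time()
    for i in range(n):
        store.add(("s%d" % i, "p", "o%d" % i))
    return time.time() - start


if __name__ == "__main__":
    # Doubling n should roughly quadruple the list time (assumption:
    # the scan dominates), while the hash time grows about linearly.
    for n in (1000, 2000, 4000):
        print("n=%d list=%.3fs hash=%.3fs" % (
            n, time_load(ListStore(), n), time_load(HashStore(), n)))
```

Under that assumption the total load time for the list-backed store grows roughly quadratically with the number of triples, which is the kind of slowdown that looks like a hang on a big file.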

See the Redland storage modules documentation for an overview of the storage models. Here you probably want storage type 'hashes' with hash-type 'memory'.

s = RDF.HashStorage("abc", options="hash-type='memory'")
model = RDF.Model(s)

(not tested)
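Put together with the loop from the question, the whole thing might look roughly like the sketch below. Like the snippet above, it has not been run against a real Redland install. The function names load_with_hash_storage and print_triples are made up for this example; the storage name "abc" and the file URI are just the placeholders already used in this thread.

```python
def load_with_hash_storage(uri, storage_name="abc"):
    """Parse an N-Triples file into a hash-indexed in-memory model."""
    # Imported here so that only this function needs the Redland bindings.
    import RDF

    # Storage type 'hashes' with hash-type 'memory', as suggested above.
    storage = RDF.HashStorage(storage_name, options="hash-type='memory'")
    model = RDF.Model(storage)

    parser = RDF.Parser(name="ntriples")
    parser.parse_into_model(model, uri)
    return model


def print_triples(model):
    """Print every statement in the model, one per line."""
    for triple in model:
        print("%s %s %s" % (triple.subject, triple.predicate, triple.object))
```

Usage would be something like print_triples(load_with_hash_storage("file:./mybigfile.nt")). Nothing is printed until parsing has finished, so some delay before the first line is still expected.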