pythonrdfigraphredland

Loading very large RDF triples into iGraph -> Fast vertices lookup?


I need to load a subset of the DBPedia graph into iGraph in order to compute some graph statistics (such as node centrality, ...). I load DBPedia triples using the Redlands libRDF python library. Each node is associated with an URI (unique identifier).

I have some trouble loading the graph into iGraph. This is what I do:

1) Read a triple line (subject, predicate, object)

2) Use the following algorithm to get or create a vertex (with attribute)

def add_or_find_vertex (self, g, uri):
    try:
        return g.vs.find(name=uri)
    except (KeyError, ValueError):
        g.add_vertex(name=uri)
        return g.vs.find(name=uri)

subjVertex = self.add_or_find_vertex(self.g, subject)
objVertex = self.add_or_find_vertex(self.g, object)
self.g.add_edge(subjVertex, objVertex, uri=predicate)

The problem is that my script is very slow and I need to load 25M triples. Each node is unique but is found several time in the triple file. I thus need to perform a lookup before creating the edge. Can you tell me if the "find" method is using an index for lookups (Hashtable, ...)? What is the complexity of vertex lookups ? How would you do ?

Thank you very much


Solution

  • Already answered here. For sake of completeness, I'm copying my answer here as well:

    Vertex lookups are usually O(|V|) since vertex attributes are not indexed by default - except the name vertex attribute, which is indexed. However g.vs.find is using this index only if you do this: g.vs.find(url) but not if you do this: g.vs.find(name=url). This is sort of a bug as the index could be used in both cases. Also see yesterday's thread from the mailing list.

    However, note that igraph's data structures are optimized for static graphs, so g.add_vertex (and I presume you also use g.add_edge) could also be a bottleneck. Internally, igraph uses an indexed edge list to store the graph and the index has to be re-built every time you mutate the graph, so it is much more efficient to do vertex and edge additions in batches where possible.

    Since you already seem to have an iterator that yields the edges of your graph in (subject, predicate, object) form, maybe it's easier to use Graph.DictList to construct the graph at once because it also takes care of storing the vertex IDs in the name attribute, adding edges in batches where it makes sense, and also adding the predicate attribute from your triplets:

    >>> g = Graph.DictList(vertices=None, edges=({"source": subject,
    ...         "target": object, "predicate": predicate}
    ...         for subject, predicate, object in your_iterator))
    

    Graph.DictList processes 100000 pre-generated random triplets in 1.63 seconds on my machine so I guess that improves things a little bit.