I need to load a subset of the DBPedia graph into iGraph in order to compute some graph statistics (such as node centrality, ...). I load DBPedia triples using the Redlands libRDF python library. Each node is associated with an URI (unique identifier).
I have some trouble loading the graph into iGraph. This is what I do:
1) Read a triple line (subject, predicate, object)
2) Use the following algorithm to get or create a vertex (with attribute)
def add_or_find_vertex (self, g, uri):
try:
return g.vs.find(name=uri)
except (KeyError, ValueError):
g.add_vertex(name=uri)
return g.vs.find(name=uri)
subjVertex = self.add_or_find_vertex(self.g, subject)
objVertex = self.add_or_find_vertex(self.g, object)
self.g.add_edge(subjVertex, objVertex, uri=predicate)
The problem is that my script is very slow and I need to load 25M triples. Each node is unique but is found several time in the triple file. I thus need to perform a lookup before creating the edge. Can you tell me if the "find" method is using an index for lookups (Hashtable, ...)? What is the complexity of vertex lookups ? How would you do ?
Thank you very much
Already answered here. For sake of completeness, I'm copying my answer here as well:
Vertex lookups are usually O(|V|) since vertex attributes are not indexed by default - except the
name
vertex attribute, which is indexed. Howeverg.vs.find
is using this index only if you do this:g.vs.find(url)
but not if you do this:g.vs.find(name=url)
. This is sort of a bug as the index could be used in both cases. Also see yesterday's thread from the mailing list.However, note that igraph's data structures are optimized for static graphs, so
g.add_vertex
(and I presume you also useg.add_edge
) could also be a bottleneck. Internally, igraph uses an indexed edge list to store the graph and the index has to be re-built every time you mutate the graph, so it is much more efficient to do vertex and edge additions in batches where possible.Since you already seem to have an iterator that yields the edges of your graph in
(subject, predicate, object)
form, maybe it's easier to useGraph.DictList
to construct the graph at once because it also takes care of storing the vertex IDs in thename
attribute, adding edges in batches where it makes sense, and also adding thepredicate
attribute from your triplets:>>> g = Graph.DictList(vertices=None, edges=({"source": subject, ... "target": object, "predicate": predicate} ... for subject, predicate, object in your_iterator))
Graph.DictList
processes 100000 pre-generated random triplets in 1.63 seconds on my machine so I guess that improves things a little bit.