I have a graph g
with 600k
vertices and 950k
edges. After some processing, I need to clean up about 350k+
vertices with this query:
g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();
Even though I'm excluding vertices with no "depend" edges, they are still connected with other edges.
Using Java, tinkerpop/tinkergraph 3.4.6.
Currently it is taking about 45 minutes to drop all those vertices.
I did a java profiling and the results shows 73% of time spent in the TinkerVertex.remove
method, and the rest in the ExpandableStepIterator.next
Is there something like "bulk drop"? Would a JanusGraph or other graph provider be much faster?
It's unlikely that there are graphs that are much faster than TinkerGraph as TinkerGraph is a pure in-memory implementation. You might find one that is more efficient using that memory like OverflowDB which originally forked from TinkerGraph but I don't know that it will make this particular query go faster.
TinkerGraph, nor any graph I know of, has a filtered "bulk drop" operation.
The "not" style global query here is simple expensive as you're having to touch the a large portion of the graph. Of course, I'm a bit surprised that TinkerGraph is taking that long for a graph with less than one million edges. You didn't mention if you were experiencing a lot of GC as you were doing your profile. Perhaps that is an issue? If so, I would try to adjust your JVM memory configurations - maybe you just need a larger -Xmx
value or something simple like that.
From a query perspective, you could try to invert not()
portion of the traversal to positively find the things you want to remove. It might lead to a less concise query to read but could perhaps speed things up, but on the other hand you are still trying to delete 50% of your data so the cost may not be just in finding the vertices to get rid of.
One other thought would be try to parallelize the drop()
. You might hit concurrency errors so you could need a retry strategy but you could consider taking the Iterator
of g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND))
and then delegating calls to each (or batches of) Vertex.remove()
to a separate worker thread.