gremlin tinkerpop tinkerpop3 tinkergraph

Improve performance removing TinkerGraph vertices

I have a graph g with 600k vertices and 950k edges. After some processing, I need to clean up about 350k+ vertices with this query:

g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)).drop().iterate();

Even though I'm excluding vertices with no "depend" edges, they are still connected with other edges.

Using Java, tinkerpop/tinkergraph 3.4.6.

Currently it is taking about 45 minutes to drop all those vertices.

I did a java profiling and the results shows 73% of time spent in the TinkerVertex.remove method, and the rest in the ExpandableStepIterator.next

Is there something like "bulk drop"? Would a JanusGraph or other graph provider be much faster?

Solution

It's unlikely that there are graphs that are much faster than TinkerGraph as TinkerGraph is a pure in-memory implementation. You might find one that is more efficient using that memory like OverflowDB which originally forked from TinkerGraph but I don't know that it will make this particular query go faster.

TinkerGraph, nor any graph I know of, has a filtered "bulk drop" operation.

The "not" style global query here is simple expensive as you're having to touch the a large portion of the graph. Of course, I'm a bit surprised that TinkerGraph is taking that long for a graph with less than one million edges. You didn't mention if you were experiencing a lot of GC as you were doing your profile. Perhaps that is an issue? If so, I would try to adjust your JVM memory configurations - maybe you just need a larger -Xmx value or something simple like that.

From a query perspective, you could try to invert not() portion of the traversal to positively find the things you want to remove. It might lead to a less concise query to read but could perhaps speed things up, but on the other hand you are still trying to delete 50% of your data so the cost may not be just in finding the vertices to get rid of.

One other thought would be try to parallelize the drop(). You might hit concurrency errors so you could need a retry strategy but you could consider taking the Iterator of g.V().hasLabel(LABEL_INTERMEDIATE_COLUMN).not(inE(EDGE_DEPEND)) and then delegating calls to each (or batches of) Vertex.remove() to a separate worker thread.