Tags: gremlin, orientdb, tinkerpop3, gremlin-server, orientdb3.0

Why Is addVertex() Faster Than addV()?


Problem

Is there something I'm missing on how to run with g.addV()?
Is there some way to speed up g.addV()?
If I'm asking this wrong or something else is missing, please let me know so I can correct myself.

I'm testing graph databases (GraphDBs), and these are the evaluation requirements and metrics under consideration:

  1. Standard format: an "apples-to-apples" comparison
    1. Running them all through Gremlin Server's g.addV() would compare the databases themselves rather than differences in client code
  2. If a database-specific implementation is faster, then I should be using that instead
    1. For example, with OrientDB, addVertex() is about 40-50% faster at handling the data
      1. addVertex() has taken between 15 and 30 minutes
      2. addV() has taken between 25 and 40 minutes

org.apache.tinkerpop.gremlin.orientdb.OrientGraph

        OrientGraph orientGraph = OrientGraph.open(configuration);
        // for-loop
            Vertex vertex = orientGraph.addVertex(_labels);

org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource

        GraphTraversalSource g = orientGraph.traversal();
        Transaction tx = g.tx();   // g must exist before its transaction is requested
        tx.open();
        // for-loop
            Vertex vertex = g.addV(_labels).next();
        tx.commit();

After all, TinkerPop 3 Gremlin's GraphTraversalSource and its addV() use a Transaction tx, just like OrientDB's OrientGraph and its addVertex().
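
To keep the two loops comparable, it can help to time them with the same harness. The following is only a sketch in plain Java: InsertBenchmark, timeNanos and medianNanos are hypothetical names, and the workloads in main are placeholders that would be replaced by loops calling orientGraph.addVertex(...) and g.addV(...).next() against a freshly opened graph.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;
import java.util.function.Supplier;

/** Hypothetical timing harness; none of these names come from TinkerPop or OrientDB. */
final class InsertBenchmark {

    /** Calls insert for i = 0..count-1 and returns the elapsed time in nanoseconds. */
    static long timeNanos(int count, IntConsumer insert) {
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            insert.accept(i);
        }
        return System.nanoTime() - start;
    }

    /**
     * Discards the warm-up runs, then returns the median of the timed runs
     * (the upper middle value when the number of runs is even).
     * freshInserter is asked for a new inserter before every run, so each run
     * can start from the same state, for example an empty database.
     */
    static long medianNanos(int warmups, int runs, int count,
                            Supplier<IntConsumer> freshInserter) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmups; i++) {
            timeNanos(count, freshInserter.get());
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            samples[i] = timeNanos(count, freshInserter.get());
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload: replace the lambda body with the real insert call.
        Supplier<IntConsumer> placeholder = () -> {
            AtomicInteger sink = new AtomicInteger();
            return i -> sink.addAndGet(i);
        };
        long nanos = medianNanos(2, 5, 100_000, placeholder);
        System.out.printf("median: %.3f ms%n", nanos / 1_000_000.0);
    }
}
```

Running both insert styles through the same medianNanos call, with the same count and the same starting state, removes some of the noise that a single 15-40 minute run can hide.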

public class GraphTraversalSource implements TraversalSource {

    // ...

    public GraphTraversal<Vertex, Vertex> addV(final String vertexLabel) {
        if (null == vertexLabel) {
            throw new IllegalArgumentException("vertexLabel cannot be null");
        } else {
            GraphTraversalSource clone = this.clone();
            clone.bytecode.addStep("addV", new Object[]{vertexLabel});
            GraphTraversal.Admin<Vertex, Vertex> traversal = new DefaultGraphTraversal(clone);
            return traversal.addStep(new AddVertexStartStep(traversal, vertexLabel));
        }
    }

    // ...

    public Transaction tx() {
        if (null == this.connection) {
            return this.graph.tx();
        } else {
            Transaction tx = this.connection.tx();
            return tx == Transaction.NO_OP && this.connection instanceof Transaction ? (Transaction)this.connection : tx;
        }
    }

    // ...

}
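
The addV() listing above copies the source, records the call in bytecode, builds a traversal object and attaches a step, all before anything is written; the write itself only happens on next(). The toy model below mimics that shape in plain Java so the extra per-call work can be seen next to a direct addVertex() call. It is not TinkerPop code: ToyGraph, ToySource and ToyTraversal are hypothetical stand-ins, and whether this setup cost explains the measured gap is an assumption. Real traversals also do work that the toy leaves out, such as applying traversal strategies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Toy model, not TinkerPop code: all class and method names here are hypothetical. */
final class AddVertexPaths {

    /** Stand-in for a graph: a vertex is just its label. */
    static final class ToyGraph {
        final List<String> vertices = new ArrayList<>();

        /** Direct path: one call, one write. */
        String addVertex(String label) {
            vertices.add(label);
            return label;
        }
    }

    /** Stand-in for a traversal: a list of steps that run when next() is called. */
    static final class ToyTraversal {
        final List<Supplier<String>> steps = new ArrayList<>();

        String next() {
            String result = null;
            for (Supplier<String> step : steps) {
                result = step.get();
            }
            return result;
        }
    }

    /** Stand-in for a traversal source, following the shape of the addV() listing. */
    static final class ToySource {
        final ToyGraph graph;
        final List<String> bytecode;

        ToySource(ToyGraph graph, List<String> bytecode) {
            this.graph = graph;
            this.bytecode = bytecode;
        }

        ToyTraversal addV(String label) {
            if (label == null) {
                throw new IllegalArgumentException("vertexLabel cannot be null");
            }
            // Copy the source and record the call on the copy.
            ToySource clone = new ToySource(graph, new ArrayList<>(bytecode));
            clone.bytecode.add("addV(" + label + ")");
            // Build a traversal whose single step performs the write later, on next().
            ToyTraversal traversal = new ToyTraversal();
            traversal.steps.add(() -> clone.graph.addVertex(label));
            return traversal;
        }
    }

    public static void main(String[] args) {
        ToyGraph direct = new ToyGraph();
        ToyGraph viaTraversal = new ToyGraph();
        ToySource g = new ToySource(viaTraversal, new ArrayList<>());
        for (String label : new String[] {"person", "software"}) {
            direct.addVertex(label);   // one call per vertex
            g.addV(label).next();      // copy + record + build traversal + run step
        }
        // Both paths end with the same vertices; only the work per call differs.
        System.out.println(direct.vertices.equals(viaTraversal.vertices)); // prints "true"
    }
}
```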

Solution

  • There is Gremlin (i.e. g.addV()) and there is a lower-level API that TinkerPop refers to as the Structure API (e.g. graph.addVertex()). The Structure API is meant for providers who implement the TinkerPop interfaces, and Gremlin is meant for users like you. I suppose that in OrientDB's case the Structure API puts you closer to the metal, so to speak, and if you're just calling addVertex() in a loop then I can imagine it's going to be faster.

    In reality though, inserting data is rarely as sterile as just calling addVertex() or addV(). You're typically adding a vertex along with the structure around it: connecting edges and other vertices. When you are adding bare vertices in a loop like that, it's usually a bulk-loading scenario, and you'd probably get even better performance from your graph's bulk loader, which will simply blow both Gremlin and the Structure API out of the water.

    That said, there are a number of downsides to an end-user utilizing the Structure API:

    1. You lose code portability. Some graph databases will not support the Structure API directly, so you will be bound to only the graphs that do.
    2. You typically can't use the Structure API remotely.
    3. You are restricted to Java/JVM languages; the Structure API can't be used from any other language.
    4. Since it is not user-facing or recommended for users, TinkerPop could simply do away with it in a future release. I'd say this is unlikely for version 3.x, but in the longer term it is much more possible.

    Not every graph will have that difference between Gremlin and the Structure API, if you can even draw that comparison well. I think if you're just testing for the purpose of bulk loading, which is typically a one-time effort, then using the Structure API isn't a bad choice. Or perhaps, if you know you will forever be bound to embedded mode, in Java, on the same database, you could use the Structure API where performance really matters. But generally speaking, the recommendation is to stick to writing Gremlin.