gremlin

What's the right approach to "join" two nodes of a graph in gremlin?


Suppose I have a graph looks like below

graph = TinkerGraph.open()
g = graph.traversal()
v1 = g.addV('CC').property('name','t1')
v2 = g.addV('KK').property('name','t1')

I want to find all CC which has the same 'name' as KK. I can write:

g.V().hasLabel('CC').as('c').values('name').as('cn').V().hasLabel('KK').values('name').as('k').where('cn',eq('k')).select('c')

This mimics the join in SQL but writing so the performance seems very bad. From SQL2Gremlin, they have example of "join" two nodes if there is an edge connected between two. I was wondering if there is any join approach in gremlin that whether there is a path connected two nodes is unknown beforehand? In other words, what's the best approach to write "join" in gremlin and we don't know whether such two nodes are connected or not either directly or via a path?

Thanks much!


Solution

  • Your intuition is largely right. The "join" is a realized relationship (i.e. an edge) between two vertices. That is generally the gain to be had in a graph database. Doing vertex to vertex joins on properties in a SQL style will not typically be efficient for a graph.

    As for your query, you might re-write it to this form for a bit more clarity:

    gremlin> g.V().hasLabel('CC').as('c').
    ......1>   V().hasLabel('KK').
    ......2>   where(eq('c')).
    ......3>     by('name').
    ......4>   select('c')
    ==>v[0]
    

    however performance will likely remain the same as I don't think any graph system will currently optimize this traversal. There will be no index usage and you will be left with full graph scans of "CC" and "KK" to get your results. Obviously that's very expensive on a large graph.

    There was some discussion on this topic in the Gremlin Users mailing list here where a number of nice points were made. Of note, Josh Perryman wrote (among other nice points):

    A SQL-style of join is a very poor use of a graph db engine. Like Daniel suggests joins should be pre-computed and the edges added at write-time / data I jest time.

    This is by necessity and design. Edges are basically materialized joins. Graph databases are optimized for them, a disk or cache read operation. Relational databases are optimized for joins, a query-time computer operation.

    It is usually significantly cheaper to pre-compute the edges in a separate engine before loading data than to do so after data is loaded into the graph. The exception to this is when an edge is determined based on a multi-hop path through the graph. For that use case a graph dB is best.