java gremlin tinkerpop amazon-neptune tinkerpop3

Neptune gremlin java traversing with hops


We have a model in Neptune DB for tag and entity relationships as follows:

Vertices - tags.

id
---
123454
123478
345678
435677
890890
727455
345588
....
....

Edges - links between tags.

id   | from   | to     | hash | score
---------------------------------------------
3455 | 123454 | 345678 | key1 | 2.4
3456 | 123454 | 123478 | key2 | 3.5
3457 | 345678 | 435677 | key2 | 3
3458 | 435677 | 435677 | key3 | 1
3459 | 123454 | 435677 | key4 | 4
3460 | 345678 | 890890 | key5 | 3.25
3461 | 890890 | 435677 | key6 | 1.5
3462 | 123454 | 727455 | key7 | 0.5
3463 | 123454 | 345588 | key8 | 0.75

'from' and 'to' are the source (out) and target (in) vertices of each edge, given as tag ids. Note that 'hash' can have duplicate values across edges.

We are programming in Java and using TinkerPop Gremlin version 3.5 as the client.

Basically, I'm trying to find the aggregated scores between tags and entities, up to a given maximum number of hops.

Example request scenarios:

  1. entity-entity: total score between entity hashes 'key1' and 'key2' with max hops 2

  2. tag-tag: total score between tag ids '123454' and '435677' with max hops 1

  3. tag-entity: total score between tag id '123454' and entity hash 'key1' with max hops 0 (direct)

I'm new to Neptune/Gremlin and don't have a clear idea of how to do this. I have been referring to the TinkerPop documentation without much progress.

An example of a query that I have written for scenario 3 is:

List<Map<Object, Object>> edges = graphTraversalSource.V(tagId).outE().has("hash", entityHash).valueMap().toList();

I then get the scores by iterating over the list and summing them. This works for the direct case, but I'm not sure how to extend it to 1 or 2 hops.

Also, this is for tag-entity only. I haven't managed to do the tag-tag or entity-entity scoring.
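
The client-side summation described for scenario 3 could look like the sketch below. It is plain Java with no TinkerPop dependency. The names DirectScoreSum, sumScores and edgeMaps are made up for illustration, and the code assumes each map returned by valueMap() holds the edge's 'score' as a single numeric value.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper for the direct (0 hop) tag-entity case. */
final class DirectScoreSum {

    /**
     * Adds up the "score" entry of each edge property map.
     * Assumption: the value is a Number; maps without a numeric score are skipped.
     */
    static double sumScores(List<Map<Object, Object>> edgeMaps) {
        double total = 0.0;
        for (Map<Object, Object> edge : edgeMaps) {
            Object score = edge.get("score");
            if (score instanceof Number) {
                total += ((Number) score).doubleValue();
            }
        }
        return total;
    }
}
```

If only the total is needed, it may be simpler to let the server do the addition by ending the traversal with values("score").sum() instead of valueMap().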

Any ideas on modelling these queries would be really helpful. Many thanks!


Solution

  • The following steps can be used to build a sample graph from a subset of the data provided.

    g.addV('tag').property(id,123454).as('t1').
      addV('tag').property(id,123478).as('t2').
      addV('tag').property(id,345678).as('t3').
      addV('tag').property(id,435677).as('t4').
      addE('link').from('t1').to('t3').property(id,3455).property('hash','key1').property('score',2.4).
      addE('link').from('t1').to('t2').property(id,3456).property('hash','key2').property('score',3.5).
      addE('link').from('t3').to('t4').property(id,3457).property('hash','key2').property('score',3).
      addE('link').from('t4').to('t4').property(id,3458).property('hash','key3').property('score',1)  
    

    We can confirm the graph looks OK by rendering it using a graph-notebook.

    The query below shows the scores along the paths between nodes 123454 and 435677. The repeat step combines until with loops to stop either when a specific target is found or when a given depth is reached. The second hasId is necessary to make sure that we found the node we wanted and did not just hit the maximum depth.

    g.V(123454).
      repeat(outE().inV().simplePath()).
      until(hasId(435677).or().loops().is(2)).
      hasId(435677).
      path().
        by(id).
        by('score')
    

    When run using the Gremlin Console, we can see the results:

    gremlin> g.V(123454).
    ......1>   repeat(outE().inV().simplePath()).
    ......2>   until(hasId(435677).or().loops().is(2)).
    ......3>   hasId(435677).
    ......4>   path().
    ......5>     by(id).
    ......6>     by('score')
    
    ==>[123454,2.4,345678,3,435677]
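
To make the repeat/until semantics concrete, here is a minimal plain-Java sketch of what the traversal does, run against the same four edges. It does not use TinkerPop at all: HopPathSketch, Edge, sampleEdges and paths are hypothetical names introduced for illustration, and maxEdges plays the role of the value passed to loops().is(...).

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the repeat/until traversal; no TinkerPop dependency. */
final class HopPathSketch {

    /** Hypothetical holder for one edge: from --score--> to. */
    static final class Edge {
        final long from;
        final long to;
        final double score;

        Edge(long from, long to, double score) {
            this.from = from;
            this.to = to;
            this.score = score;
        }
    }

    /** The four edges created by the sample-graph script. */
    static List<Edge> sampleEdges() {
        return List.of(
            new Edge(123454L, 345678L, 2.4),
            new Edge(123454L, 123478L, 3.5),
            new Edge(345678L, 435677L, 3.0),
            new Edge(435677L, 435677L, 1.0));
    }

    /**
     * Returns every simple path from source to target that uses at most maxEdges edges.
     * Each path is laid out like the console output: vertex id, score, vertex id, ...
     */
    static List<List<Object>> paths(List<Edge> edges, long source, long target, int maxEdges) {
        List<List<Object>> found = new ArrayList<>();
        List<Object> path = new ArrayList<>();
        path.add(source);
        walk(edges, source, target, maxEdges, 0, path, found);
        return found;
    }

    private static void walk(List<Edge> edges, long current, long target, int maxEdges,
                             int loops, List<Object> path, List<List<Object>> found) {
        for (Edge e : edges) {
            if (e.from != current) continue;    // outE(): only edges leaving the current vertex
            if (path.contains(e.to)) continue;  // simplePath(): never revisit a vertex
            path.add(e.score);
            path.add(e.to);
            if (e.to == target) {
                // until(hasId(target)) stops here and the trailing hasId(target) keeps the path
                found.add(new ArrayList<>(path));
            } else if (loops + 1 < maxEdges) {
                walk(edges, e.to, target, maxEdges, loops + 1, path, found);
            }
            // otherwise the loop limit was hit away from the target, so the path is dropped
            path.remove(path.size() - 1);
            path.remove(path.size() - 1);
        }
    }
}
```

Calling HopPathSketch.paths(HopPathSketch.sampleEdges(), 123454L, 435677L, 2) yields the single path [123454, 2.4, 345678, 3.0, 435677], the same route the console reported. With a limit of 1 the list is empty, because the sample graph has no direct edge from 123454 to 435677.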
    

    All that remains is to extend the query to compute the total score. The sack step is useful for this.

    gremlin> g.withSack(0).
    ......1>   V(123454).
    ......2>   repeat(outE().sack(sum).by('score').inV().simplePath()).
    ......3>   until(hasId(435677).or().loops().is(2)).
    ......4>   hasId(435677).
    ......5>   sack()
    
    ==>5.4
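
The sack can be pictured as a running total that each traverser carries along its path. The plain-Java sketch below models that idea without TinkerPop; all names (SackScoreSketch, Edge, Traverser, questionEdges, pathTotals) are hypothetical. It also assumes, because scenario 3 calls 0 hops "direct", that a hop limit of n allows n + 1 edges, which is how loops().is(2) would correspond to "max hops 1".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Plain-Java model of the sack-based query; no TinkerPop dependency. */
final class SackScoreSketch {

    /** Hypothetical holder for one edge: from --score--> to. */
    static final class Edge {
        final long from;
        final long to;
        final double score;

        Edge(long from, long to, double score) {
            this.from = from;
            this.to = to;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for a traverser: position, visited vertices, depth and sack. */
    static final class Traverser {
        final long vertex;
        final List<Long> visited;
        final int loops;
        final double sack;

        Traverser(long vertex, List<Long> visited, int loops, double sack) {
            this.vertex = vertex;
            this.visited = visited;
            this.loops = loops;
            this.sack = sack;
        }
    }

    /** Edges 3455 to 3463 from the question; the first four form the answer's sample graph. */
    static List<Edge> questionEdges() {
        return List.of(
            new Edge(123454L, 345678L, 2.4),   // 3455
            new Edge(123454L, 123478L, 3.5),   // 3456
            new Edge(345678L, 435677L, 3.0),   // 3457
            new Edge(435677L, 435677L, 1.0),   // 3458
            new Edge(123454L, 435677L, 4.0),   // 3459
            new Edge(345678L, 890890L, 3.25),  // 3460
            new Edge(890890L, 435677L, 1.5),   // 3461
            new Edge(123454L, 727455L, 0.5),   // 3462
            new Edge(123454L, 345588L, 0.75)); // 3463
    }

    /** One total per simple path from source to target that uses at most maxHops + 1 edges. */
    static List<Double> pathTotals(List<Edge> edges, long source, long target, int maxHops) {
        int maxEdges = maxHops + 1; // assumption: 0 hops means a single direct edge
        List<Double> totals = new ArrayList<>();
        Deque<Traverser> pending = new ArrayDeque<>();
        pending.add(new Traverser(source, List.of(source), 0, 0.0)); // withSack(0)
        while (!pending.isEmpty()) {
            Traverser t = pending.poll();
            for (Edge e : edges) {
                // outE() plus simplePath(): follow outgoing edges, never revisit a vertex
                if (e.from != t.vertex || t.visited.contains(e.to)) continue;
                double sack = t.sack + e.score; // sack(sum).by('score')
                if (e.to == target) {
                    totals.add(sack);           // target reached: emit sack()
                } else if (t.loops + 1 < maxEdges) {
                    List<Long> visited = new ArrayList<>(t.visited);
                    visited.add(e.to);
                    pending.add(new Traverser(e.to, visited, t.loops + 1, sack));
                }
            }
        }
        return totals;
    }
}
```

On the first four edges (the sample graph) a hop limit of 1 gives one total of about 5.4, in line with the console. On the full nine-edge table from the question it gives two totals: 4.0 for the direct edge 3459 and about 5.4 for the route via 345678. So on real data the Gremlin query can return one sack per matching path, and if a single figure is wanted, a sum() step after sack() is one option. When porting the console query to Java, the anonymous steps come from the __ class and sum is Operator.sum.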
    

    With these building blocks you should be able to create the queries to answer each of the scenarios.