gremlin janusgraph random-walk gremlinpython

Efficient controlled random walks in gremlin

Goal

The objective is to efficiently generate random walks on a relatively large graph with uneven probabilities of going through edges depending on their type.

Configuration

Ubuntu VM, 23Go RAM
JanusGraph 0.6.1 full
Local graph (default conf/remote.yaml file used)
~1.8m vertices (~28k will be start nodes for the random walks)
~21m relationships (they can all be used in the random walks)

What I am doing

I am currently generating random walks with the sample command:

g.V(<startnode_id>).
  repeat( local( both().sample(1) ) ).
    times(<desired_randomwalk_length>).
  path()

What I tried

I tried using a gremlinpython script to create a random walk generator that would first get all edges connected to the current node, then pick randomly an edge to go through and repeat <desired_randomwalk_length> times.

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.structure.graph import Vertex
from typing import List

connection = DriverRemoteConnection(<URL>, "g")
g = traversal().withRemote(connection)

def get_next_node(start:Vertex) -> Vertex:
  next_vertices = g.V(start.id).both().fold().next()
  return next_vertices[randint(0, len(next_vertices)-1)]
def get_random_walk(start:Vertex, length:int=10) -> List[Vertex]:
  current_node = start
  random_walk = [current_node]
  for _ in range(length):
    current_node = get_next_node(current_node)
    random_walk.append(current_node)
  return random_walk

Issues

While testing on a subset of the total graph (400k vertices, 1.5m rel), I got these results

Sample query, <desired_randomwalk_length> of 10: 100k random walks in 1h10
Gremlinpython function, <desired_randomwalk_length> of 4: 2k random walks in 1h+

The sample command is really fast, but there are a few problems:

It doesn't seem to truly be a uniform distribution pick amongst the edges (it seems to be successive coin tosses) which could lead to certain paths being taken more often, which then diminishes the interest of generating random walks. (I can't directly do what is recommended here as the nodes ids aren't in a sequence, thus I have to acquire them first.)
I haven't found a way to give different probabilities to different types of relationships.

Is there a better way to do random walks with Gremlin?
If there is none, is there a way to modify the sample query to rectify the assign probabilities to types of edges? Maybe even a way to have a better distribution of the sampling?
In last recourse, is there a way to improve the queries to make this "by hand" with a gremlinpython script?

Thanks to everyone reading/replying!

EDIT

Is there a way to do the following:

Given a r_type1, r_type2, r_type3, ... the acceptable relationship type for this random walk
Given a proba1, proba2, proba3, ... the probabilities of going through these relationship types

For each step

Sample a node for each relationship type r_type1, r_type2, r_type3, ...
Keep only one according to the probabilities proba1, proba2, proba3, ...

I think the second step could be done be sampling multiple nodes for each relationships, in accordance with the probas (which could be done by using a gremlinpython script to build the query). This still leaves the question of how to sample on multiple relationships from a single node, and how to randomly pick one in the sampled nodes.

I hope this is clear!

Solution

Thanks to @Kelvin Lawrence's Practical Gremlin (especially the union section), I managed to do what I wanted (or close enough).

The Gremlin query I get is the following:

g.V(<vertex_id>).
  repeat(
    local(
      union(
        both(<relationship_type1>).sample(N1),
        both(<relationship_type2>).sample(N2),
        ...
      ).
      sample(1)
    )
  ).times(<walk_length>).
  path()

The N_ values are set independently of the node, such that the least probable transition yields exactly 1 sample. This also means that the probabilities are not exactly respected where the number of relationships of a given type is inferior to the corresponding N_ value.

The union part is built in python using gremlinpython (nb_samples is the dictionary storing the number of samples needed for each relationship type)

from gremlin_python.process.graph_traversal import __, GraphTraversal

next_node_traversal:GraphTraversal = __.union(
  *[
    __.both(key).sample(nb_samples[key])
    for key in nb_samples
  ]
).sample(1)

(Here we are using the * operator to unpack the list when passing it as argument to the union method)