The objective is to efficiently generate random walks on a relatively large graph with uneven probabilities of going through edges depending on their type.
conf/remote.yaml
file used)I am currently generating random walks with the sample
command:
g.V(<startnode_id>).
repeat( local( both().sample(1) ) ).
times(<desired_randomwalk_length>).
path()
I tried using a gremlinpython script to create a random walk generator that would first get all edges connected to the current node, then pick randomly an edge to go through and repeat <desired_randomwalk_length>
times.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.structure.graph import Vertex
from typing import List
connection = DriverRemoteConnection(<URL>, "g")
g = traversal().withRemote(connection)
def get_next_node(start:Vertex) -> Vertex:
next_vertices = g.V(start.id).both().fold().next()
return next_vertices[randint(0, len(next_vertices)-1)]
def get_random_walk(start:Vertex, length:int=10) -> List[Vertex]:
current_node = start
random_walk = [current_node]
for _ in range(length):
current_node = get_next_node(current_node)
random_walk.append(current_node)
return random_walk
While testing on a subset of the total graph (400k vertices, 1.5m rel), I got these results
<desired_randomwalk_length>
of 10: 100k random walks in 1h10<desired_randomwalk_length>
of 4: 2k random walks in 1h+The sample command is really fast, but there are a few problems:
Is there a better way to do random walks with Gremlin?
If there is none, is there a way to modify the sample query to rectify the assign probabilities to types of edges? Maybe even a way to have a better distribution of the sampling?
In last recourse, is there a way to improve the queries to make this "by hand" with a gremlinpython script?
Thanks to everyone reading/replying!
Is there a way to do the following:
r_type1
, r_type2
, r_type3
, ... the acceptable relationship type for this random walkproba1
, proba2
, proba3
, ... the probabilities of going through these relationship typesFor each step
r_type1
, r_type2
, r_type3
, ...proba1
, proba2
, proba3
, ...I think the second step could be done be sampling multiple nodes for each relationships, in accordance with the probas (which could be done by using a gremlinpython
script to build the query). This still leaves the question of how to sample on multiple relationships from a single node, and how to randomly pick one in the sampled nodes.
I hope this is clear!
Thanks to @Kelvin Lawrence's Practical Gremlin (especially the union section), I managed to do what I wanted (or close enough).
The Gremlin query I get is the following:
g.V(<vertex_id>).
repeat(
local(
union(
both(<relationship_type1>).sample(N1),
both(<relationship_type2>).sample(N2),
...
).
sample(1)
)
).times(<walk_length>).
path()
The N_
values are set independently of the node, such that the least probable transition yields exactly 1 sample. This also means that the probabilities are not exactly respected where the number of relationships of a given type is inferior to the corresponding N_
value.
The union
part is built in python using gremlinpython
(nb_samples
is the dictionary storing the number of samples needed for each relationship type)
from gremlin_python.process.graph_traversal import __, GraphTraversal
next_node_traversal:GraphTraversal = __.union(
*[
__.both(key).sample(nb_samples[key])
for key in nb_samples
]
).sample(1)
(Here we are using the *
operator to unpack the list when passing it as argument to the union
method)