I have a dataset where each row has three features: <src, dest, traceID>. Each row represents a single edge (from source to destination) and the ID of the trace it belongs to. These traces are invocations of microservices collected from an observability tool such as Jaeger, so there can be multiple traces (with different trace IDs) that share the same edge connections. I want to achieve the following:

1. Parse each trace separately into a graph.
2. Group graphs that have the same structure.
3. Dump a representative graph from each group, along with the count of how often that graph occurs in my dataset.

Note that I have 2 million such graphs (the average number of nodes per graph is 15). Is GraphX suitable for such a problem?
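For example, two traces with the same structure but different trace IDs might look like this (service and trace names are purely illustrative):

```
src,dest,traceID
frontend,cart,trace-001
cart,db,trace-001
frontend,cart,trace-002
cart,db,trace-002
```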
I am currently parsing this as an edge RDD, but I am not sure how to parse each trace into its own graph. Should I create a separate graph object for each trace?
IMO, GraphX is missing a lot of the functionality you would need for this.
To address problems similar to yours in my own work, I have developed a PySpark package called splink_graph, which can handle the tasks you are aiming for in a Spark cluster environment.
First, let me outline how I would approach this problem:

1. Parse the rows into an edge list / graph representation.
2. Identify each trace's subgraph (for example via connected components, or simply by grouping on traceID, since it already identifies each graph).
3. Compute a canonical hash of each subgraph's structure (for example a Weisfeiler-Lehman graph hash), so that structurally identical graphs produce the same hash.
4. Group by that hash, count the occurrences, and keep one representative graph per group.
While you could likely execute the first two steps using GraphX (see connected components in the GraphX docs), it cannot handle the latter two out of the box.
With splink_graph, you could:

- compute connected components at scale to identify each subgraph,
- compute the Weisfeiler-Lehman graph hash of each subgraph, and
- group the subgraphs by their hash and count how often each distinct structure occurs.
By following this approach, you should be able to accomplish what you're seeking to do.
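To make this concrete, here is a minimal PySpark sketch of the hashing and grouping steps. It does not use splink_graph's own API; it calls networkx's weisfeiler_lehman_graph_hash directly inside a grouped pandas UDF, and it groups edges by traceID instead of running connected components, since the trace ID already identifies each subgraph. The file names, and the assumption that the edges live in a CSV with columns src, dest, traceID, are mine.

```python
import pandas as pd
import networkx as nx
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Edge list with columns src, dest, traceID (assumed stored as CSV; adapt as needed).
edges = spark.read.csv("traces.csv", header=True).select("src", "dest", "traceID")

def hash_trace(pdf: pd.DataFrame) -> pd.DataFrame:
    """Build one directed graph per trace and compute its Weisfeiler-Lehman hash."""
    g = nx.DiGraph()
    g.add_edges_from(zip(pdf["src"], pdf["dest"]))
    # Structurally identical graphs get the same hash; pass node_attr/edge_attr
    # if the service names themselves should also be taken into account.
    h = nx.weisfeiler_lehman_graph_hash(g)
    return pd.DataFrame({"traceID": [pdf["traceID"].iloc[0]], "wl_hash": [h]})

# One row per trace: (traceID, wl_hash).
trace_hashes = (
    edges.groupBy("traceID")
         .applyInPandas(hash_trace, schema="traceID string, wl_hash string")
)

# Group structurally identical traces, count them, and keep one representative trace ID.
groups = (
    trace_hashes.groupBy("wl_hash")
                .agg(F.count("*").alias("occurrences"),
                     F.first("traceID").alias("representative_traceID"))
)

# Dump the edge list of each representative graph together with its group's count.
representatives = (
    edges.join(groups, edges["traceID"] == groups["representative_traceID"])
         .select("wl_hash", "occurrences", "src", "dest")
)
representatives.write.mode("overwrite").parquet("representative_graphs.parquet")
```

One caveat: the Weisfeiler-Lehman hash is a very strong heuristic, not a full isomorphism test, so in rare cases two non-isomorphic graphs can end up with the same hash; isomorphic graphs, however, always hash identically.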
Of course, what I propose is Python/PySpark-based, not Scala. If that is an issue, I would suggest implementing the connected-component and Weisfeiler-Lehman graph-hash functionality yourself in Scala/Spark.