java · spark-streaming · spark-graphx

updateStateByKey from RDD


I am a bit new to Spark GraphX, so please forgive me if this is a stupid question. I would also prefer to do this in Java rather than Scala, if at all possible.

I need to run a GraphX calculation on the RDDs of a JavaDStream, but I need to roll the results back into my state object.

How would you solve this problem in Java? I am ready to restructure the calculations into a different logical flow if there is a better way to do this.

To make this more concrete, the structure looks like this:

JavaPairDStream<Key, StateObject> stream = inputDataStream.updateStateByKey(function);

stream.foreachRDD(rdd -> {
  Graph<Vertex, EdgeProperty> graph = GraphImpl.apply(/* derive the Vertex and EdgeProperty from the rdd */);
  JavaRDD<Vertex> updatedVertices = graphOperation(graph);
  // How do I put the contents of updatedVertices back into stream?
});

Solution

  • I put my graph calculation in as a transform and got things up and running, up to the point where it hung during fold (in Pregel) and Scala raised errors when running JavaConverters.asScalaIteratorConverter, complaining that there was no appropriate iterator... A rough sketch of that transform structure is included below.

    In short, after reading online that GraphFrames is potentially more stable than GraphX for Java, since the Scala DataFrame API is apparently easier to wrap from a Java context, I have abandoned this approach and moved to GraphFrames. For others who have run into similar problems, I apologize that I have no solution to offer for the GraphX route, but I am finding the DataFrame approach to work much better with my algorithm; a GraphFrames sketch follows the transform one below.
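
    For reference, here is a minimal sketch of the transform structure mentioned above, before I abandoned it. buildGraph, graphOperation, and verticesToStatePairs are hypothetical helpers standing in for the question's own graph code; the point is only that transformToPair returns a new DStream, so the post-graph values flow on to downstream stages, whereas foreachRDD is a terminal output operation.

      import org.apache.spark.api.java.JavaRDD;
      import org.apache.spark.graphx.Graph;
      import org.apache.spark.streaming.api.java.JavaPairDStream;

      // stream is the JavaPairDStream<Key, StateObject> from updateStateByKey.
      // buildGraph, graphOperation and verticesToStatePairs are hypothetical
      // stand-ins for the graph-construction and graph-calculation code
      // sketched in the question.
      JavaPairDStream<Key, StateObject> graphResults =
        stream.transformToPair(rdd -> {
          Graph<Vertex, EdgeProperty> graph = buildGraph(rdd);      // vertices/edges derived from rdd
          JavaRDD<Vertex> updatedVertices = graphOperation(graph);  // the GraphX calculation
          return verticesToStatePairs(updatedVertices);             // back to (key, state) pairs
        });
      // graphResults is an ordinary DStream again, so later stages see the
      // post-graph values rather than the raw updateStateByKey output.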
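
    For anyone following the same route, here is a rough sketch of the GraphFrames equivalent, assuming DataFrames with the schema GraphFrames expects ("id" on vertices, "src" and "dst" on edges) and using PageRank purely as a placeholder for the real graph operation:

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.graphframes.GraphFrame;

      // vertices needs an "id" column; edges need "src" and "dst" columns.
      static Dataset<Row> runGraphCalculation(Dataset<Row> vertices, Dataset<Row> edges) {
        GraphFrame graph = GraphFrame.apply(vertices, edges);
        // PageRank stands in here for the actual calculation.
        GraphFrame ranked = graph.pageRank().resetProbability(0.15).maxIter(10).run();
        return ranked.vertices();  // a plain DataFrame, easy to join back into state
      }

    Because both the inputs and the outputs are DataFrames, the results can be joined back against the state rows with ordinary Dataset operations, which is exactly the part that was awkward with the raw Scala GraphX API from Java.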