scala apache-spark spark-graphx

How does Scala represent immutable maps internally, from a storage standpoint?


I have a Scala application on Spark GraphX. The vertex attribute (VD) contains a Map[Long, Map[Long, Double]] that needs to grow with each iteration. Both maps are created with List.toMap, so AFAIK both the inner and the outer map should be immutable. On very large graph data sets I have come to understand why the documentation for the Pregel API says that ideally the VD should not grow: I am getting the dreaded "Missing an output location for shuffle n partition m" error, i.e., OOM.

So my question is this: how are immutable maps stored internally in Scala? If I had an idea of the memory usage of a map, I could initialize each VD with some number of placeholder bytes that each vertex could "exchange" for map entries as the map grows, so that the overall size does not grow (significantly). This is not the most elegant solution, but I cannot think of another for this particular problem.

Alternatively, if someone could suggest a better way to handle this accumulation of data in the VD, then I am also open to such suggestions.


Solution

  • Answering my own question in an indirect way: there is a very nice piece of documentation, https://spark.apache.org/docs/latest/tuning.html, that discusses the memory overhead of Java types (including maps) and how to reduce it. With this knowledge I have dropped maps altogether, so I do not need to come up with an ugly "ballast" method to keep the memory usage of a VD constant.
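
The answer does not say what replaced the maps. As an illustration only, here is a minimal sketch of one option in the spirit of the tuning guide's advice to prefer arrays of primitives over collection classes. The name CompactTable and its methods are hypothetical, not part of Spark or GraphX, and this is not necessarily what was actually used. It flattens a Map[Long, Map[Long, Double]] into three parallel primitive arrays sorted by (outer key, inner key) and looks entries up by binary search.

```scala
// Hypothetical compact stand-in for Map[Long, Map[Long, Double]].
// Entries live in three parallel primitive arrays, sorted by (outer, inner),
// so there are no boxed keys/values and no per-entry node objects.
final class CompactTable(
    val outer: Array[Long],
    val inner: Array[Long],
    val value: Array[Double]) extends Serializable {

  def size: Int = outer.length

  // Binary search over the sorted (outer, inner) key pairs.
  def get(o: Long, i: Long): Option[Double] = {
    var lo = 0
    var hi = size - 1
    var found: Option[Double] = None
    while (found.isEmpty && lo <= hi) {
      val mid = (lo + hi) >>> 1
      val c =
        if (outer(mid) != o) java.lang.Long.compare(outer(mid), o)
        else java.lang.Long.compare(inner(mid), i)
      if (c == 0) found = Some(value(mid))
      else if (c < 0) lo = mid + 1
      else hi = mid - 1
    }
    found
  }
}

object CompactTable {
  // Flatten the nested map into sorted (outer, inner, value) triples.
  def fromMap(m: Map[Long, Map[Long, Double]]): CompactTable = {
    val flat = m.toSeq
      .flatMap { case (o, in) => in.toSeq.map { case (i, v) => (o, i, v) } }
      .sortBy(t => (t._1, t._2))
    new CompactTable(
      flat.map(_._1).toArray, flat.map(_._2).toArray, flat.map(_._3).toArray)
  }
}
```

With this layout each entry costs 24 bytes of payload (two Longs and one Double) plus the fixed headers of the three arrays, which makes the size of a vertex attribute easy to predict. The trade-off is that adding entries means building a new table, so this sketch suits data that is rebuilt once per iteration rather than updated entry by entry.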