apache-spark, compression, rdd, parquet, memory-footprint

RDD memory footprint in Spark


I'm not sure about the concept of memory footprint. When loading a Parquet file of, say, 1 GB and creating RDDs out of it in Spark, what would be the memory footprint of each RDD?


Solution

  • When you create an RDD out of a Parquet file, nothing is loaded or executed until you run an action (e.g., first, collect) on the RDD; a runnable sketch of this appears after this answer.

    Your memory footprint will most likely vary over time. Say your 1 GB file is split into 100 equally sized partitions of 10 MB each, and you run on a cluster with 20 cores. Then at most about 20 partitions are being processed concurrently, so at any point in time you only need roughly 10 MB x 20 = 200 MB of input data in memory.

    On top of this, Java objects tend to take more space than the raw data they represent, so it's not easy to say exactly how much space your 1 GB file will take in the JVM heap (assuming you load the entire file). It could be 2x, or it could be more; a rough way to estimate it is sketched in the second example below.

    One trick to test this is to force your RDD to be cached and then check the Storage tab in the Spark UI to see how much space the cached RDD actually takes.
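
For illustration, here is a minimal Scala sketch of the points above: reading a Parquet file lazily, checking the partition count, caching the resulting RDD, and triggering an action so its in-memory size shows up in the Spark UI. The file path, app name, and local[*] master are assumptions for the example only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RddFootprintDemo {
  def main(args: Array[String]): Unit = {
    // Local session purely for experimentation; adjust master/app name as needed.
    val spark = SparkSession.builder()
      .appName("rdd-footprint-demo")
      .master("local[*]")
      .getOrCreate()

    // Lazy: this only reads Parquet metadata and builds a plan;
    // the 1 GB of data is not loaded yet.
    val rdd = spark.read.parquet("/path/to/data.parquet").rdd  // hypothetical path

    // How many partitions the file was split into, and roughly how many
    // tasks can run concurrently on this setup.
    println(s"partitions = ${rdd.getNumPartitions}")
    println(s"concurrent tasks ~= ${spark.sparkContext.defaultParallelism}")

    // Force the RDD to be cached, then run an action to materialize it.
    rdd.persist(StorageLevel.MEMORY_ONLY)
    println(s"rows = ${rdd.count()}")   // action: data is read and cached here

    // While the application is still running, open the Spark UI
    // (http://localhost:4040 by default) and check the Storage tab
    // to see how much heap the cached RDD actually occupies.
    Thread.sleep(120000)  // keep the app alive long enough to inspect the UI
    spark.stop()
  }
}
```

Note that MEMORY_ONLY caches deserialized Row objects on the heap, which is exactly where the 2x-or-more blow-up over the on-disk Parquet size tends to show up.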
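
If you want a number without eyeballing the UI, Spark also ships org.apache.spark.util.SizeEstimator (a developer API) that estimates how much heap a deserialized object occupies. Below is a rough sketch that measures a small collected sample and extrapolates; the path is hypothetical and the extrapolation assumes rows are of similar size, so treat the result as a ballpark figure only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.SizeEstimator

object HeapSizeEstimate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("heap-size-estimate")
      .master("local[*]")
      .getOrCreate()

    val df = spark.read.parquet("/path/to/data.parquet")  // hypothetical path

    // Pull a small sample onto the driver and measure its size as JVM objects.
    val sample = df.limit(1000).collect()
    val sampleBytes = SizeEstimator.estimate(sample)

    // Very rough extrapolation: assumes rows are similar in size across the file.
    val totalRows = df.count()
    val estimatedHeapBytes = sampleBytes.toDouble / sample.length * totalRows
    println(f"estimated heap footprint ~ ${estimatedHeapBytes / 1024 / 1024}%.0f MB")

    spark.stop()
  }
}
```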