Tags: python, scala, apache-spark, pyspark, executor

How to calculate Spark driver and executor memory in local machine?


I am a beginner with Spark. In some of my executions, a java.lang.OutOfMemoryError: Java heap space is raised:

java.lang.OutOfMemoryError: Java heap space at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:61)
java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:348)
org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1696)
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:925)
org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:956)
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:61)

From what I have read, this could be due to the missing --driver-memory and --executor-memory args. Spark is hosted in a Docker container, and the PySpark script runs with spark-submit:

docker exec -it pyspark_container \
    /usr/local/lib/python3.10/dist-packages/pyspark/bin/spark-submit \
    /spark_files/mapping.py;

My computer specifications:

8 cores, 30 GB RAM, 1.2 TB SSD

I would like to know whether it is possible, and whether it makes sense, to increase these args since I am not on a cluster, and how to calculate the allocation.

I really appreciate your help.


Solution

  • Since you're not passing a --master argument to your spark-submit command, you're running Spark in local mode. That means the driver and the executors all run inside a single JVM on your machine.

    In that case, the --executor-memory argument is not used. It is the --driver-memory argument that gives your local Spark more memory, since everything shares the driver's heap.

    As we don't know what your data looks like, it is hard to say what a proper size to choose would be. The default value of --driver-memory is 1g. Since you have 30 GB of RAM on your machine, you can increase this.

    Try something like:

    docker exec -it pyspark_container \
        /usr/local/lib/python3.10/dist-packages/pyspark/bin/spark-submit \
        --driver-memory Xg \
        /spark_files/mapping.py;
    

    where X is a number that makes sense for your data. If you have the whole machine to yourself, you can try using a big chunk of the available memory, like --driver-memory 25g.
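
    If you want to confirm that the setting actually took effect, you can read it back from the running SparkContext. The sketch below is a minimal example, not your mapping.py; the app name and the 25g value are assumptions to adjust. Note that spark.driver.memory must be set before the driver JVM starts, so the SparkSession.builder.config(...) call shown here only takes effect when the script itself launches the JVM (e.g. when run with plain python). Under spark-submit, keep using the --driver-memory flag.

    from pyspark.sql import SparkSession

    # Programmatic alternative (assumed values). Only effective if this script
    # starts the driver JVM itself; it is ignored when the JVM is already
    # running, as it is under spark-submit, where --driver-memory applies.
    spark = (
        SparkSession.builder
        .appName("mapping")                    # hypothetical app name
        .config("spark.driver.memory", "25g")  # assumed size; tune to your data
        .getOrCreate()
    )

    # Check what the driver actually got, however it was configured.
    print(spark.sparkContext.getConf().get("spark.driver.memory", "1g (default)"))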