I am a beginner with Spark. In some executions, a java.lang.OutOfMemoryError: Java heap space is raised:
java.lang.OutOfMemoryError: Java heap space at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:61)
java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:348)
org.apache.parquet.bytes.HeapByteBufferAllocator.allocate(HeapByteBufferAllocator.java:32)
org.apache.parquet.hadoop.ParquetFileReader$ConsecutivePartList.readAll(ParquetFileReader.java:1696)
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:925)
org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:956)
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:61)
I have read that this could be due to missing --driver-memory and --executor-memory arguments. Spark is hosted in a Docker container, and the PySpark script runs with spark-submit:
docker exec -it pyspark_container \
/usr/local/lib/python3.10/dist-packages/pyspark/bin/spark-submit \
/spark_files/mapping.py;
My computer specifications: 8 cores, 30 GB RAM, 1.2 TB SSD
I would like to know whether it is possible, and whether it makes sense, to increase these arguments since I am not running on a cluster, and how to calculate the memory allocation.
I really appreciate your help.
Since you're not using a --master argument in your spark-submit command, you're running Spark in local mode. That means the driver and the executors all run inside a single JVM on your machine, so there are no separate executor processes.
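If you want to confirm that, here is a quick check you could temporarily add to mapping.py (a minimal sketch, assuming the script builds its own SparkSession; adjust the names to your script):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Prints something like "local[*]" when no --master is passed to spark-submit
print("master:", spark.sparkContext.master)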
In that case, the --executor-memory argument is not used. It is the --driver-memory argument that gives your local Spark process more heap.
Since we don't know what your data looks like, it is hard to say what a proper size to choose would be. The default value of --driver-memory is 1g. Since you have 30 GB of RAM on your machine, you can increase this.
Try something like:
docker exec -it pyspark_container \
/usr/local/lib/python3.10/dist-packages/pyspark/bin/spark-submit \
--driver-memory Xg \
/spark_files/mapping.py;
where X is a number that makes sense for your data. If you have the whole machine to yourself, you can try giving the driver a big chunk of the available memory, for example --driver-memory 25g.
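To verify that the new value was actually picked up, you can read it back from the running application, for example with something like this inside mapping.py (again a minimal sketch, assuming the script builds its own SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Should print the value passed via --driver-memory (the default is 1g)
print("spark.driver.memory:", spark.sparkContext.getConf().get("spark.driver.memory", "1g"))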