I am using a Jupyter notebook on EMR to handle large chunks of data. While processing the data I see this error:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 108 tasks (1027.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
It seems I need to increase spark.driver.maxResultSize in the Spark config. How do I set spark.driver.maxResultSize from a Jupyter notebook?
I already checked this post: Spark 1.4 increase maxResultSize memory
Also, in an EMR notebook the Spark context is already provided; is there any way to edit the Spark context and increase maxResultSize?
Any leads would be very helpful.
Thanks
You can set the Livy configuration at the start of the Spark session. See https://github.com/cloudera/livy#request-body
Place this in the first cell of your notebook (the -f flag forces the session to be recreated with the new configuration):
%%configure -f
{"conf":{"spark.driver.maxResultSize":"15G"}}
Check the setting of your session by printing the value in the next cell:
print(spark.conf.get('spark.driver.maxResultSize'))
This should resolve the problem.
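For reference, outside EMR notebooks, where you create the SparkSession yourself rather than receiving one from Livy, the same property can be set through the session builder. This is a minimal sketch; the app name and the 15g value are illustrative and should be adjusted to your workload:

from pyspark.sql import SparkSession

# Build a session with a larger driver result-size limit.
# This only works where you control session creation; in EMR notebooks
# the session is pre-created, so use the %%configure cell above instead.
spark = (
    SparkSession.builder
    .appName("example")  # hypothetical app name
    .config("spark.driver.maxResultSize", "15g")
    .getOrCreate()
)

# Verify the setting took effect
print(spark.conf.get("spark.driver.maxResultSize"))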