apache-spark, garbage-collection, apache-spark-sql, tableau-api, spark-thriftserver

Spark Thriftserver stops or freezes due to Tableau queries


The Spark cluster (Spark 2.2) is used by around 30 people via spark-shell and Tableau (10.4). Once a day the Thriftserver gets killed or freezes because the JVM has too much garbage to collect. These are the error messages I can find in the Thriftserver log file:

ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.OutOfMemoryError: GC overhead limit exceeded

ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, java.lang.OutOfMemoryError: GC overhead limit exceeded

ERROR TaskSchedulerImpl: Lost executor 2 on XXX.XXX.XXX.XXX: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

Exception in thread "HiveServer2-Handler-Pool: Thread-152" java.lang.OutOfMemoryError: Java heap space

General information:

The Thriftserver is started with the following options (copied from the web-ui of the master -> sun.java.command):

org.apache.spark.deploy.SparkSubmit --master spark://bd-master:7077 --conf spark.driver.memory=6G --conf spark.driver.extraClassPath=--hiveconf --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --executor-memory 12G --total-executor-cores 12 --supervise --driver-cores 2 spark-internal hive.server2.thrift.bind.host bd-master --hiveconf hive.server2.thrift.port 10001

The Spark standalone cluster has 48 cores and 240 GB of memory across 6 machines. Every machine has 8 cores and 64 GB of memory. Two of them are virtual machines.

The users are querying a Hive table that is backed by a 1.6 GB CSV file replicated on all machines.

Have I done something wrong that allows Tableau to kill the Thriftserver? Is there any other information I could provide that would help you help me?


Solution

  • We are able to bypass this issue by setting:

    spark.sql.thriftServer.incrementalCollect=true
    

    With this parameter set to true, the Thriftserver sends the result to the requester one partition at a time instead of collecting the whole result set first. This lowers the peak memory the Thriftserver needs while it is returning a result.
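As a rough sketch of how the setting could be passed at startup, the launch line below adds it as a `--conf` option to `start-thriftserver.sh`. The master URL, bind host, port, and memory values are taken from the question. The `$SPARK_HOME` location is an assumption, and the exact set of options should be adapted to your own deployment.

```shell
# Sketch only: start the Thriftserver with incremental result collection enabled.
# Assumes $SPARK_HOME points at the Spark 2.2 installation (not stated in the question).
"$SPARK_HOME"/sbin/start-thriftserver.sh \
  --master spark://bd-master:7077 \
  --driver-memory 6G \
  --executor-memory 12G \
  --total-executor-cores 12 \
  --conf spark.sql.thriftServer.incrementalCollect=true \
  --hiveconf hive.server2.thrift.bind.host=bd-master \
  --hiveconf hive.server2.thrift.port=10001
```

The same key can instead be placed in `conf/spark-defaults.conf` if you prefer not to change the launch command. One likely trade-off is that large results may come back more slowly, since partitions are fetched one after another rather than in a single collect.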