performance, scala, apache-spark, heterogeneous

Apache Spark - How to avoid failing slow tasks


Dear fellow Apache Spark enthusiasts,

I recently kicked off a side project with the goal of turning a couple of ODROID XU4 computers into a stand-alone Spark cluster.

After setting up the cluster, I ran into a problem that seems to be specific to heterogeneous multiprocessors (the XU4 uses a big.LITTLE design with four Cortex-A15 and four Cortex-A7 cores). Spark executor tasks run extremely slowly on the XU4 when using all 8 processors. The reason, as mentioned in a comment on my forum post below, is that Spark does not wait for the executors that have been kicked off on the slow processors.

http://forum.odroid.com/viewtopic.php?f=98&t=21369&sid=4276f7dc89a8d7825320e7f705011326&p=152415#p152415

One workaround is to use fewer executor cores and to set the CPU affinity so that the LITTLE cores are not used. This is, however, a less-than-ideal solution.
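For reference, the workaround looks roughly like this. A minimal sketch, assuming the usual XU4 core numbering in which CPUs 0-3 are the Cortex-A7 (LITTLE) cores and CPUs 4-7 are the Cortex-A15 (big) cores; the master URL is a placeholder, and you should check /proc/cpuinfo on your board before relying on the core IDs:

    # In conf/spark-env.sh: offer only four cores per worker.
    export SPARK_WORKER_CORES=4

    # Start the worker pinned to the big cores; child processes
    # inherit the CPU affinity set by taskset.
    taskset -c 4-7 $SPARK_HOME/sbin/start-slave.sh spark://master:7077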

Is there a way to ask Spark to wait a bit longer for feedback from slower executors? Obviously, waiting too long would hurt performance, but the gain from utilising all cores should balance out that cost.

Thanks in advance for any help!


Solution

  • @Dikei's response highlights two potential causes, but it turns out the problem is not the one he suspects. I have the same setup as @TJVR, and it turns out the driver is missing heartbeats from executors. To address this, I added the following to spark-env.sh:

    # Raise the timeouts so heartbeats from executors on the slow cores are
    # not treated as failures; consolidate shuffle files for ext4.
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
    export SPARK_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
    

    This raises the default timeouts so that heartbeats from slow executors are no longer missed. I also set spark.shuffle.consolidateFiles to true to improve shuffle performance on my ext4 filesystem. With these changes to the defaults, I was able to increase the core usage above one without frequently losing executors. The application-level properties can also be set per job, as sketched below.
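    If you would rather not bake the application-level settings into spark-env.sh, they can be passed at submit time instead. A minimal sketch, with the master URL, class name, and jar as placeholders; note that spark.worker.timeout is read by the standalone worker daemon itself, so it still belongs in SPARK_DAEMON_JAVA_OPTS:

        # Hypothetical submit command; adjust the master URL, class, and jar.
        spark-submit \
          --master spark://master:7077 \
          --conf spark.akka.timeout=200 \
          --conf spark.shuffle.consolidateFiles=true \
          --class my.app.Main my-app.jar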