databricks, azure-databricks

Databricks Spot Instance: Completion Guarantee


Databricks allows you to use spot instances for worker nodes.

I am considering using them for interactive clusters. Do I have a guarantee that my code will complete without errors even if spot instances are evicted? I would accept execution delays, but not errors.

Here, it is just stated:

Spot instances are a good choice for workloads where it is acceptable to take longer because one or more spot instances have been evicted by the cloud provider.


Solution

  • There's never a true guarantee, so if reliability and availability are paramount you should not use Spot instances.

    However, Databricks jobs do implement "spot with fallback": Databricks will attempt to launch a spot instance, but if spot instances are unavailable or priced above your max price, it will fall back to On-Demand. Databricks clusters also automatically replace worker nodes that are lost due to spot instance termination/eviction.

    This doesn't necessarily mean your code will execute without error, though; that depends on your code. For example, if you are using Spark Datasets/DataFrames/RDDs, you can lose instances while a job is running and not lose any data. If enough worker nodes are lost while a DataFrame is being processed, though, you may see Spark warnings stating "no more replicas available for rdd_12", meaning so many nodes were lost that Spark had no remaining replicas of some of the data to recover from.

    TL;DR: For an interactive or all-purpose cluster, I recommend using an on-demand driver node and putting only some of the worker nodes on spot instances (e.g. 50% of workers on spot, so a cluster of 4 workers would have 1 on-demand driver + 2 on-demand workers + 2 spot workers). This will provide some cost savings while limiting the chance that too many workers get reclaimed and cause issues.
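
As a rough sketch of the split described above, here is how the cluster spec could be built as a payload for the Databricks Clusters API on Azure. I'm assuming the `azure_attributes` fields `first_on_demand`, `availability` and `spot_bid_max_price` behave as I remember them from the API reference, so check them against the current docs. The helper `build_cluster_spec`, the cluster name, the runtime version and the node type are all made up for illustration. As I understand it, `first_on_demand` counts the driver too, so 1 driver + 2 on-demand workers gives a value of 3.

```python
import json
import math


def build_cluster_spec(num_workers: int, spot_fraction: float = 0.5) -> dict:
    """Build a cluster spec with an on-demand driver and a partial spot worker pool.

    Hypothetical helper: only the dictionary keys are meant to mirror the
    Clusters API; the function itself is not part of any Databricks SDK.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")

    # Round the spot share down so that on-demand capacity is never under-provisioned.
    spot_workers = math.floor(num_workers * spot_fraction)
    on_demand_workers = num_workers - spot_workers

    return {
        "cluster_name": "interactive-mixed-spot",  # placeholder name
        "spark_version": "13.3.x-scala2.12",       # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",         # placeholder VM size
        "num_workers": num_workers,
        "azure_attributes": {
            # Assumption: the first N nodes (driver included) are on-demand.
            "first_on_demand": 1 + on_demand_workers,
            # Assumption: remaining nodes try spot and fall back to on-demand.
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            # Assumption: -1 means "never evict on price" (cap at on-demand price).
            "spot_bid_max_price": -1,
        },
    }


if __name__ == "__main__":
    # 4 workers at 50% spot: 1 on-demand driver + 2 on-demand workers + 2 spot workers.
    print(json.dumps(build_cluster_spec(4), indent=2))
```

You would send the resulting JSON to the cluster create endpoint (or paste the equivalent into the cluster's JSON editor). Keep in mind that this only shapes where the nodes come from; it does not change the point above that whether your code survives an eviction still depends on the code.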