apache-sparkpysparkdatabricksjob-schedulingdatabricks-workflows

How a failed databricks job can continue where it left?


I have a databricks job that run many commands and at the end it tries to save the results to a folder. However, it is failed because it tried to write a file to folder but folder was not exists. I simply created the folder. However, how can I make it continue where it left without executing all the previous commands.


Solution

  • I assume that by Databricks job you refer to the way to run non-interactive code in a Databricks cluster.

    I do not think that what you ask is possible, namely getting the output of a certain Spark task from a previous job run on Databricks. As pointed out in the other answer, "if job is finished, then all processed data is gone". This has to do with the way Spark works under the hood. If you are curious about this topic, I suggest you start reading this post about Transformations and Actions in Spark.

    Although you can think of a few workarounds, for instance if you are curious about certain intermediate outputs of your job, you could decide to temporary write your DataFrame/Dataset to some external location. In this way you can easily resume a the job from your preferred point by reading one of your checkpoints as input. This approach is a bit clanky and I do not recommend it, but it's a quick and dirty solution you might want to choose if you are in the testing/designing phase.

    A more robust solution would involve splitting your job in multiple sub-jobs and setting upstream & downstream dependencies among them. You can do that using Databricks natively (Task dependencies section) or an external scheduler that integrates with Databricks, like Airflow. In this way, you can split your tasks and you will be able to have an higher control granularity on your Application. So, in case of again failures on the writing step, you will be able to re run only the writing easily.