pythonazurecluster-computingdatabrickspool

Can you pre-install libraries on Databricks Pool nodes?


We have a number of Python Databricks jobs that all use the same underlying Wheel package to install their dependencies. Installing this Wheel package even with a node that has been idling in a Pool still takes 90 seconds.

Some of these jobs are very long-running so we would like to use Jobs computer clusters for the lower cost in DBUs.

Some of these jobs are much shorter-running (<10 seconds) where the 90 second install time seems more significant. We have been considering using a hot cluster (All-Purpose Compute) for these shorter jobs. We would like to avoid the extra cost of the All-Purpose Compute if possible.

Reading the Databricks documentation suggests that the Idle instances in the Pool are reserved for us but not costing us DBUs. Is there a way for us to pre-install the required libraries on our Idle instances so that when a job comes through we are able to immediately start processing it?

Is there an alternate approach that can fulfill a similar use case?


Solution

  • You can't install libraries directly into nodes from pool, because the actual code is executed in the Docker container corresponding to Databricks Runtime. There are several ways to speedup installation of the libraries: