We have a number of Python Databricks jobs that all use the same underlying Wheel package to install their dependencies. Installing this Wheel package still takes 90 seconds, even on a node that has been idling in a Pool.
Some of these jobs are very long-running, so we would like to use Jobs Compute clusters for the lower cost in DBUs.
Some of these jobs are much shorter-running (<10 seconds), where the 90-second install time seems much more significant. We have been considering using a hot cluster (All-Purpose Compute) for these shorter jobs, but we would like to avoid the extra cost of All-Purpose Compute if possible.
Reading the Databricks documentation suggests that idle instances in the Pool are reserved for us but are not costing us DBUs. Is there a way for us to pre-install the required libraries on our idle instances so that when a job comes through we are able to start processing it immediately?
Is there an alternate approach that can fulfill a similar use case?
You can't install libraries directly onto the nodes in a pool, because the actual code is executed in the Docker container corresponding to the Databricks Runtime. There are several ways to speed up installation of the libraries:
- Build a custom Docker image with all of the necessary libraries pre-installed and pre-load it into the pool. This is configured via the REST API (see the preloaded_docker_images attribute of the Instance Pools API), databricks-cli, or the Databricks Terraform provider. The main disadvantage of custom Docker images is that some functionality isn't available out of the box, for example arbitrary files in Repos, the web terminal, etc. (I don't remember the full list). A sketch of creating such a pool is shown after this list.
- For Python dependencies, download them once as binary wheels with pip download --prefer-binary lib1 lib2 ..., copy the wheels to DBFS, and install from there (for example via an init script) instead of resolving and downloading from PyPI on every run. See the init script sketch after this list.
- For Java/Scala dependencies, run mvn dependency:get -Dartifact=<maven_coordinates>, which will download the dependencies into the ~/.m2/repository folder, from which you can copy the jars to DBFS and, in an init script, use the cp /dbfs/.../jars/* /databricks/jars/ command. See the last sketch after this list.
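A minimal sketch of the first option, assuming the custom image has already been pushed to a registry. The pool name, node type, runtime version, registry credentials, and image URL are placeholders, and DATABRICKS_HOST / DATABRICKS_TOKEN are assumed to be set in the environment:

```bash
# Create an instance pool whose idle nodes pre-load a custom Docker image that
# already contains the Wheel package and its dependencies.
curl -X POST "$DATABRICKS_HOST/api/2.0/instance-pools/create" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "instance_pool_name": "prewarmed-with-deps",
    "node_type_id": "i3.xlarge",
    "min_idle_instances": 2,
    "preloaded_spark_versions": ["11.3.x-scala2.12"],
    "preloaded_docker_images": [
      {
        "url": "myregistry.example.com/runtime-with-deps:latest",
        "basic_auth": {"username": "<registry-user>", "password": "<registry-token>"}
      }
    ]
  }'
```

The job clusters that draw from this pool still need to reference the same image in their cluster spec (the docker_image field), so the pre-loaded container is the one that actually runs.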
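For the wheels option, a sketch of a cluster init script that installs from wheels staged on DBFS; the DBFS path and the package name (our_internal_package) are placeholders:

```bash
#!/bin/bash
# Cluster init script: install the Wheel and its dependencies from wheels that
# were staged on DBFS beforehand, e.g. with:
#   pip download --prefer-binary -d ./wheels our_internal_package
#   databricks fs cp --recursive ./wheels dbfs:/FileStore/wheels/our_internal_package
set -euo pipefail

WHEEL_DIR=/dbfs/FileStore/wheels/our_internal_package

# --no-index keeps pip from reaching PyPI; everything resolves from the staged wheels.
# /databricks/python/bin/pip targets the cluster's Python environment.
/databricks/python/bin/pip install --no-index --find-links="$WHEEL_DIR" our_internal_package
```

This still runs at cluster start, but it removes the dependency resolution and download time; only the Docker image route truly pre-installs anything onto the pool's nodes.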
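And for the Java/Scala route, a sketch of the corresponding init script, assuming the jars resolved with mvn dependency:get were copied from ~/.m2/repository to dbfs:/FileStore/jars/m2 (a placeholder path):

```bash
#!/bin/bash
# Cluster init script: put pre-staged jars onto the Databricks classpath.
set -euo pipefail

# ~/.m2/repository has a nested directory layout, so copy every jar found
# under the staged copy into /databricks/jars/.
find /dbfs/FileStore/jars/m2 -name '*.jar' -exec cp {} /databricks/jars/ \;
```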