I'm developing some Python code that will be used as entry points for various wheel-based workflows on Databricks. Since it's under development, after every code change I need to build a wheel and deploy it on a Databricks cluster to test it (I use some functionality that's only available in the Databricks runtime, so I cannot run it locally).
Here is what I do:
REMOTE_ROOT='dbfs:/user/kash@company.com/wheels'
cd /home/kash/workspaces/project
rm -rf dist
poetry build
whl_file=$(ls -1tr dist/project-*-py3-none-any.whl | tail -1 | xargs basename)
echo 'copying..' && databricks fs cp --overwrite dist/$whl_file $REMOTE_ROOT
echo 'installing..' && databricks libraries install --cluster-id 111-222-abcd \
--whl $REMOTE_ROOT/$whl_file
# ---- I WANT TO AVOID THIS as it takes time ----
echo 'restarting' && databricks clusters restart --cluster-id 111-222-abcd
# Run the job that uses some modules from the wheel we deployed
echo 'running job..' && databricks jobs run-now --job-id 1234567
The problem is that every time I make even a one-line change, I need to restart the cluster, which takes 3-4 minutes. Unless I restart the cluster, databricks libraries install does not reinstall the wheel.
I've tried bumping the wheel's version number, but then the GUI (Compute -> select cluster -> Libraries tab) shows two versions of the same wheel installed, while on the cluster itself the newer version is not actually installed (verified using ls -l .../site-packages/).
Here is what we ended up doing finally.
TL;DR:
- Create a generic script module-runner.py that can install a wheel file and execute a module from it. The key is that it installs the module and its dependencies at "notebook scope", so no cluster restart is needed.
- Upload module-runner.py to Databricks and create a job pointing to it (see the job-definition sketch after the note below). The job parameters are the module + wheel file to run.
- module-runner.py assumes that the module being run provides a main() method.
NOTE: In this case the runner script is in the workspace and the wheel file is in dbfs. Not the best choice.
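For the job itself, here is a minimal sketch of a definition using databricks jobs create --json (the job name is made up, and whether python_file may point at a workspace path rather than dbfs:/ depends on your Jobs API and runtime version):

databricks jobs create --json '{
  "name": "module-runner",
  "existing_cluster_id": "111-222-abcd",
  "spark_python_task": {
    "python_file": "/Users/kash/module-runner.py"
  }
}'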
Create your module (com.kash.module1 in this case) with a main() method:

import some_other_module, sys

def main(*args):
    print(args)
Upload module-runner.py to your workspace (/Users/kash/ in this case):
in this case):import argparse, importlib, logging, os, pip, sys, traceback
from datetime import datetime
def main(argv=None):
logging.getLogger('py4j').setLevel(logging.ERROR)
parser = argparse.ArgumentParser(description='Module runner')
parser.add_argument('-m', '--module', help='an importable module, present in installed libraries or in specified --wheel-file.', required=True)
parser.add_argument('-w', '--wheel-file', help='path (or glob pattern) to wheel file to install', required=False)
args, parameters = parser.parse_known_args(argv)
# If --wheel-file is specified then we install wheel at Notebook scope.
# If --wheel-file is NOT specified then we assume it's a cluster library and importable.
if args.wheel_file:
lstat = os.lstat(args.wheel_file)
print(f'lstat(args.wheel_file): {lstat}, mtime: {datetime.fromtimestamp(lstat.st_mtime).isoformat()}')
pip.main(['install', args.wheel_file])
try:
importlib.import_module(args.module).main(*parameters)
# main could be defined as:
# def main() OR def main(*kwargs) OR def main(arg1: <type1>, arg2: <type2>, ...)
except Exception as ex:
print(f'Execution of {args.module}.main() failed with exception. e: {ex}, parameters: {parameters}')
traceback.print_exception(type(ex), ex, ex.__traceback__)
raise ex
if __name__ == '__main__':
main(sys.argv[1:])
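To make the parameter flow concrete: parse_known_args() consumes -m and -w and leaves everything else in parameters, which is forwarded to the module's main(). For example, with job parameters ["-m", "com.kash.module1", "-w", "/dbfs/Users/kash/my-module1-wheel.whl", "p1", "p2"], the runner effectively calls importlib.import_module('com.kash.module1').main('p1', 'p2').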
Create a wheel of your module com.kash.module1 and upload the wheel file to /dbfs/Users/kash/my-module1-wheel.whl.
Invoke the job with appropriate params, e.g. ["-m", "com.kash.module1", "-w", "/dbfs/Users/kash/my-module1-wheel.whl"].
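With the Databricks CLI that invocation could look like the following sketch (assuming the --python-params flag, which applies to spark_python_task jobs; check your CLI version):

databricks jobs run-now --job-id 1234567 \
  --python-params '["-m", "com.kash.module1", "-w", "/dbfs/Users/kash/my-module1-wheel.whl"]'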
This will install my-module1-wheel.whl and all its dependencies every time you run the job.
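Putting it together, the development loop from the question shrinks to the following sketch (assuming you copy the built wheel to a fixed dbfs name so the job parameters never change); note there is no libraries install step and, crucially, no cluster restart:

cd /home/kash/workspaces/project
rm -rf dist && poetry build
whl_file=$(ls -1tr dist/project-*-py3-none-any.whl | tail -1 | xargs basename)
echo 'copying..' && databricks fs cp --overwrite dist/$whl_file dbfs:/Users/kash/my-module1-wheel.whl
echo 'running job..' && databricks jobs run-now --job-id 1234567 \
  --python-params '["-m", "com.kash.module1", "-w", "/dbfs/Users/kash/my-module1-wheel.whl"]'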