Tags: pyspark, deployment, databricks, python-wheel

How to reinstall same version of a wheel on Databricks without cluster restart


I'm developing some Python code to be used as entry points for various wheel-based workflows on Databricks. Since it's under development, every time I change the code I need to build a wheel and deploy it to a Databricks cluster to test it (I use some functionality that's only available in the Databricks runtime, so I can't run it locally).

Here is what I do:

REMOTE_ROOT='dbfs:/user/kash@company.com/wheels'
cd /home/kash/workspaces/project
rm -rf dist

poetry build
whl_file=$(ls -1tr dist/project-*-py3-none-any.whl | tail -1 | xargs basename)
echo 'copying..'     && databricks fs cp --overwrite dist/$whl_file $REMOTE_ROOT
echo 'installing..'  && databricks libraries install --cluster-id 111-222-abcd \
                                                    --whl $REMOTE_ROOT/$whl_file
# ---- I WANT TO AVOID THIS as it takes time ----
echo 'restarting'    && databricks clusters restart --cluster-id 111-222-abcd

# Run the job that uses some modules from the wheel we deployed
echo 'running job..' && databricks jobs run-now --job-id 1234567

The problem is that every time I make even a one-line change I need to restart the cluster, which takes 3-4 minutes. Unless I restart the cluster, databricks libraries install does not reinstall the wheel.

I've tried bumping the wheel's version number, but then the GUI (Compute -> Select-cluster -> Libraries tab) shows the cluster has two versions of the same wheel installed, while on the cluster itself the newer version is not actually installed (verified using ls -l .../site-packages/).


Solution

  • Here is what we ended up doing in the end.

    TL;DR:


    NOTE: In this setup the runner script lives in the workspace and the wheel file lives in DBFS. Not the best choice, but it works.

    1. Module to run:
    import some_other_module  # placeholder for whatever your module really imports

    def main(*args):
        print(args)
    
    2. Upload this module-runner.py to your workspace (/Users/kash/ in this case):
    import argparse, importlib, logging, os, pip, sys, traceback
    from datetime import datetime
    
    def main(argv=None):
        logging.getLogger('py4j').setLevel(logging.ERROR)
        parser = argparse.ArgumentParser(description='Module runner')
        parser.add_argument('-m', '--module', help='an importable module, present in installed libraries or in specified --wheel-file.', required=True)
        parser.add_argument('-w', '--wheel-file', help='path (or glob pattern) to wheel file to install', required=False)
        args, parameters = parser.parse_known_args(argv)
        # If --wheel-file is specified then we install wheel at Notebook scope.
        # If --wheel-file is NOT specified then we assume it's a cluster library and importable.
        if args.wheel_file:
            lstat = os.lstat(args.wheel_file)
            print(f'lstat(args.wheel_file): {lstat}, mtime: {datetime.fromtimestamp(lstat.st_mtime).isoformat()}')
            pip.main(['install', args.wheel_file])
    
        try:
            importlib.import_module(args.module).main(*parameters)
            # main could be defined as:
            # def main() OR def main(*args) OR def main(arg1: <type1>, arg2: <type2>, ...)
        except Exception as ex:
            print(f'Execution of {args.module}.main() failed with exception. e: {ex}, parameters: {parameters}')
            traceback.print_exception(type(ex), ex, ex.__traceback__)
            raise ex
    
    if __name__ == '__main__':
        main(sys.argv[1:])
    
    
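    The split between the runner's own flags and the pass-through parameters relies on argparse.parse_known_args, which returns the recognized arguments plus a list of everything it didn't recognize. A minimal local sketch (the --input-table and --dry-run flags are made up to stand in for a real module's arguments):

    ```python
    import argparse

    # Same flags as module-runner.py; anything argparse does not recognize
    # is returned untouched and later forwarded to the target module's main().
    parser = argparse.ArgumentParser(description='Module runner')
    parser.add_argument('-m', '--module', required=True)
    parser.add_argument('-w', '--wheel-file', required=False)

    args, parameters = parser.parse_known_args(
        ['-m', 'com.kash.module1',
         '-w', '/dbfs/Users/kash/my-module1-wheel.whl',
         '--input-table', 'events', '--dry-run'])  # hypothetical module args

    print(args.module)   # com.kash.module1
    print(parameters)    # ['--input-table', 'events', '--dry-run']
    ```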
    3. Create the module-runner job, with a task that runs /Users/kash/module-runner.py.
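    A job of this shape boils down to a spark_python_task pointing at the runner script. As a rough sketch only (field availability depends on your Jobs API version; the cluster id and paths are the ones used in this example), a legacy Jobs API 2.0-style payload might look like:

    ```json
    {
      "name": "module-runner",
      "existing_cluster_id": "111-222-abcd",
      "spark_python_task": {
        "python_file": "/Users/kash/module-runner.py",
        "parameters": ["-m", "com.kash.module1",
                       "-w", "/dbfs/Users/kash/my-module1-wheel.whl"]
      }
    }
    ```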

    4. Create a wheel of your module com.kash.module1 and upload the wheel file to /dbfs/Users/kash/my-module1-wheel.whl.

    5. Invoke the job with the appropriate params, e.g. ["-m", "com.kash.module1", "-w", "/dbfs/Users/kash/my-module1-wheel.whl"].

    This will install my-module1-wheel.whl and all its dependencies every time you run the job, with no cluster restart needed.
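    The dynamic dispatch at the heart of the runner, importlib.import_module(args.module).main(*parameters), can be exercised locally with a throwaway module standing in for com.kash.module1 (the module name demo_module here is made up for the sketch):

    ```python
    import importlib
    import sys
    import types

    # Build a stand-in module with a main() and register it under a made-up
    # name, so importlib can find it just like an installed wheel's module.
    demo = types.ModuleType('demo_module')
    exec("def main(*args):\n    return list(args)", demo.__dict__)
    sys.modules['demo_module'] = demo

    # Same call shape module-runner.py uses on the real wheel's module.
    result = importlib.import_module('demo_module').main('--input-table', 'events')
    print(result)  # ['--input-table', 'events']
    ```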