python, databricks, python-packaging, databricks-dbx

Nested Python package structure and using it to create Databricks wheel task


Problem understanding Python package structure and how to use it to trigger a Python wheel task in Databricks.

It could either be something fundamental about Python packages/modules that I misunderstand, or something specific to Databricks. I have tried multiple options but none work.

So, jumping in:
I would like to call the triggerjob function in createtables.py using

package_name: dbxdemo and entry_point: jobs.createexternaltables.createtables.triggerjob

I have also tried using

package_name: dbxdemo.jobs.createexternaltables.createtables and entry_point: triggerjob

My package structure is:

dbxdemo
 |--jobs
 |   |--createexternaltables
 |   |   |--__init__.py
 |   |   |--createtables.py
 |   |--sample
 |   |   |--__init__.py
 |   |   |--entrypoint.py
 |   |--__init__.py
 |--common.py
 |--__init__.py

Then I updated my __init__.py files in the various subfolders as follows:

# dbxdemo/__init__.py
from . import jobs
__all__=['jobs']
__version__ = "0.0.1"

# dbxdemo/jobs/__init__.py 
from . import createexternaltables
from . import sample
__all__=['createexternaltables', 'sample']

# dbxdemo/jobs/createexternaltables/__init__.py

from .createtables import *

The createtables.py file has this sample code

import logging
#import dbxdemo.common

from dbxdemo.common import Job


class CreateExternalTable(Job):

    def launch(self):
        try:
            pass  # do something
        except Exception as e:
            logging.exception(e)  # do logging


def triggerjob():  # created this outside the class to see if that helps, but no (ideally this would be part of the class)
    job = CreateExternalTable()
    job.launch()
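
For context, Job in common.py is the base class from the dbx project template. A minimal sketch of such a base class, assuming it only needs a logger and an abstract launch() (the real template version also wires up Spark, dbutils and config loading), looks like this:

# dbxdemo/common.py -- hypothetical minimal version, for illustration only
import logging
from abc import ABC, abstractmethod


class Job(ABC):
    """Minimal abstract base class for wheel-task jobs."""

    def __init__(self):
        # each concrete job gets a logger named after its own class
        self.logger = logging.getLogger(self.__class__.__name__)

    @abstractmethod
    def launch(self):
        """Implemented by each concrete job."""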

When I try to create a Databricks Python wheel task and provide the package_name as

dbxdemo

and entry_point as

jobs.createexternaltables.createtables.triggerjob

I keep getting an error that

module 'dbxdemo' has no attribute 'jobs'

I have also gone through other SO posts and tried various combinations.

I have also tried putting the package_name as dbxdemo.jobs.createexternaltables.createtables and the entry_point as triggerjob, but even that does not work.

In addition, I have also tried changing setup.py (see the comment):

from setuptools import find_packages, setup
from dbxdemo import __version__

setup(
    name="dbxdemo.jobs.createexternaltables.createtables", #earlier also tried with dbxdemo
    packages=find_packages(exclude=["tests", "tests.*"]),
    setup_requires=["wheel"],
    version=__version__,
    description="",
    author=""
)
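
A quick way to confirm that find_packages() is picking up the whole dbxdemo tree is to run it from the project root (the folder containing setup.py); it should print something like the comment below:

# run from the project root, next to setup.py and the dbxdemo/ folder
from setuptools import find_packages

print(find_packages(exclude=["tests", "tests.*"]))
# expected for this layout:
# ['dbxdemo', 'dbxdemo.jobs', 'dbxdemo.jobs.createexternaltables', 'dbxdemo.jobs.sample']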

P.S.: If the problem is Databricks-specific, then this is the dbx documentation I have been following: here

I have a feeling this is probably Databricks-related, as I can install the library manually and call this successfully:

import dbxdemo

dbxdemo.jobs.createexternaltables.createtables.triggerjob()

Solution

  • Found the answer. A few things:

    1. The imports in the various __init__.py files didn't matter, so I removed them.
    2. dbx execute needs a mandatory "parameters" element in deployment.yaml (this helped me get to the actual error of not being able to find the entry point).
    3. The setup.py file was modified to add the entry point (I looked at various answers like this one and also this one, but none actually explained the reason why; finally, this SO accepted answer explained entry points very lucidly):
    from setuptools import find_packages, setup
    from dbxdemo import __version__

    setup(
        name="dbxdemo",
        packages=find_packages(exclude=["tests", "tests.*"]),
        setup_requires=["wheel"],
        version=__version__,
        description="",
        author="",
        entry_points={
            'console_scripts': [
                'triggerjob = dbxdemo.jobs.createexternaltables.createtables:triggerjob',
            ],
        },
    )
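
    To double-check the entry point locally before wiring it into a job, one option is to pip-install the built wheel into a local environment and resolve the registered entry point with importlib.metadata (this mirrors what a console script wrapper does):

    # optional local check: assumes the wheel built from this setup.py
    # has been pip-installed into the current environment
    from importlib.metadata import distribution

    dist = distribution("dbxdemo")
    for ep in dist.entry_points:
        print(ep.name, ep.group, ep.value)
    # expected: triggerjob console_scripts dbxdemo.jobs.createexternaltables.createtables:triggerjob

    ep = next(ep for ep in dist.entry_points if ep.name == "triggerjob")
    ep.load()()  # calls the same triggerjob() the wheel task will invoke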
    

    Once this was done, the package_name can be set to dbxdemo and the entry_point to triggerjob when creating a Databricks Python wheel task.

    P.S.: for anyone interested in doing this through dbx, your deployment.yaml should be:

    - name: "dbxdemowhl"
      <<:
        - *basic-static-cluster
      python_wheel_task:
        package_name: "dbxdemo"
        entry_point: triggerjob
        parameters: []  # This must be passed even if empty, as dbx execute would error out otherwise