Tags: databricks, azure-databricks

Databricks notebook_path - Unable to access the notebook


I have a simple Python script I would like to deploy to Databricks and run as a workflow:

src/data_extraction/iban/test.py:

from pyspark.sql import SparkSession, DataFrame

def get_taxis(spark: SparkSession) -> DataFrame:
    # Read the sample NYC taxi trips table that ships with the workspace.
    return spark.read.table("samples.nyctaxi.trips")

# Create a new Databricks Connect session. If this fails,
# check that you have configured Databricks Connect correctly.
# See https://docs.databricks.com/dev-tools/databricks-connect.html.
def get_spark() -> SparkSession:
    try:
        from databricks.connect import DatabricksSession

        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        return SparkSession.builder.getOrCreate()

def main():
    get_taxis(get_spark()).show(5)

if __name__ == "__main__":
    main()

And I have my YAML file to deploy the job to Databricks:

# The main job for extract.
resources:
  jobs:
    extract_job:
      name: extract_job

      trigger:
        # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
        periodic:
          interval: 1
          unit: DAYS

      email_notifications:
        on_failure:
          - xx@xx.com

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../../src/data_extraction/iban/test.py
        
        - task_key: main_task
          depends_on:
            - task_key: notebook_task
          
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: extract
            entry_point: main
          libraries:
            # By default we just include the .whl file generated for the extract package.
            # See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html
            # for more information on how to add other libraries.
            - whl: ../dist/*.whl

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_D3_v2
            autoscale:
              min_workers: 1
              max_workers: 4

Now I deploy the bundle to Databricks:

databricks bundle deploy --profile dev

And it's deployed to Workspace/Users/xx@xx.com/.bundle/bundlename/dev/files.
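
For context, my databricks.yml does not set a custom root_path; a minimal version looks roughly like this (the host below is a placeholder):

# Roughly my databricks.yml; the host is a placeholder.
bundle:
  name: bundlename

include:
  - resources/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
      # With no explicit root_path, the CLI defaults to
      # ~/.bundle/${bundle.name}/${bundle.target}, which is why the files
      # end up under /Workspace/Users/xx@xx.com/.bundle/bundlename/dev/files.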

Now when I run the workflow, I get an error saying it cannot access the test.py file. How do I write the notebook_path so that it is relative to wherever I deploy the bundle?

Unable to access the notebook "/../../src/data_extraction/iban/test.py" in the workspace


Solution

  • According to this documentation, notebook_task expects the path of a notebook, and:

    The path for the notebook to deploy is relative to the configuration file in which this task is declared.

    Since you are deploying a plain Python script rather than a notebook, use a spark_python_task instead, as shown below.

    resources:
      jobs:
        extract_job:
          name: extract_job
    
          trigger:
            # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
            periodic:
              interval: 1
              unit: DAYS
    
          email_notifications:
            on_failure:
              - xx@xx.com
    
          tasks:
            - task_key: notebook_task
              job_cluster_key: job_cluster
              spark_python_task:
                python_file: <path-relative-to-this-yaml-file/test.py>
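
    For example, assuming your job YAML sits one folder below the bundle root (say in resources/; adjust the number of ../ segments to your actual layout), the task block could look like this:

    tasks:
      - task_key: notebook_task
        job_cluster_key: job_cluster
        spark_python_task:
          # The path is resolved relative to this YAML file, not relative to
          # the deployed workspace folder, so count the ../ segments from here.
          python_file: ../src/data_extraction/iban/test.py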
    

    If you still get the error, use the absolute path of the script in the Databricks workspace where it was uploaded.
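
    As an alternative sketch (not from the original answer, but supported by asset bundles), the ${workspace.file_path} substitution resolves to the deployed files folder, so you can reference the uploaded script without hard-coding the user or bundle name:

    tasks:
      - task_key: notebook_task
        job_cluster_key: job_cluster
        spark_python_task:
          # ${workspace.file_path} expands at deploy time to something like
          # /Workspace/Users/xx@xx.com/.bundle/bundlename/dev/files.
          python_file: ${workspace.file_path}/src/data_extraction/iban/test.py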

    Check this for more information.

    To get the full path, go to your Databricks workspace, navigate to Workspace/Users/xx@xx.com/.bundle/bundlename/dev/files, and copy the full path of the uploaded file.

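    For illustration (the exact path will differ in your workspace), the copied absolute path would then be used like this:

    spark_python_task:
      # Absolute workspace path to the synced copy of the script (illustrative).
      python_file: /Workspace/Users/xx@xx.com/.bundle/bundlename/dev/files/src/data_extraction/iban/test.py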