I have a simple Python script that I would like to deploy to Databricks and run as a workflow:
src/data_extraction/iban/test.py:
from pyspark.sql import SparkSession, DataFrame


def get_taxis(spark: SparkSession) -> DataFrame:
    return spark.read.table("samples.nyctaxi.trips")


# Create a new Databricks Connect session. If this fails,
# check that you have configured Databricks Connect correctly.
# See https://docs.databricks.com/dev-tools/databricks-connect.html.
def get_spark() -> SparkSession:
    try:
        from databricks.connect import DatabricksSession
        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        return SparkSession.builder.getOrCreate()


def main():
    get_taxis(get_spark()).show(5)


if __name__ == "__main__":
    main()
And I have my YAML file to deploy it to Databricks:
# The main job for extract.
resources:
  jobs:
    extract_job:
      name: extract_job

      trigger:
        # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
        periodic:
          interval: 1
          unit: DAYS

      email_notifications:
        on_failure:
          - xx@xx.com

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../../src/data_extraction/iban/test.py

        - task_key: main_task
          depends_on:
            - task_key: notebook_task
          job_cluster_key: job_cluster
          python_wheel_task:
            package_name: extract
            entry_point: main
          libraries:
            # By default we just include the .whl file generated for the extract package.
            # See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html
            # for more information on how to add other libraries.
            - whl: ../dist/*.whl

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_D3_v2
            autoscale:
              min_workers: 1
              max_workers: 4
Now I deploy the bundle to Databricks:
databricks bundle deploy --profile dev
It is deployed to Workspace/Users/xx@xx.com/.bundle/bundlename/dev/files.
Now when I run the workflow, I get an error saying it cannot access the test.py file. How do I write my notebook_path so that it is relative to wherever I deploy the bundle?
Unable to access the notebook "/../../src/data_extraction/iban/test.py" in the workspace
According to this documentation, you need to give the notebook path for a notebook task, and also:
"The path for the notebook to deploy is relative to the configuration file in which this task is declared."
Since you are using a plain Python script, add a spark_python_task instead, like below.
resources:
  jobs:
    extract_job:
      name: extract_job

      trigger:
        # Run this job every day, exactly one day from the last run; see https://docs.databricks.com/api/workspace/jobs/create#trigger
        periodic:
          interval: 1
          unit: DAYS

      email_notifications:
        on_failure:
          - xx@xx.com

      tasks:
        - task_key: notebook_task
          job_cluster_key: job_cluster
          spark_python_task:
            python_file: <path-relative-to-this-yaml-file/test.py>
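For example, assuming the job YAML lives at resources/extract_job.yml one level below the bundle root (the folder containing databricks.yml) and the script stays at src/data_extraction/iban/test.py (this layout is inferred from the paths in your question, so adjust it if yours differs), the task would look like:

tasks:
  - task_key: notebook_task
    job_cluster_key: job_cluster
    spark_python_task:
      # one level up from resources/ to the bundle root, then down into src/
      python_file: ../src/data_extraction/iban/test.py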
If you still get the error, use the absolute path from the Databricks workspace where the file is uploaded.
Check this for more information.
To get the full path, go to your Databricks workspace,
navigate to Workspace/Users/xx@xx.com/.bundle/bundlename/dev/files,
and then copy the full path, like below.
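As a rough sketch using the placeholders from your question (the leading /Workspace prefix and the exact subfolders are assumptions, so copy the real path from your workspace instead), the task would then be:

tasks:
  - task_key: notebook_task
    job_cluster_key: job_cluster
    spark_python_task:
      # absolute path to the file inside the deployed bundle's files folder
      python_file: /Workspace/Users/xx@xx.com/.bundle/bundlename/dev/files/src/data_extraction/iban/test.py

Alternatively, the bundle substitution ${workspace.file_path} should resolve to that files folder for whichever target you deploy, so ${workspace.file_path}/src/data_extraction/iban/test.py keeps the job portable across targets.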