I would like to use Databricks Asset Bundles to set the default catalog/schema per environment, so that I can refer to them in scripts, when building paths, etc. The bundle would be deployed with Azure Pipelines. The databricks.yml would, for example, look like this for the catalog:
bundle:
  name: "DAB"

variables:
  default_catalog:
    description: Catalog to set and use
    default: catalog_dev

include:
  - "resources/*.yaml"

targets:
  dev:
    variables:
      default_catalog: catalog_dev
    workspace:
      host: xxxx
  prod:
    variables:
      default_catalog: catalog_prod
    workspace:
      host: xxxx
I have workflow tasks that are Delta Live Tables pipelines; there I can use the catalog variable, e.g.:
pipelines:
  bronze_scd2:
    name: bronze_scd2
    clusters:
      - label: default
        autoscale:
          min_workers: 1
          max_workers: 5
          mode: ENHANCED
    libraries:
      - notebook:
          path: ${workspace.file_path}/1b_bronze/bronze_scd2.py
    target: bronze
    development: false
    catalog: ${var.default_catalog}
But how would you best set it for a notebook task? I could create a widget in every single notebook I am using (as described in the documentation here), but that doesn't seem to be the most efficient way:
tasks:
  - task_key: ingest_api
    notebook_task:
      notebook_path: ${workspace.file_path}/ingest_api
      source: WORKSPACE
      base_parameters:
        catalog: ${var.default_catalog}
    [...]
Additionally, that would require reading it in the notebooks, e.g. like this:
dbutils.widgets.text("catalog", "")
catalog = dbutils.widgets.get("catalog")
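The value would then be used in the notebook, for example to switch the session catalog or to build fully qualified names (a minimal sketch; the schema and table names are just placeholders):

# Minimal sketch: switch the session catalog and build a fully qualified
# table name from the injected value (schema/table names are placeholders).
spark.sql(f"USE CATALOG {catalog}")
bronze_table = f"{catalog}.bronze.api_raw"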
It seems to make the most sense to use DAB as the configuration files, especially if you have a combination of DLT and notebook tasks, but are there any recommendations on how to best share catalog settings/schema names across different Databricks environments (separate configuration files, maybe setting environment variables through the pipelines)? I would appreciate any hints and recommendations.
You can use a parameter, a Spark configuration entry, or a Spark environment variable to make the bundle variables available in a notebook.
Since you are not satisfied with notebook parameters, you can use either a Spark configuration entry or a Spark environment variable.
There are two places you can set this: in the job_clusters settings under the jobs mapping, or in the new_cluster settings under the tasks mapping. Here is a sample.
In the job_clusters settings:
resources:
  jobs:
    <some-unique-programmatic-identifier-for-this-job>:
      # ...
      job_clusters:
        - job_cluster_key: <some-unique-programmatic-identifier-for-this-key>
          new_cluster:
            node_type_id: i3.xlarge
            num_workers: 0
            spark_version: 14.3.x-scala2.12
            spark_conf:
              "taskCatalog": ${var.default_catalog}
You can then access it with Spark like below.
spark.conf.get("taskCatalog")
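For example, a notebook could read the injected value with a fallback and switch to that catalog (a minimal sketch; the fallback value and the USE CATALOG call are assumptions, not part of the sample above):

# Read the catalog set via spark_conf on the job cluster; fall back to the
# dev catalog when the key is not set (the fallback value is an assumption).
catalog = spark.conf.get("taskCatalog", "catalog_dev")
spark.sql(f"USE CATALOG {catalog}")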
Refer to this for more about overriding job cluster settings.
OR
In the new_cluster settings:
resources:
  jobs:
    my-job:
      name: my-job
      tasks:
        - task_key: my-key
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 0
            spark_conf:
              "taskCatalog": ${var.default_catalog}
Refer to this for more about overriding task settings.
In the same way, you can configure Spark environment variables:
spark_env_vars:
  "taskCatalog": ${var.default_catalog}
and access it like below:
import os
default_catalog = os.getenv("taskCatalog", "catalog_dev")
Also, refer to a few samples here.