azure-devops, databricks, dbt

CalledProcessError when running dbt in Databricks


I'm trying to build/schedule a dbt project (sourced in Azure DevOps) in Databricks Workflows. However, whenever I run dbt there, I get the following error message:

CalledProcessError: Command 'b'\nmkdir -p "/tmp/tmp-dbt-run-1124228490001263"\nunexpected_errors="$(cp -a -u "/Workspace/Repos/.internal/085c4ffe5e_commits/16113d05ffd8cd7b148ed973080aa51439e98b0c/." "/tmp/tmp-dbt-run-1124228490001263" 2> >(grep -v \'Operation not supported\'))"\nif [[ -n "$unexpected_errors" ]]; then\n  >&2 echo -e "Unexpected error(s) encountered while copying:\n$unexpected_errors"\n  exit 1\nfi\n        returned non-zero exit status 1.

Unexpected error(s) encountered while copying:
cp: cannot stat '/Workspace/Repos/.internal/085c4ffe5e_commits/16113d05ffd8cd7b148ed973080aa51439e98b0c/./venv/share/doc/networkx-3.1/examples/3d_drawing/__pycache__': No such file or directory
cp: cannot stat '/Workspace/Repos/.internal/085c4ffe5e_commits/16113d05ffd8cd7b148ed973080aa51439e98b0c/./venv/share/doc/networkx-3.1/examples/algorithms/__pycache__': No such file or directory
cp: cannot stat '/Workspace/Repos/.internal/085c4ffe5e_commits/16113d05ffd8cd7b148ed973080aa51439e98b0c/./venv/share/doc/networkx-3.1/examples/basic/__pycache__': No such file or directory
cp: cannot stat '/Workspace/Repos/.internal/085c4ffe5e_commits/16113d05ffd8cd7b148ed973080aa51439e98b0c/./venv/share/doc/networkx-3.1/examples/drawing/__pycache__': No such file or directory
cp: cannot stat '/Workspace/Repos/.internal/085c4ffe5e_commits/16113d05ffd8cd7b148ed973080aa51439e98b0c/./venv/share/doc/networkx-3.1/examples/graph/__pycache__': No such file or directory
cp: cannot stat '/Workspace/Repos/.internal/085c4ffe5e_commits/16113d05ffd8cd7b148ed973080aa51439e98b0c/./venv/share/doc/networkx-3.1/examples/subclass/__pycache__': No such file or directory

I gather the issue arises while the repo files are being copied into the temporary run directory, but I don't know how to solve it. Any ideas?

These are the task settings:

resources:
  jobs:
    otd:
      name: otd
      email_notifications:
        on_failure:
          - mauricio.schwartsman@xxxxxxxx.com
        no_alert_for_skipped_runs: true
      notification_settings:
        no_alert_for_skipped_runs: true
        no_alert_for_canceled_runs: true
      tasks:
        - task_key: otd_dbt
          dbt_task:
            project_directory: ""
            commands:
              - dbt deps
              - dbt build -s +otd_total
            schema: gold
            warehouse_id: xxxxxxxxxxx
            catalog: logistics_prd
            source: GIT
          job_cluster_key: dbt_CLI
          libraries:
            - pypi:
                package: dbt-databricks>=1.0.0,<2.0.0
      job_clusters:
        - job_cluster_key: dbt_CLI
          new_cluster:
            cluster_name: ""
            spark_version: 15.4.x-scala2.12
            spark_conf:
              spark.master: local[*, 4]
              spark.databricks.cluster.profile: singleNode
            azure_attributes:
              first_on_demand: 1
              availability: ON_DEMAND_AZURE
              spot_bid_max_price: -1
            node_type_id: Standard_D4ds_v5
            custom_tags:
              ResourceClass: SingleNode
            spark_env_vars:
              PYSPARK_PYTHON: /databricks/python3/bin/python3
            enable_elastic_disk: true
            data_security_mode: SINGLE_USER
            runtime_engine: PHOTON
            num_workers: 0
      git_source:
        git_url: https://dev.azure.com/copa-energia/Logistics/_git/dbt_logistica
        git_provider: azureDevOpsServices
        git_branch: main
      queue:
        enabled: true

Please feel free to ask me for more details.


Solution

  • As it turns out, the solution was simpler than I expected.

    Since those files are not necessary, I could simply remove them from the repo and add them to .gitignore (the untracking commands are sketched after the entries below):

    venv/
    __pycache__/
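
    For reference, untracking files that were already committed usually looks something like the following. This is only a sketch, run from the repo root after adding the .gitignore entries; the venv path and the __pycache__ cleanup are assumptions based on the error output above.

    # untrack the committed virtual environment; the files stay on disk
    git rm -r --cached venv

    # untrack any stray __pycache__ directories outside venv, if present
    find . -type d -name "__pycache__" -prune -exec git rm -r --cached --ignore-unmatch {} +

    git commit -m "Remove venv and __pycache__ from version control"
    git push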