pyspark, databricks, azure-databricks, aws-databricks, databricks-dbx

Running local Python code with arguments in Databricks via the dbx utility


I am trying to execute a local PySpark script on a Databricks cluster via the dbx utility, to test how passing arguments to Python works in Databricks when developing locally. However, the test arguments I am passing are not being read for some reason. Could someone help? I am following this guide, but it is a bit unclear and lacks good examples: https://dbx.readthedocs.io/en/latest/quickstart.html I also found this, but it is not clear either: How can I pass and than get the passed arguments in databricks job

The Databricks manuals are not very clear in this area.

My PySpark script:

import sys

n = len(sys.argv)
print("Total arguments passed:", n)

print("Script name", sys.argv[0])

print("\nArguments passed:", end=" ")
for i in range(1, n):
    print(sys.argv[i], end=" ")
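
For reference, running the same script locally picks the arguments up as expected (output shown as comments, actual values may differ on your machine):

python parameter-test.py test-argument-1 test-argument-2
# Total arguments passed: 3
# Script name parameter-test.py
#
# Arguments passed: test-argument-1 test-argument-2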

dbx deployment.json:

{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
            "python_file": "parameter-test.py"
        },
        "parameters": [
          "test-argument-1",
          "test-argument-2"
        ]
      }
    ]
  }
}

dbx execute command:

dbx execute \
  --cluster-id=<redacted> \
  --job=parameter-test \
  --deployment-file=conf/deployment.json \
  --no-rebuild \
  --no-package

Output:

(parameter-test) user@735 parameter-test % /bin/zsh /Users/user/g-drive/git/parameter-test/parameter-test.sh
[dbx][2022-07-26 10:34:33.864] Using profile provided from the project file
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verifying it
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verification successful
[dbx][2022-07-26 10:34:33.866] Profile DEFAULT will be used for deployment
[dbx][2022-07-26 10:34:35.897] Executing job: parameter-test in environment default on cluster None (id: 0513-204842-7b2r325u)
[dbx][2022-07-26 10:34:35.897] No rebuild will be done, please ensure that the package distribution is in dist folder
[dbx][2022-07-26 10:34:35.897] Using the provided deployment file conf/deployment.json
[dbx][2022-07-26 10:34:35.899] Preparing interactive cluster to accept jobs
[dbx][2022-07-26 10:34:35.997] Cluster is ready
[dbx][2022-07-26 10:34:35.998] Preparing execution context
[dbx][2022-07-26 10:34:36.534] Existing context is active, using it
[dbx][2022-07-26 10:34:36.992] Requirements file requirements.txt is not provided, following the execution without any additional packages
[dbx][2022-07-26 10:34:36.992] Package was disabled via --no-package, only the code from entrypoint will be used
[dbx][2022-07-26 10:34:37.161] Processing parameters
[dbx][2022-07-26 10:34:37.449] Processing parameters - done
[dbx][2022-07-26 10:34:37.449] Starting entrypoint file execution
[dbx][2022-07-26 10:34:37.767] Command successfully executed
Total arguments passed: 1
Script name python

Arguments passed:
[dbx][2022-07-26 10:34:37.768] Command execution finished
(parameter-test) user@735 parameter-test % 

Please help :)


Solution

  • It turns out the format of the parameters section in my deployment.json was incorrect: the parameters array must be nested inside spark_python_task, not placed at the job level. Here is the corrected example:

    {
      "default": {
        "jobs": [
          {
            "name": "parameter-test",
            "spark_python_task": {
              "python_file": "parameter-test.py",
              "parameters": [
                "test1",
                "test2"
              ]
            }
          }
        ]
      }
    }
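
    With parameters nested inside spark_python_task, they reach the script as ordinary command-line arguments, so the sys.argv loop from the question prints them. As a rough sketch, they can also be read in a more structured way with argparse (the argument name "params" is just illustrative, nothing dbx-specific):

    import argparse

    # Collect all positional arguments passed via spark_python_task.parameters
    parser = argparse.ArgumentParser()
    parser.add_argument("params", nargs="*")
    args = parser.parse_args()

    print("Arguments passed:", args.params)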
    

    I've also posted my original question in the Databricks forum: https://community.databricks.com/s/feed/0D58Y00008znXBxSAM?t=1659032862560 Hope it helps someone else.