I am trying to execute a local PySpark script on a Databricks cluster via the dbx utility, to test how passing arguments to Python works in Databricks when developing locally. However, the test arguments I am passing are not being read for some reason. Could someone help? I am following this guide, but it is a bit unclear and lacks good examples: https://dbx.readthedocs.io/en/latest/quickstart.html I also found this, but it is not clear either: How can I pass and then get the passed arguments in databricks job
The Databricks documentation is not very clear in this area.
My PySpark script:
import sys
n = len(sys.argv)
print("Total arguments passed:", n)
print("Script name", sys.argv[0])
print("\nArguments passed:", end=" ")
for i in range(1, n):
    print(sys.argv[i], end=" ")
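For completeness, a minimal sketch of reading named parameters with argparse instead of raw sys.argv; the flag names (--input-path, --output-path) are hypothetical and not part of my actual job:

import argparse

# Parse named command-line flags; argparse maps "--input-path" to args.input_path
parser = argparse.ArgumentParser()
parser.add_argument("--input-path")   # hypothetical flag name
parser.add_argument("--output-path")  # hypothetical flag name
args = parser.parse_args()

print("input_path:", args.input_path)
print("output_path:", args.output_path)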
dbx deployment.json:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py"
        },
        "parameters": [
          "test-argument-1",
          "test-argument-2"
        ]
      }
    ]
  }
}
dbx execute command:
dbx execute \
  --cluster-id=<redacted> \
  --job=parameter-test \
  --deployment-file=conf/deployment.json \
  --no-rebuild \
  --no-package
Output:
(parameter-test) user@735 parameter-test % /bin/zsh /Users/user/g-drive/git/parameter-test/parameter-test.sh
[dbx][2022-07-26 10:34:33.864] Using profile provided from the project file
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verifying it
[dbx][2022-07-26 10:34:33.866] Found auth config from provider ProfileEnvConfigProvider, verification successful
[dbx][2022-07-26 10:34:33.866] Profile DEFAULT will be used for deployment
[dbx][2022-07-26 10:34:35.897] Executing job: parameter-test in environment default on cluster None (id: 0513-204842-7b2r325u)
[dbx][2022-07-26 10:34:35.897] No rebuild will be done, please ensure that the package distribution is in dist folder
[dbx][2022-07-26 10:34:35.897] Using the provided deployment file conf/deployment.json
[dbx][2022-07-26 10:34:35.899] Preparing interactive cluster to accept jobs
[dbx][2022-07-26 10:34:35.997] Cluster is ready
[dbx][2022-07-26 10:34:35.998] Preparing execution context
[dbx][2022-07-26 10:34:36.534] Existing context is active, using it
[dbx][2022-07-26 10:34:36.992] Requirements file requirements.txt is not provided, following the execution without any additional packages
[dbx][2022-07-26 10:34:36.992] Package was disabled via --no-package, only the code from entrypoint will be used
[dbx][2022-07-26 10:34:37.161] Processing parameters
[dbx][2022-07-26 10:34:37.449] Processing parameters - done
[dbx][2022-07-26 10:34:37.449] Starting entrypoint file execution
[dbx][2022-07-26 10:34:37.767] Command successfully executed
Total arguments passed: 1
Script name python
Arguments passed:
[dbx][2022-07-26 10:34:37.768] Command execution finished
(parameter-test) user@735 parameter-test %
Please help :)
It turns out the "parameters" section of my deployment.json was in the wrong place: it must be nested inside the spark_python_task block, not sit at the job level. Here is the corrected example:
{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py",
          "parameters": [
            "test1",
            "test2"
          ]
        }
      }
    ]
  }
}
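Since the values in "parameters" are passed to the script as command-line arguments in order, named flags should also work and can be read with argparse as sketched above. A minimal sketch with hypothetical flag names and values:

{
  "default": {
    "jobs": [
      {
        "name": "parameter-test",
        "spark_python_task": {
          "python_file": "parameter-test.py",
          "parameters": [
            "--input-path", "/tmp/in",
            "--output-path", "/tmp/out"
          ]
        }
      }
    ]
  }
}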
I've also posted my original question in the Databricks forum: https://community.databricks.com/s/feed/0D58Y00008znXBxSAM?t=1659032862560 Hope it helps someone else.