Tags: bash, azure, command-line, databricks

Submitting jobs with different parameters using the Databricks CLI


I have a jar and an associated properties file. In order to run the jar, this is what I do on Databricks on Azure:

I click on:

  +Create Job

      Task: com.xxx.sparkmex.core.ModelExecution in my.jar - Edit / Upload JAR / Remove
      Parameters: Edit
        Main Class: com.xxx.sparkmex.core.ModelExecution
        Arguments: ["-file","/dbfs/mnt/mypath/myPropertyFile.properties","-distributed"]

      Cluster: MyCluster 

and then I click Run Now

I am trying to achieve the same using the Databricks CLI.

This is what I am doing/want to do:

1) upload the properties file

dbfs cp myPropertyFile.properties dbfs:/mnt/mypath/myPropertyFile.properties 

2) Create a job: databricks jobs create. When I do this, it asks for a --json-file. Where do I get the JSON file from?

3) Upload the jar file: how do I upload the jar file?

4) Upload the property file: how do I upload the properties file?

5) restart the cluster: databricks clusters restart --cluster-id MYCLUSTERID

6) Run the job

and repeat. The reason I want to repeat is that each run uses a new properties file with different settings. I do not know how to do steps 2 to 4 and step 5.


Solution

  • For step two, you will need to create the JSON file yourself. Think of it as the job definition, including the cluster it runs on. Here's an example that creates a job from a JAR file (the node type and aws_attributes are AWS-specific; on Azure you would use an Azure node type and omit aws_attributes):

    {
      "name": "SparkPi JAR job",
      "new_cluster": {
        "spark_version": "5.2.x-scala2.11",
        "node_type_id": "r3.xlarge",
        "aws_attributes": {"availability": "ON_DEMAND"},
        "num_workers": 2
      },
      "libraries": [{"jar": "dbfs:/docs/sparkpi.jar"}],
      "spark_jar_task": {
        "main_class_name": "org.apache.spark.examples.SparkPi",
        "parameters": ["10"]
      }
    }
    

    Save that as a JSON file and pass it to databricks jobs create with the --json-file option. If you call the Jobs REST API directly, you can also include the JSON inline in the curl command; the Jobs API documentation has an example of that. You would want to pass /dbfs/mnt/mypath/myPropertyFile.properties as one of the strings in the "parameters" array of spark_jar_task.
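
    A minimal sketch of that step, assuming the legacy Databricks CLI (the file name create-job.json is a placeholder):

    # Create the job from the JSON definition; the command prints the new job ID
    databricks jobs create --json-file create-job.json

    # In your case the spark_jar_task block would look more like:
    #   "spark_jar_task": {
    #     "main_class_name": "com.xxx.sparkmex.core.ModelExecution",
    #     "parameters": ["-file", "/dbfs/mnt/mypath/myPropertyFile.properties", "-distributed"]
    #   }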

    You can upload the JAR and the properties file the same way you performed step 1, using the DBFS command group of the Databricks CLI:

    databricks fs cp /path_to_local_file/myJar.jar dbfs:/path_to_file_on_dbfs/
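
    The properties file goes up the same way; reusing the mount path from step 1 (the --overwrite flag lets you replace it on each run):

    databricks fs cp --overwrite myPropertyFile.properties dbfs:/mnt/mypath/myPropertyFile.properties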
    

    After you create the job and have the job ID, you can use the run-now API to kick it off.
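
    A sketch of that call with the legacy CLI (the job ID 42 is a placeholder; --jar-params overrides the JAR arguments for this run):

    databricks jobs run-now --job-id 42 \
      --jar-params '["-file", "/dbfs/mnt/mypath/myPropertyFile.properties", "-distributed"]'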

    If you want to automate this process and make it repeatable, you could write a bash script that takes arguments and makes calls to the CLI. Alternatively, you could use the Python wrappers for the CLI or write a Python script that manages the REST API calls yourself.
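
    A rough bash sketch of that loop (the job ID, the configs/ directory, and the DBFS path are placeholders you would substitute):

    #!/usr/bin/env bash
    # Re-run the same Databricks job once per properties file.
    set -euo pipefail

    JOB_ID=42                                                  # from `databricks jobs create`
    DBFS_PATH="dbfs:/mnt/mypath/myPropertyFile.properties"     # path the JAR reads via /dbfs/...

    for props in configs/*.properties; do
        # Replace the properties file the job reads
        databricks fs cp --overwrite "$props" "$DBFS_PATH"

        # Trigger a run with the same arguments the UI job used
        databricks jobs run-now --job-id "$JOB_ID" \
            --jar-params '["-file", "/dbfs/mnt/mypath/myPropertyFile.properties", "-distributed"]'
    done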