Tags: go, google-cloud-platform, pyspark, google-cloud-dataproc, google-cloud-dataproc-serverless

Programmatically cancelling a PySpark Dataproc batch job


Using Go, I have several Dataproc batch jobs running, and I can access them via their UUIDs after creating a client like this:

batchClient, err := dataproc.NewBatchControllerClient(ctx, options...)

If I want to delete a batch job, I can do it with Google Cloud's Go client library like this (the request identifies the batch to delete):

err := batchClient.DeleteBatch(ctx, request, options...)
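
For reference, a minimal self-contained sketch of that setup and delete call might look like the following (the regional endpoint, the project/region/batch placeholders, and the v2-module import paths are assumptions rather than code from the original post):

    import (
        "context"
        "fmt"

        dataproc "cloud.google.com/go/dataproc/v2/apiv1"
        "cloud.google.com/go/dataproc/v2/apiv1/dataprocpb"
        "google.golang.org/api/option"
    )

    func deleteBatch(ctx context.Context, project, region, batchID string) error {
        // Dataproc batches are served from regional endpoints,
        // e.g. "us-central1-dataproc.googleapis.com:443".
        batchClient, err := dataproc.NewBatchControllerClient(ctx,
            option.WithEndpoint(region+"-dataproc.googleapis.com:443"))
        if err != nil {
            return err
        }
        defer batchClient.Close()

        // DeleteBatch identifies the batch by its resource name.
        return batchClient.DeleteBatch(ctx, &dataprocpb.DeleteBatchRequest{
            Name: fmt.Sprintf("projects/%s/locations/%s/batches/%s", project, region, batchID),
        })
    }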

However, there doesn't seem to be any way to programmatically cancel a batch that's already running. If I try to delete a running batch, I rightfully get a FAILED_PRECONDITION error.

Now, I'm aware that the Google Cloud SDK CLI has a simple way to cancel a job:

gcloud dataproc batches cancel (BATCH : --region=REGION) [GCLOUD_WIDE_FLAG …]

Unfortunately, this approach is not a good fit for my application.


Solution

  • The functionality for serverless (batch) job handling was added in version 2.0 of the Dataproc Go client library.

    To access this version, the following import paths had to be updated:

        dataproc "cloud.google.com/go/dataproc/v2/apiv1"
        dataprocpb "cloud.google.com/go/dataproc/v2/apiv1/dataprocpb"

    Afterwards, the batch client's CancelOperation method can be used to cancel a running serverless batch job, using the same client that is used to delete a batch job:

    err := batchClient.CancelOperation(ctx, request, options...)
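
    As a minimal sketch of the full cancellation flow (project, region, and batch ID are placeholder values), one way to obtain the name of the batch's long-running operation is to read it from the batch itself via GetBatch and then pass it to CancelOperation:

        import (
            "context"
            "fmt"

            dataproc "cloud.google.com/go/dataproc/v2/apiv1"
            "cloud.google.com/go/dataproc/v2/apiv1/dataprocpb"
            "cloud.google.com/go/longrunning/autogen/longrunningpb"
        )

        func cancelBatch(ctx context.Context, batchClient *dataproc.BatchControllerClient, project, region, batchID string) error {
            // Fetch the batch to find the long-running operation attached to it.
            batch, err := batchClient.GetBatch(ctx, &dataprocpb.GetBatchRequest{
                Name: fmt.Sprintf("projects/%s/locations/%s/batches/%s", project, region, batchID),
            })
            if err != nil {
                return err
            }

            // Cancelling that operation stops the running batch job.
            return batchClient.CancelOperation(ctx, &longrunningpb.CancelOperationRequest{
                Name: batch.GetOperation(),
            })
        }

    Cancelling only stops the job; the batch resource itself remains (in a cancelled state) and can then be removed with DeleteBatch.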