We have a DataFusion pipeline that is triggered by a Cloud Composer DAG. The pipeline provisions an ephemeral DataProc cluster which, ideally, terminates after finishing its tasks.
In our case, the ephemeral DataProc cluster sometimes (not always) gets stuck in a running state. The job inside the cluster also stays in a running state, and the last log messages are the following:
INFO runtimejob.DataprocJobMain: Invoking initialize() on io.cdap.cdap.runtime.spi.runtimejob.DataprocRuntimeEnvironment with spark2_2.11
INFO runtimejob.DataprocJobMain: Invoking run() on io.cdap.cdap.internal.app.runtime.distributed.runtimejob.DefaultRuntimeJob
INFO runtimejob.DataprocJobMain: Invoking destroy() on io.cdap.cdap.internal.app.runtime.distributed.runtimejob.DefaultRuntimeJob
INFO runtimejob.DataprocJobMain: Runtime job completed.
Exception: java.lang.NoClassDefFoundError thrown from the UncaughtExceptionHandler in thread " STARTING-SendThread(cdap-<our-identifier>-1f11111b-1d11-11eb-b1a1-1a111fb11d11-m.c.<our-gcp-project-name>.internal:41409)"
Exception: java.lang.NoClassDefFoundError thrown from the UncaughtExceptionHandler in thread "threadDeathWatcher-2-1"
On the DataFusion side, the pipeline is marked as successful. The DataFusion logs are the following:
Completed DEPROVISION subtask REQUESTING_DELETE for program run program_run: <data_fusion_namespace>.<pipeline_name>.-SNAPSHOT.workflow.DataPipelineWorkflow.<data_proc_id> // this message is repeated many times
DEBUG [provisioning-service-4:i.c.c.c.s.Retries@197] - Retries exhausted after 1 failures and 14 ms.
Any ideas what is causing this issue?
P.S.: The identifiers in the messages were replaced with random values.
Which version of Data Fusion are you running? Also, how much memory does the Dataproc cluster have? We sometimes observe this issue when the Dataproc cluster runs out of memory. I would suggest increasing the amount of memory.
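
If you want to try more memory without editing the compute profile itself, one option is to pass overrides as runtime arguments from the Composer DAG. The snippet below is only a sketch: the helper name `memory_runtime_args` is hypothetical, and the keys assume that the Dataproc provisioner accepts `masterMemoryMB` and `workerMemoryMB` under the `system.profile.properties.` prefix, so please check them against the documentation for your Data Fusion version before relying on them.

```python
# Assumption: profile properties can be overridden per run with this prefix.
PROFILE_PREFIX = "system.profile.properties."


def memory_runtime_args(master_memory_mb, worker_memory_mb):
    """Build a runtime-arguments dict that asks for more cluster memory.

    Hypothetical helper; the property names are assumptions to verify.
    """
    for role, value in (("master", master_memory_mb), ("worker", worker_memory_mb)):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{role} memory must be a positive integer (MB)")
    # Runtime arguments are passed as strings.
    return {
        PROFILE_PREFIX + "masterMemoryMB": str(master_memory_mb),
        PROFILE_PREFIX + "workerMemoryMB": str(worker_memory_mb),
    }


if __name__ == "__main__":
    # Example values only; pick sizes that fit your workload.
    print(memory_runtime_args(8192, 16384))
```

The resulting dict could then be handed to whatever starts the pipeline in your DAG, for example through the `runtime_args` parameter if you use the Airflow Data Fusion start-pipeline operator.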