workflow, cadence-workflow, temporal-workflow, uber-cadence

Uber Cadence - Timer started


I am new to the Uber Cadence framework and am currently working on a workflow management project that uses Cadence. I am seeing strange behavior: every workflow has a timer of ~270 hours as part of its flow, and I am not sure how that number is calculated or where the timer is coming from.

The other issue is that once the timer fires, the workflows fail (they are not terminated) with an UNHANDLED_DECISION error. The exception keeps being thrown and is spamming the logs. Here is the stack trace:

    com.uber.cadence.internal.replay.NonDeterminisicWorkflowError: Unknown DecisionId{decisionTarget=TIMER, decisionEventId=11}. The possible causes are a nondeterministic workflow definition code or an incompatible change in the workflow definition.
        at com.uber.cadence.internal.replay.DecisionsHelper.getDecision(DecisionsHelper.java:733)
        at com.uber.cadence.internal.replay.DecisionsHelper.handleTimerStarted(DecisionsHelper.java:451)
        at com.uber.cadence.internal.replay.ReplayDecider.processEvent(ReplayDecider.java:229)
        at com.uber.cadence.internal.replay.ReplayDecider.decideImpl(ReplayDecider.java:452)
        at com.uber.cadence.internal.replay.ReplayDecider.decide(ReplayDecider.java:385)
        at com.uber.cadence.internal.replay.ReplayDecisionTaskHandler.processDecision(ReplayDecisionTaskHandler.java:145)
        at com.uber.cadence.internal.replay.ReplayDecisionTaskHandler.handleDecisionTaskImpl(ReplayDecisionTaskHandler.java:125)
        at com.uber.cadence.internal.replay.ReplayDecisionTaskHandler.handleDecisionTask(ReplayDecisionTaskHandler.java:86)
        at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:257)
        at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:229)
        at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)


Can somebody explain what is happening here and what this timer is? Is there a way to handle this error gracefully and avoid spamming the logs? There are thousands of workflows like this in the test environment; is there a way to terminate them all using Cadence Web or some other tool? Thanks in advance.

Edit: I have two code blocks where I am using LocalDateTime and Workflow.sleep for waiting.

  1. If the workflow startTime is in the future, calculate the wait time and then sleep

    if (LocalDateTime.now(CampaignConstants.ZONE_ID_UTC).isBefore(myWorkflow.getStartDateTime())) {
        Duration waitDuration = Duration.between(LocalDateTime.now(), myWorkflow.getStartDateTime());
        Workflow.sleep(waitDuration);
    }

  2. If the workflow step is to wait for a specified time period, call Workflow.sleep with the specified time

    Integer waitPeriod = Integer.parseInt((String) props.get("waitPeriod"));
    ChronoUnit chronoUnit = ChronoUnit.valueOf((String) props.get("waitPeriodType"));
    Workflow.sleep(Duration.of(waitPeriod, chronoUnit));

Is this the right way to implement this? It does not seem to be, so what is the proper way of implementing this functionality? Thanks.


Solution

  • I'm not a Cadence expert, but since you also tagged this with "temporal-workflow" I can give it a shot :)

    The TimerStarted->Fired events seem to come from your workflow code. Check whether your code calls Workflow.sleep, or creates a timer and waits for it to complete.

    After the timer fires, you have a decision task that times out on its ScheduleToStart timeout, meaning the task was placed on a task queue but was not picked up by one of your workers.

    This task is then placed again on the global ("normal") task queue partition, and most likely another one of your workers picked it up (check the identity field on your WorkflowTaskStarted events; in Cadence these are called DecisionTaskStarted). That worker did not have the execution history in its in-memory cache, so it had to pull the whole history from the service and replay the workflow internally, which led to the non-deterministic error. I would check your code to see whether you are using the system clock to calculate sleep durations, or doing some other kind of non-deterministic thing in the workflow code. If you can share your code, I could take a look. Hope this helps.
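
If the system clock does turn out to be the culprit, here is a minimal sketch of how the wait could be computed from a timestamp that is passed in rather than read from the system clock. The class and method names (DeterministicWait, waitUntil) are made up for illustration. The sketch assumes that inside the workflow you would pass in the value of Workflow.currentTimeMillis(), which as far as I know is the Cadence Java client's replay-safe clock, and that the start time is expressed in UTC.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Hypothetical helper, not part of Cadence: it never reads the system clock itself.
class DeterministicWait {

    // workflowNowMillis: assumed to come from Workflow.currentTimeMillis() in workflow code,
    //                    so the same value is seen again when the history is replayed.
    // startUtc:          the desired start time, assumed to be expressed in UTC.
    static Duration waitUntil(long workflowNowMillis, LocalDateTime startUtc) {
        LocalDateTime nowUtc =
                LocalDateTime.ofInstant(Instant.ofEpochMilli(workflowNowMillis), ZoneOffset.UTC);
        Duration wait = Duration.between(nowUtc, startUtc);
        // A start time in the past means there is nothing to wait for.
        return wait.isNegative() ? Duration.ZERO : wait;
    }

    public static void main(String[] args) {
        // Stand-in for Workflow.currentTimeMillis(); a fixed value keeps the demo repeatable.
        long nowMillis = 0L; // 1970-01-01T00:00:00Z
        LocalDateTime start = LocalDateTime.of(1970, 1, 1, 1, 0);
        System.out.println(waitUntil(nowMillis, start));
    }
}
```

In the workflow, the result would go straight into Workflow.sleep(...). Both sides of the comparison then use the same clock and the same zone, whereas your first snippet mixes LocalDateTime.now(ZONE_ID_UTC) with a plain LocalDateTime.now().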