[SOLVED] Is there a Cadence metric that can help spot overloads for each specific activity worker?

My company would like to automatically scale each activity worker and each workflow worker independently, according to the load on its tasklist.

Reading the docs, I found the following metrics for activity workers:

cadence_activity_scheduled_to_start_latency_bucket
cadence_activity_scheduled_to_start_latency_count
cadence_activity_scheduled_to_start_latency_sum

However, these seem to be global metrics across all activity workers. Is there a Cadence metric that would let me spot overloads for each specific activity worker?

Example: We have 4 different activity workers: A, B, C and D. We would like to scale A, B, C or D independently, without impacting the others.

Solution

Understand scheduled_to_start_latency

scheduled_to_start_latency measures the time between an activity task being scheduled and a worker starting it. During that interval, the task is transferred from the matching service to an activity worker.
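
To make that latency concrete, here is a rough sketch of how a percentile could be estimated from the cumulative counts behind cadence_activity_scheduled_to_start_latency_bucket, using linear interpolation inside the matching bucket. The function name, the bucket bounds, and the idea of keeping one histogram per tasklist ("A", "B") are assumptions made for illustration; whether your metrics carry a per-tasklist label depends on how your workers report them.

```python
import math
from typing import Dict, List, Tuple

# (upper_bound_seconds, cumulative_count), sorted by upper bound.
Buckets = List[Tuple[float, float]]


def estimate_quantile(q: float, buckets: Buckets) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets."""
    if not buckets:
        raise ValueError("no buckets given")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into the open-ended bucket.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket, avoid division by zero
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical data: one histogram per tasklist.
    per_tasklist: Dict[str, Buckets] = {
        "A": [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)],
        "B": [(0.1, 5), (0.5, 20), (1.0, 40), (math.inf, 100)],
    }
    for name, buckets in per_tasklist.items():
        print(name, estimate_quantile(0.95, buckets))
```

If you use Prometheus, its histogram_quantile function does a similar interpolation for you; the sketch only shows what the number means.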

These are the potential hotspots when this latency gets high:

The matching service is too hot to dispatch tasks -- in this case, confirm by checking the CPU/memory usage of the matching nodes.
The tasklist is overloaded because, by default, it has a single partition, which maps to only one matching node: https://cadenceworkflow.io/docs/operation-guide/maintain/#scale-up-a-tasklist-using-scalable-tasklist-feature -- in this case, use the tasks-per-second metrics to confirm the task rate of the tasklist.
The activity worker is overloaded.

How to monitor activity worker being overloaded

CPU, memory, thread usage and garbage collection metrics of the activity worker are usually enough to confirm that a worker is not overloaded.
You can also use scheduled_to_start_latency, but high latency can come from any of the causes listed above. Use the other metrics to rule those causes out.
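
The rule-out process can be sketched as a small check that only blames the worker when the latency is high and the worker's own resource metrics agree. This is a minimal illustration, not something Cadence provides: the Snapshot fields, the likely_causes function, and all threshold values are hypothetical placeholders that you would need to tune for your own deployment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical per-tasklist metrics snapshot gathered by your monitoring."""
    latency_p95_s: float         # scheduled_to_start_latency, 95th percentile
    matching_cpu: float          # CPU utilization of matching nodes, 0.0 to 1.0
    tasklist_tasks_per_s: float  # task rate of this tasklist
    worker_cpu: float            # CPU utilization of this activity worker, 0.0 to 1.0


def likely_causes(
    s: Snapshot,
    latency_limit_s: float = 1.0,         # placeholder threshold
    cpu_limit: float = 0.8,               # placeholder threshold
    tasklist_rate_limit: float = 1000.0,  # placeholder threshold
) -> List[str]:
    """Return the candidate causes that the other metrics do not rule out."""
    if s.latency_p95_s <= latency_limit_s:
        return []  # latency is fine, nothing to explain
    causes = []
    if s.matching_cpu > cpu_limit:
        causes.append("matching service too hot")
    if s.tasklist_tasks_per_s > tasklist_rate_limit:
        causes.append("tasklist overloaded")
    if s.worker_cpu > cpu_limit:
        causes.append("activity worker overloaded")
    return causes


if __name__ == "__main__":
    snapshot = Snapshot(latency_p95_s=3.0, matching_cpu=0.3,
                        tasklist_tasks_per_s=50.0, worker_cpu=0.95)
    print(likely_causes(snapshot))
```

An autoscaler for worker A would then add instances only when "activity worker overloaded" is among the results for A's tasklist, leaving B, C and D untouched.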