When running workflows in my QA environment, I encounter a strange issue with Cadence. On occasion, when executing a workflow, I get the following error in the logs:
{"level":"error","ts":"2025-04-09T10:06:58.716-0300","msg":"Operation failed with internal error.","service":"cadence-history","error":"UpdateShard operation failed. Error: rowsAffected returned 2 shards instead of one","metric-scope":2,"logging-call-at":"base.go:112"}
{"level":"error","ts":"2025-04-09T10:06:58.716-0300","msg":"renewRangeLocked failed.","service":"cadence-history","shard-id":0,"address":"myAddress:7934","store-operation":"update-shard","error":"UpdateShard operation failed. Error: rowsAffected returned 2 shards instead of one","shard-range-id":20,"previous-shard-range-id":19,"logging-call-at":"context.go:1046"}
{"level":"error","ts":"2025-04-09T10:06:58.716-0300","msg":"Internal service error","service":"cadence-history","error":"UpdateShard operation failed. Error: rowsAffected returned 2 shards instead of one","wf-id":"WID","wf-run-id":"","wf-domain-id":"a0460a71-e68a-4da1-a113-75767a7b6c17","logging-call-at":"handler.go:2097"}
The error indicates that the UpdateShard operation affected 2 shards instead of one, which is unexpected. A few questions:
Why might the workflow run correctly sometimes but fail at other times?
Could certain conditions or state differences between the QA and development environments be triggering this inconsistent behavior?
What debugging steps or configuration aspects should I check to isolate and resolve the issue, particularly focusing on the differences between QA and development environments?
Since I am using Postgres, I looked at the Cadence code to see how these operations are performed for SQL databases.
In common\persistence\sql\sql_shard_store.go there is a function UpdateShard that produces the first message in the logs above: Error: rowsAffected returned 2 shards instead of one.
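The failing check is a common optimistic-concurrency pattern: the shard row is updated conditionally on its current range_id, and exactly one affected row is expected. This is not Cadence's actual code, just a minimal sketch of that validation and what each outcome means:

```go
package main

import "fmt"

// validateShardUpdate mirrors the sanity check described above: a conditional
// UPDATE on a single shard row is expected to affect exactly one row.
// Zero rows would mean the range_id precondition failed (another history host
// took ownership of the shard); more than one row means the shards table
// itself is inconsistent, e.g. a duplicated shard_id.
func validateShardUpdate(rowsAffected int64) error {
	if rowsAffected != 1 {
		return fmt.Errorf("UpdateShard operation failed. Error: rowsAffected returned %d shards instead of one", rowsAffected)
	}
	return nil
}

func main() {
	fmt.Println(validateShardUpdate(1)) // <nil>
	fmt.Println(validateShardUpdate(2)) // the error seen in the logs above
}
```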
After inspecting the shards table, I noticed a duplicated row. The strange part is that the duplicated value was in the primary key column (shard_id = 0).
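If you hit the same symptom, a quick way to confirm whether any shard row is duplicated is a grouped count (a sketch assuming the table is named shards, as the error suggests; verify the table name against your own schema):

```sql
-- List shard_id values that appear more than once.
-- A healthy table returns zero rows, since shard_id is the primary key.
SELECT shard_id, COUNT(*) AS copies
FROM shards
GROUP BY shard_id
HAVING COUNT(*) > 1;
```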
So far I have not discovered why the primary key was duplicated. Postgres enforces a uniqueness constraint on the shard_id column, and common\persistence\sql\sql_shard_store.go has a CreateShard function that guarantees no duplicates at the application level. As far as I can tell, a duplicate value in a primary-key column in Postgres usually points to corruption of the unique index backing the constraint rather than an application bug.
After removing the duplicate row, the system is working perfectly.
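For reference, since both copies share the same shard_id, the extra row cannot be targeted by its key alone. One way to remove it is via the Postgres ctid system column, which identifies a row's physical location and therefore distinguishes the two copies (a sketch; the ctid value must come from your own SELECT output, and the service should be stopped first):

```sql
-- Inspect both copies and note their ctids.
SELECT ctid, * FROM shards WHERE shard_id = 0;

-- Delete the stale copy by its ctid (the value here is illustrative).
DELETE FROM shards WHERE shard_id = 0 AND ctid = '(0,2)';
```

Since a duplicated primary key suggests a corrupted unique index, running REINDEX TABLE shards afterwards is probably worthwhile as well.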