I am doing some metrics collection and want to do some aggregations based on Operation.
What would you say are the top five (give or take) operations across all services that we should focus on?
Or is there a top five (give or take) for each individual service? If so, could you list them?
Thanks in advance.
Solution
First of all, this question is quite vague. The list below is simply my own preferred minimum set of monitors.
Server metrics
You should monitor the availability and latency of every API on every service, and of the persistence API as well.
You should monitor queue latency from the history service. This is the key metric for understanding background task performance, which API availability and latency do not capture.
You should build a dashboard of API counters for each service so that you can see how the load changes over time.
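To make the per-operation aggregation concrete, here is a minimal sketch of grouping raw request samples by service and operation and deriving availability and a latency percentile for each group. The `Sample` record, its field names, and the label values are assumptions made for illustration. They are not the server's actual metric schema, and in practice your metrics backend would normally do this aggregation for you.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One observed API call (hypothetical record shape)."""
    service: str       # e.g. "frontend" or "history" (illustrative values)
    operation: str     # the Operation tag mentioned in the question
    latency_ms: float
    ok: bool           # False when the call returned an error


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(samples):
    """Group samples by (service, operation) and summarize each group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample.service, sample.operation)].append(sample)

    summary = {}
    for key, group in groups.items():
        latencies = [s.latency_ms for s in group]
        summary[key] = {
            "requests": len(group),  # request counter, useful for load dashboards
            "availability": sum(s.ok for s in group) / len(group),
            "p99_ms": percentile(latencies, 99),
        }
    return summary
```

Sorting the summary by `requests` or by `p99_ms` is one way to find the handful of operations worth watching first, rather than fixing a top-five list in advance.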
Client metrics
You should monitor workflow failures and timeouts.
You should monitor activity task failures and timeouts.
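As a rough sketch of how the client-side monitors could be evaluated, the snippet below turns failure and timeout counts into a ratio and flags anything above a threshold. The counter names, the dictionary layout, and the 5% default threshold are all assumptions chosen for illustration, not values taken from any client library.

```python
def failure_ratio(failed, timed_out, total):
    """Fraction of executions that failed or timed out; 0.0 when nothing ran."""
    if total == 0:
        return 0.0
    return (failed + timed_out) / total


def kinds_to_alert(counters, threshold=0.05):
    """Return the kinds (e.g. "workflow", "activity") whose ratio exceeds threshold.

    `counters` maps a kind to a dict with hypothetical keys
    "failed", "timed_out" and "total" for one time window.
    """
    return [
        kind
        for kind, c in counters.items()
        if failure_ratio(c["failed"], c["timed_out"], c["total"]) > threshold
    ]
```

Alerting on a ratio rather than a raw count keeps the monitor meaningful as the load changes, though a raw-count alert can still be useful for low-volume workflows.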