We use https://github.com/Labbs/github-actions-exporter at my workplace, and I'm having a hard time writing a query that can alert us when a particular workflow in a particular repo has failed a few times in a row.
We have a metric called github_workflow_run_status. It has repo and workflow labels, but the exporter emits a new series for each run. The value of each series is 0 for a failure or 1 for a success.
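For illustration, the scraped series might look something like this (the run_id label and exact label set are my guess, not taken from the exporter's docs):

```
github_workflow_run_status{repo="xxx/yyy", workflow="zzz", run_id="101", status="completed"} 1
github_workflow_run_status{repo="xxx/yyy", workflow="zzz", run_id="102", status="completed"} 0
github_workflow_run_status{repo="xxx/yyy", workflow="zzz", run_id="103", status="completed"} 0
```

Each run adds a fresh series rather than updating an existing one, which is why ordinary gauge-style queries fit so awkwardly.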
Previously we used a hack like this:

count by (workflow, source_eks_cluster, repo) (sum_over_time(github_workflow_run_status{repo='xxx/yyy', workflow='zzz', event='schedule', status='completed'}[185m]) == 0)
The workflow ran on a cron schedule and this basically works, but because the metrics don't 'decay' quickly, once we get a few failures in a row and the alert goes off, it just keeps firing for days.
To complicate matters further, we've switched the workflow from a 30m schedule to an on-demand trigger, so the interval between runs is now variable.
I feel like this exporter is a bit of an anti-pattern: these 'metrics' are really 'events'. But it is what it is, and I've got what I've got.
There must be a better way, but I'm not seeing it. So how do I treat these 'event'-style metrics?
I ended up going with this:
increase((count by (repo,workflow) (github_workflow_run_status{repo='xxx/yyy', workflow='zzz'}) - sum by (repo,workflow) (github_workflow_run_status{repo='xxx/yyy', workflow='zzz'}))[2h:5m])
Since each run reports 1 for success and 0 for failure, count minus sum is the number of failed runs among the current series; each new failure bumps that difference by one, so taking increase over a [2h:5m] subquery gives me the number of failed jobs in the last two hours to work with for my alert.
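For reference, wired into a Prometheus alerting rule it could look something like the sketch below; the rule name, threshold of 3, and severity label are placeholders I've chosen, not anything prescribed by the exporter:

```yaml
groups:
  - name: github-workflows
    rules:
      - alert: WorkflowRepeatedFailures
        # count(...) - sum(...) = number of failed runs; increase over the
        # 2h subquery counts how many new failures appeared in that window.
        expr: |
          increase((count by (repo,workflow) (github_workflow_run_status{repo='xxx/yyy', workflow='zzz'})
            - sum by (repo,workflow) (github_workflow_run_status{repo='xxx/yyy', workflow='zzz'}))[2h:5m]) >= 3
        labels:
          severity: warning
        annotations:
          summary: "Workflow {{ $labels.workflow }} in {{ $labels.repo }} failed repeatedly in the last 2h"
```

Because the expression measures recent increases rather than absolute state, the alert resolves on its own once failures stop arriving, which avoids the days-long firing problem of the sum_over_time hack.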