Several of our applications have nightly batch jobs that aggregate data. These batch jobs are Python scripts that push metric values to Prometheus via the Pushgateway, and we have alerting rules that fire (through Alertmanager) when these metrics become invalid (e.g. exceed a certain threshold).
We would now also like to use Prometheus metrics to double-check that the batch jobs themselves ran correctly: for example, did the job start on time? Did any errors occur? Did the job run to completion? To this end, we would like to change our Python scripts to push a metric when the script starts and finishes, and whenever an error occurs. This does raise some problems, though: we have quite a few batch jobs, and three metrics per batch job means a lot of manual rule/alert configuration. We would also like to display each job's status graphically in Grafana and aren't really sure what the right visual for that would look like.
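For reference, the change to each script would look roughly like this (a sketch assuming the prometheus_client library; the metric names, the job name, the Pushgateway address, and run_batch_job are placeholders for our setup):

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

registry = CollectorRegistry()

# Placeholder metric names -- one series per event we want to track.
started_at = Gauge("batch_job_started_at",
                   "Unix timestamp when the job started", registry=registry)
finished_at = Gauge("batch_job_finished_at",
                    "Unix timestamp when the job finished", registry=registry)
errors = Counter("batch_job_errors",
                 "Errors encountered during the job run", registry=registry)

def push():
    # Pushgateway address and grouping-key job name are placeholders.
    push_to_gateway("localhost:9091", job="nightly_aggregation", registry=registry)

def run_batch_job():
    """Placeholder for the actual nightly aggregation work."""
    ...

started_at.set_to_current_time()
push()
try:
    run_batch_job()
except Exception:
    errors.inc()
    push()
    raise
finished_at.set_to_current_time()  # only reached on success
push()
```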
Has anyone else tackled a similar problem, using Prometheus metrics to monitor the status of several batch jobs? Which metrics did you record, and what did your alerts/rules look like? Did you find an intuitive way to graphically display the status of each batch job?
You could expose a metric per batch job called last_run_at.
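For example, the script could set it to the current time at the end of a successful run and push it (a minimal sketch assuming the prometheus_client library; the Pushgateway address and job name are placeholders):

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

# Unix timestamp of the last successful run; one series per batch job.
last_run_at = Gauge("last_run_at",
                    "Unix timestamp of the last successful run of this batch job",
                    registry=registry)
last_run_at.set_to_current_time()

# The grouping key ("job") is what distinguishes one batch job from another
# in the Pushgateway; the address is a placeholder.
push_to_gateway("localhost:9091", job="nightly_aggregation", registry=registry)
```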
You could then alert whenever a job last ran more than 24 hours ago (or whatever your threshold is).
A simple alert expression would be: last_run_at{env="prod"} < time() - 60 * 60 * 24
The time() function returns the evaluation timestamp as a scalar, so it can be compared against the metric directly (no scalar() wrapper needed). Docs: https://prometheus.io/docs/prometheus/latest/querying/functions/#time
You don't have to create one alert per job: if you drop the label matchers, last_run_at < time() - 60 * 60 * 24 fires for any job that hasn't run in the last 24 hours, and you can still filter by environment or any other label if you want to.
The point is that it doesn't have to be a 1:1 mapping from job to alert; a single rule like the sketch below covers every job. Graphing this in Grafana should be fairly easy too, e.g. a Stat or Table panel showing time() - last_run_at per job, with thresholds that flag a job once it goes stale.
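As a sketch (the group/alert names, severity, and the 24-hour threshold are placeholders), a single alerting rule covering every batch job could look something like this:

```yaml
groups:
  - name: batch-jobs            # placeholder group name
    rules:
      - alert: BatchJobNotRun
        # Fires for every last_run_at series that is older than 24 hours,
        # no matter which batch job or environment it belongs to.
        expr: last_run_at < time() - 60 * 60 * 24
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Batch job {{ $labels.job }} has not run in the last 24 hours"
```

Because the expression has no job-specific matchers, adding a new batch job only requires the script to push last_run_at; no extra rule or alert configuration is needed.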