
Use today's date in label values in Prometheus alerting rules


I am new to Prometheus and alerting, and I couldn't find an answer in the documentation.

I have some data coming into an Elasticsearch cluster. Every day, the process creates a new index in Elasticsearch and writes that day's data to it (e.g., my_index-2019-10-06, my_index-2019-10-05, ...). I want to monitor the size of today's index and check that it is growing; if it does not grow within a defined interval (15 minutes, for example), I want to fire an alert in Prometheus. To do so, I was thinking of an expr like this in the alert rule:

expr: delta(elasticsearch_index_primary_store_size{index_name="my_index-TODAY-DATE"}[15m]) <= 0

TODAY-DATE should be dynamic and generated every day. But as far as I understand, you cannot have a dynamic value in a label value, and there is no function to get the date either. I then thought about comparing the delta of the sum of the sizes of all indices whose names start with my_index. The problem with that approach is retention: when an old index is deleted, the delta of the sum may be negative even though new data is arriving in today's index. Do you have any solution for this problem?

Thanks in advance.


Solution

  • The problem comes from your assumption that you would be alerting based on the delta() of a sum() of timeseries, which is one of the first things the Prometheus documentation warns against. (And which, before subqueries were introduced, was impossible to do with a single query; you needed to set up recording rules to achieve that.)

    If instead you use a sum() of delta() values (and your exporter doesn't produce a zero or rapidly decreasing index size metric during deletion), you're all set. When an index is deleted, its delta silently disappears from the results produced by delta() and does not affect the resulting sum in any way. Previous days' indexes will probably not change size and thus also not affect the sum. And in case there's, e.g., compaction going on, causing index sizes to drop suddenly, you can clamp those negative deltas to zero. (Filtering them out with > 0 instead would leave sum() with no result, and the alert unable to fire, whenever no index is growing.)

    expr: sum(clamp_min(delta(elasticsearch_index_primary_store_size{index_name=~"my_index-.*"}[15m]), 0)) <= 0
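
    For context, here is a minimal sketch of how a sum-of-deltas expression could sit in a complete rule file. The group name, alert name, `for` duration, severity label and annotation text are all illustrative assumptions, not anything taken from the question. This variant uses clamp_min() to turn negative deltas into zero, so the sum still has a value (0) when nothing is growing.

    ```yaml
    groups:
      - name: elasticsearch-index-growth      # hypothetical group name
        rules:
          - alert: MyIndexNotGrowing          # hypothetical alert name
            expr: |
              # Per-index change over 15 minutes, with shrinkage (deletion,
              # compaction) clamped to 0 so it cannot cancel out real growth.
              sum(
                clamp_min(
                  delta(elasticsearch_index_primary_store_size{index_name=~"my_index-.*"}[15m]),
                  0
                )
              ) <= 0
            for: 5m                           # assumed grace period
            labels:
              severity: warning               # assumed label
            annotations:
              summary: "No my_index-* index has grown in the last 15 minutes"
    ```

    Note that if no my_index-* series exists at all, the expression returns nothing and the alert stays silent; a separate absent() rule would probably be needed to cover that case.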
    

    That being said, you could generate a label with today's date as its value using count_values without() ("year", year(vector(time()))) (and likewise month() and day_of_month()) plus label_join() / label_replace(), but you probably don't want to go there.
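
    For the curious, the date-label trick might look roughly like the sketch below. It rests on several assumptions: index names follow my_index-YYYY-MM-DD with zero-padded month and day, the day boundary is UTC (PromQL's date functions work in UTC), and the group and alert names are made up. It builds one synthetic series carrying index_name="my_index-<today>" and uses it to select only today's index.

    ```yaml
    groups:
      - name: elasticsearch-todays-index      # hypothetical group name
        rules:
          - alert: TodaysIndexNotGrowing      # hypothetical alert name
            expr: |
              # Left side: indices whose primary store size did not grow in 15m.
              delta(elasticsearch_index_primary_store_size[15m]) <= 0
              and on (index_name)
              # Right side: a single series labelled index_name="my_index-YYYY-MM-DD".
              label_replace(
                label_join(
                  label_replace(
                    label_replace(
                        count_values("year", year())
                      * on () group_left (month) count_values("month", month())
                      * on () group_left (day) count_values("day", day_of_month()),
                      "month", "0$1", "month", "(.)"   # zero-pad single-digit months
                    ),
                    "day", "0$1", "day", "(.)"         # zero-pad single-digit days
                  ),
                  "date", "-", "year", "month", "day"
                ),
                "index_name", "my_index-$1", "date", "(.+)"
              )
            for: 5m                           # assumed grace period
    ```

    The regex in label_replace() is fully anchored, so "(.)" only matches one-character values and leaves two-digit months and days alone. The comparison is written on the left operand because `<=` binds tighter than `and`. As the length of this expression suggests, the sum-of-deltas approach is much easier to maintain.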