I am using Grafana Loki to monitor system logs, and I need to determine when each system last sent a log entry. My goal is to identify systems that are offline. The following query identifies such systems by checking for logs in mrs_system_info and not in mrs_error_list within a certain time frame:
count by(system) (count_over_time({job="mrs_system_info"} [10m]))
unless
count by(system) (count_over_time({job="mrs_error_list"} [5m]))
This query effectively flags the systems that are considered offline, as illustrated in the linked image. However, I am running into an issue: the timestamps are identical and inaccurate, merely reflecting the current evaluation time. My approach was to obtain the last log entry for each system with an additional query and then use Grafana transformations to associate those entries with the corresponding offline systems. However, I am hitting the 5000-log query limit and need a workaround. Ideally, I would like to retrieve just the last log entry per system for job="mrs_system_info", but I am unsure of the correct Loki query for this.
Once I have the last entry for each system—for instance, for the job="mrs_error_list"—I could use transformations to pair it with the offline systems to get the correct timestamp.
How can I modify my Loki query to retrieve only the last log entry per system within a specified time range, considering the high volume of logs?
You can use the max_over_time function. The following query returns the last log timestamp for each system over the last 24 hours. Change the query type from Range to Instant.
max_over_time(
{job="mrs_system_info"}
| label_format ts=`{{ __timestamp__ | unixEpoch }}`
| unwrap ts
[24h]
) by (system)
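If you consume the instant-query result programmatically instead of in a dashboard, Loki returns a Prometheus-style vector, where each sample's value is the unwrapped ts (unix epoch seconds) as a string. Here is a minimal sketch of turning that response into a per-system "last seen" map; the payload below is fabricated for illustration, and the exact response shape is an assumption based on Loki's Prometheus-compatible query API.

```python
from datetime import datetime, timezone

def last_seen_per_system(result: dict) -> dict:
    """Map each `system` label to its last log timestamp.

    Expects the vector result returned by Loki's instant-query endpoint
    for the max_over_time query above; each sample's value is the
    unwrapped `ts` (unix epoch seconds) rendered as a string.
    """
    out = {}
    for sample in result["data"]["result"]:
        system = sample["metric"].get("system", "<unknown>")
        epoch = float(sample["value"][1])  # value is [eval_time, "ts"]
        out[system] = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return out

# Fabricated example payload mimicking an instant-query response:
payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"system": "sys-a"}, "value": [1700000100, "1700000042"]},
        ],
    },
}
print(last_seen_per_system(payload))
```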
If you have a really high volume of log data and query execution is slow, you can use recording rules to periodically evaluate the query over a small time window and send the results to Mimir or Prometheus.
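As a sketch of that approach, a recording rule for the Loki ruler could look like the following. The group and metric names here are hypothetical, and this assumes your ruler is configured with remote_write to Mimir or Prometheus; rules follow the standard Prometheus rule-file format.

```yaml
groups:
  - name: last-seen  # hypothetical group name
    interval: 1m     # evaluate over a small window instead of 24h
    rules:
      - record: system:last_log_timestamp  # hypothetical metric name
        expr: |
          max_over_time(
            {job="mrs_system_info"}
            | label_format ts=`{{ __timestamp__ | unixEpoch }}`
            | unwrap ts
            [5m]
          ) by (system)
```

You can then query the recorded metric from Mimir/Prometheus, which stays fast regardless of log volume.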