Tags: grafana, grafana-loki, grafana-variable, promtail, logql

How to Count Specific ERROR Messages Using Loki in Grafana?


Problem Description: I have a setup involving Promtail, Loki, and Grafana, where I am currently sending systemd-journal logs to Loki. I am attempting to create a Grafana dashboard that showcases the count of distinct ERROR messages from the logs over a period of 7 days. The log samples contain various ERROR messages and my goal is to have a breakdown of each distinct ERROR message with its respective count displayed on the dashboard.

Details and Samples: Here is an excerpt from the logs that are being sent to Loki:

2023-09-14 08:57:55  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
2023-09-14 08:57:35  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
2023-09-14 08:57:31  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
2023-09-14 03:39:07  ERROR: VDAI: SetErrorClass[Timeout]: Keine Verbindung nach 5 Minuten
2023-09-13 11:25:55  ERROR: No reversePort received
2023-09-13 11:25:55  ERROR: ssh failed with exitcode=255 in requestReversePortNumber()
2023-09-13 11:25:20  ERROR: No reversePort received
2023-09-13 11:25:20  ERROR: ssh failed with exitcode=255 in requestReversePortNumber()
2023-09-13 11:24:41  ERROR: No reversePort received

I am using the following query to count all ERROR messages from the last 7 days:

count_over_time({job="systemd-journal"} |~ `ERROR:` [7d])

Expected Outcome: I want to craft a query that counts specific ERROR messages, enabling me to present data on my Grafana dashboard as below:

[3] ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
[1] ERROR: VDAI: SetErrorClass[Timeout]: Keine Verbindung nach 5 Minuten
[3] ERROR: No reversePort received
[2] ERROR: ssh failed with exitcode=255 in requestReversePortNumber()

Attempts and Challenges: So far, the query I have only returns the total count of ERROR messages over the time range, and does not break it down by distinct messages.

Questions: How can I create a query that counts specific ERROR messages in Grafana using Loki and displays the count for each unique ERROR message? Any guidance or direction would be highly appreciated. Thank you!


Solution

  • To do this, you need to extract everything from ERROR: onward into a label, and then aggregate on that label.

    You can do this with the regexp parser. The query

    {job="systemd-journal"} |~ `ERROR:` | regexp `(?P<error_message>ERROR:.*)`

    returns every log line that contains ERROR:, with the whole error message (including the ERROR: prefix) extracted into the label error_message.

    Then all you need is to aggregate over this label. The full query is:

    sum by(error_message)(count_over_time({job="systemd-journal"} |~ `ERROR:` | regexp `(?P<error_message>ERROR:.*)` [7d]))
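
To sanity-check the regular expression outside Loki, the extract-then-aggregate logic can be mimicked offline. The following is only a rough Python sketch, not how Loki evaluates the query: it applies the same named group to the sample lines from the question and counts the distinct captured messages. The names SAMPLE_LOGS and count_error_messages are made up for this illustration. Python's re module and Loki's RE2 engine are assumed to treat this particular simple pattern the same way.

```python
import re
from collections import Counter

# Sample lines copied from the question (hypothetical stand-in for the Loki stream).
SAMPLE_LOGS = """\
2023-09-14 08:57:55  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
2023-09-14 08:57:35  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
2023-09-14 08:57:31  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
2023-09-14 03:39:07  ERROR: VDAI: SetErrorClass[Timeout]: Keine Verbindung nach 5 Minuten
2023-09-13 11:25:55  ERROR: No reversePort received
2023-09-13 11:25:55  ERROR: ssh failed with exitcode=255 in requestReversePortNumber()
2023-09-13 11:25:20  ERROR: No reversePort received
2023-09-13 11:25:20  ERROR: ssh failed with exitcode=255 in requestReversePortNumber()
2023-09-13 11:24:41  ERROR: No reversePort received
"""

# Same named capture group as in the LogQL regexp stage.
PATTERN = re.compile(r"(?P<error_message>ERROR:.*)")


def count_error_messages(lines):
    """Count lines per distinct captured error_message (lines without a match are skipped)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("error_message")] += 1
    return counts


counts = count_error_messages(SAMPLE_LOGS.splitlines())
for message, n in counts.most_common():
    print(f"[{n}] {message}")
```

On the sample lines this gives the same per-message counts as the expected outcome in the question (3, 1, 3 and 2). Because the timestamp sits before ERROR: and is not part of the captured group, identical messages logged at different times fall into the same bucket, which is what makes the sum by(error_message) aggregation work.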