grafana grafana-loki grafana-variable promtail logql

How to Count Specific ERROR Messages Using Loki in Grafana?

Problem Description: I have a setup involving Promtail, Loki, and Grafana, where I am currently sending systemd-journal logs to Loki. I am attempting to create a Grafana dashboard that showcases the count of distinct ERROR messages from the logs over a period of 7 days. The log samples contain various ERROR messages and my goal is to have a breakdown of each distinct ERROR message with its respective count displayed on the dashboard.

Details and Samples: Here is an excerpt from the logs that are being sent to Loki:

2023-09-14 08:57:55  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
2023-09-14 08:57:35  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
2023-09-14 08:57:31  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
2023-09-14 03:39:07  ERROR: VDAI: SetErrorClass[Timeout]: Keine Verbindung nach 5 Minuten
2023-09-13 11:25:55  ERROR: No reversePort received
2023-09-13 11:25:55  ERROR: ssh failed with exitcode=255 in requestReversePortNumber()
2023-09-13 11:25:20  ERROR: No reversePort received
2023-09-13 11:25:20  ERROR: ssh failed with exitcode=255 in requestReversePortNumber()
2023-09-13 11:24:41  ERROR: No reversePort received

I am using the following query to count all ERROR messages from the last 7 days:

count_over_time({job="systemd-journal"} |~ `ERROR:` [7d])

Expected Outcome: I want to craft a query that counts specific ERROR messages, enabling me to present data on my Grafana dashboard as below:

[3] ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login
[1] ERROR: VDAI: SetErrorClass[Timeout]: Keine Verbindung nach 5 Minuten
[3] ERROR: No reversePort received
[2] ERROR: ssh failed with exitcode=255 in requestReversePortNumber()

Attempts and Challenges: So far, the query I have only returns the total count of ERROR messages over the time range, and does not break it down by distinct messages.

Questions: How can I create a query that counts specific ERROR messages in Grafana using Loki and displays the count for each unique ERROR message? Any guidance or direction would be highly appreciated. Thank you!

Solution

To do this, you need to extract everything from ERROR: onward into a label and then aggregate by that label.

You can extract it with the regexp parser. The query {job="systemd-journal"} |~ `ERROR:` | regexp `(?P<error_message>ERROR:.*)` returns every log line that contains ERROR:, with the whole error message stored in the label error_message.

Then all you need to do is aggregate over this label. The full query is:

sum by(error_message)(count_over_time({job="systemd-journal"} |~ `ERROR:` | regexp `(?P<error_message>ERROR:.*)` [7d]))
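To see what this query does to the sample lines from the question, here is a rough offline sketch in Python. It is not how Loki is implemented; it only imitates the three steps (line filter, regexp extraction into error_message, count per label value). The names LOG_LINES, PATTERN, and count_errors are made up for this illustration, and it relies on Python's re module accepting the same (?P<name>...) named-group syntax as the pattern above.

```python
import re
from collections import Counter

# Sample lines copied from the question (hypothetical stand-in for the Loki stream).
LOG_LINES = [
    "2023-09-14 08:57:55  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login",
    "2023-09-14 08:57:35  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login",
    "2023-09-14 08:57:31  ERROR: AuthController: Login TECHNIKER: Falsches Passwort bei Local-Login",
    "2023-09-14 03:39:07  ERROR: VDAI: SetErrorClass[Timeout]: Keine Verbindung nach 5 Minuten",
    "2023-09-13 11:25:55  ERROR: No reversePort received",
    "2023-09-13 11:25:55  ERROR: ssh failed with exitcode=255 in requestReversePortNumber()",
    "2023-09-13 11:25:20  ERROR: No reversePort received",
    "2023-09-13 11:25:20  ERROR: ssh failed with exitcode=255 in requestReversePortNumber()",
    "2023-09-13 11:24:41  ERROR: No reversePort received",
]

# Same pattern as in the LogQL regexp stage.
PATTERN = re.compile(r"(?P<error_message>ERROR:.*)")


def count_errors(lines):
    """Count lines per extracted error_message, imitating sum by(error_message)."""
    counts = Counter()
    for line in lines:
        # Imitates the line filter |~ `ERROR:`
        if "ERROR:" not in line:
            continue
        # Imitates | regexp `(?P<error_message>ERROR:.*)`
        match = PATTERN.search(line)
        if match:
            counts[match.group("error_message")] += 1
    return counts


if __name__ == "__main__":
    for message, count in count_errors(LOG_LINES).items():
        print(f"[{count}] {message}")
```

Each distinct message becomes one key with its own count, which is the same grouping the query produces as one series per error_message value.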