I have a Splunk query that summarises errors by frequency:
index="pc_1" LogLevel=ERROR
| eval Message=split(_raw,"|")
| stats count(LogLevel) as Frequency by Message
| sort -Frequency
This produces results in the form
Message | Frequency |
---|---|
No such user | 137 |
unable to deliver mail to example@email.com: Unable to reach server | 70 |
unable to deliver mail to example1@email.com: Unable to reach server | 43 |
unable to authenticate user 3456 | 8 |
unable to deliver mail to example2@email.com: Unable to reach server | 6 |
unable to authenticate user 2321 | 5 |
unable to authenticate user 13321 | 3 |
... | . |
... | . |
... | . |
unable to deliver mail to examplen@email.com: Unable to reach server | 1 |
As you can see in the results, similar errors are split into separate rows because they differ only in user IDs, email addresses, or machine IDs. I am looking for a way to group them based on string similarity. Currently I replace the variable parts of each message using a regular expression and then compute the frequency:
index="pc_1" LogLevel=ERROR
| eval Message=split(_raw,"|")
| eval Message=replace(Message, "unable to deliver mail to .* Unable to reach server", "unable to deliver mail to [email]: Unable to reach server")
| eval Message=replace(Message, "unable to authenticate user \d+", "unable to authenticate user [userId]")
| stats count(LogLevel) as Frequency by Message
| sort -Frequency
This approach works but is quite cumbersome: there are many different types of errors, and implementing it would require going through each one and writing a regular expression for it.
Is there a query that can summarise these errors more effectively?
Answer for posterity:
Perhaps the cluster command will help. It groups similar messages together.
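As a minimal sketch, cluster could be applied to the search from the question roughly as follows. It assumes the default similarity threshold (t=0.8) is a reasonable starting point for these messages; that value is only a guess and will likely need tuning against the real data. By default cluster compares the _raw field, and showcount=t adds a cluster_count field holding the size of each group.

```spl
index="pc_1" LogLevel=ERROR
| cluster showcount=t t=0.8
| table cluster_count _raw
| sort -cluster_count
```

Each row shows one representative event per cluster together with its count. Raising t towards 1 makes the grouping stricter, so fewer messages are merged, and lowering it merges more aggressively. The field option can point cluster at a field other than _raw if only part of the event should be compared.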