
How to group strings based on similarities in the string

I have a Splunk query that summarises errors by frequency:

index="pc_1" LogLevel=ERROR 
   | eval Message=split(_raw,"|") 
   | stats count(LogLevel) as Frequency by Message 
   | sort -Frequency

This produces results in the form:

Message                                             Frequency
No such user                                        137
unable to deliver mail to Unable to reach server    70
unable to deliver mail to Unable to reach server    43
unable to authenticate user 3456                    8
unable to deliver mail to Unable to reach server    6
unable to authenticate user 2321                    5
unable to authenticate user 13321                   3
...                                                 .
...                                                 .
...                                                 .
unable to deliver mail to Unable to reach server    1

As you can see from the results, similar errors are split into separate rows because they differ only in user IDs, email addresses, or machine IDs. I am looking for a way to group them based on similarity between the strings. Currently I replace the variable parts using a regular expression for each error type and then compute the frequency:

index="pc_1" LogLevel=ERROR 
   | eval Message=split(_raw,"|")

   | eval Message=replace(Message, "unable to deliver mail to .* Unable to reach server", "unable to deliver mail to [email]: Unable to reach server")
   | eval Message=replace(Message, "unable to authenticate user \d+", "unable to authenticate user [userId]")

   | stats count(LogLevel) as Frequency by Message 
   | sort -Frequency

This approach works but is quite cumbersome: there are many different types of errors, and implementing it fully would require going through each one and writing a regular expression for it.

Is there a way to improve this with a query that summarizes these errors more effectively?


  • Answer for posterity:

    Perhaps the cluster command will help. It groups similar events together.
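
    As a rough sketch (not tested against this data), something like the following might be a starting point. It clusters on _raw, which is the field cluster uses by default, rather than on the split Message field. showcount=t adds a cluster_count field holding the size of each cluster, and t is the similarity threshold: the closer it is to 1.0, the more alike events must be to land in the same cluster. The value 0.8 is the default and is only an assumption here, so expect to tune it.

    index="pc_1" LogLevel=ERROR
       | cluster showcount=t t=0.8
       | table cluster_count _raw
       | sort -cluster_count

    Each row is then one representative event per cluster, with cluster_count playing the role of Frequency. To cluster on a different field, cluster also accepts field=<fieldname>.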