I am trying to parse log lines for log anomaly detection, but two log lines are too similar for the parser to keep them apart:
[Something] VM Started
[Something] VM Paused
The parser reduces both lines to the template VM <*>, grouping the two events into the same cluster. I tried masking Started and Paused to force it to tell them apart, but maybe I haven't found the right masking option yet.
I am aware that in general the parser does a fine job of finding the variable content, but in this case I would like to keep the two events separate.
The last idea I have is to mask the entire line and replace it, but I wonder whether a better way exists.
Those log sequences are pretty short, and we would need to know the value of your similarity threshold (st) configuration setting. In the original paper the similarity is defined as follows.
It counts the tokens that the two log lines share at the same position and divides that count by the number of tokens in the sequence. (I assume in your case that number is 2, since [Something] does not appear to be part of your template and is likely removed prior to parsing.)
That similarity value (in your case simSeq = 0.5) is compared to the similarity threshold st (default = 0.4). If the similarity is greater than the similarity threshold, then the log lines are considered equal.
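To make the arithmetic concrete, here is a minimal sketch of that similarity computation in plain Python. It is not the parser's actual code: the function name sim_seq and the whitespace tokenization are assumptions for illustration, and whether the real comparison is strict or inclusive depends on the implementation.

```python
def sim_seq(template_tokens, log_tokens, wildcard="<*>"):
    """Fraction of positions where the template and the log line hold the same token.

    Hypothetical helper for illustration; wildcard positions do not count as matches.
    """
    if len(template_tokens) != len(log_tokens):
        raise ValueError("sequences must have the same number of tokens")
    if not template_tokens:
        return 1.0
    matches = sum(
        1
        for t, l in zip(template_tokens, log_tokens)
        if t != wildcard and t == l
    )
    return matches / len(template_tokens)


# Assumes "[Something]" has already been stripped before parsing.
similarity = sim_seq("VM Started".split(), "VM Paused".split())

# Only "VM" matches, so 1 of 2 tokens agree.
merged_at_default = similarity >= 0.4  # st = 0.4: the lines fall into one cluster
merged_at_raised = similarity >= 0.6   # st = 0.6: the lines stay apart
```

With st at 0.4 the two lines merge into VM <*>; with st at 0.6 they do not, which is why raising the threshold separates them.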
Try tuning your st value to 0.6, although that might wreak havoc with longer sequences. One solution for that problem is to make the similarity threshold a dictionary that depends on sequence length. That way you can have lower values for longer sequences and larger values for shorter ones.
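A length-dependent threshold could be sketched roughly like this. This is an assumption about how one might implement the idea, not an existing option of the parser: ST_BY_LENGTH, DEFAULT_ST, threshold_for and same_cluster are hypothetical names, and the threshold values are placeholders to tune on your own logs.

```python
# Hypothetical per-length thresholds: stricter for short sequences.
DEFAULT_ST = 0.4
ST_BY_LENGTH = {
    1: 1.0,  # a single token must match exactly
    2: 0.6,  # "VM Started" vs "VM Paused" (0.5) stays separate
    3: 0.6,
}


def threshold_for(num_tokens):
    """Return the threshold for a sequence length, falling back to the default."""
    return ST_BY_LENGTH.get(num_tokens, DEFAULT_ST)


def same_cluster(template_tokens, log_tokens):
    """Decide whether a log line would join a template under the per-length threshold."""
    if len(template_tokens) != len(log_tokens) or not template_tokens:
        return False
    matches = sum(1 for t, l in zip(template_tokens, log_tokens) if t == l)
    similarity = matches / len(template_tokens)
    return similarity >= threshold_for(len(template_tokens))


short_result = same_cluster(["VM", "Started"], ["VM", "Paused"])
long_result = same_cluster(
    ["Connection", "from", "host", "a1", "closed"],
    ["Connection", "from", "host", "b2", "closed"],
)
```

Here the two-token VM lines are kept apart, while the five-token lines (4 of 5 tokens shared) are still merged under the default threshold.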