antlr4lexer

ANTLR4 lexer rule with an action gets rather slow when processing long strings


I work with a lexer grammar for a language that has binary strings (octetstrings) formatted like this:

'00A1C2'O

The rule for these strings contains the following (simplified for clarity):

boolean isOdd = false;

...
(
   [0-9a-fA-F] { isOdd = !isOdd; }
   ...
)*

This g4 file was written long ago and is functionally correct, but I accidentally found that the above part of the grammar gets rather slow when processing binary strings that contain thousands of characters. If I remove the action, there is no slowdown.

I tested this with multiple ANTLR versions and found no real difference.

I have refactored the grammar and it now performs well, but I'm curious whether this slowdown is normal. Having very limited knowledge of ANTLR internals, my guess is that calling an action for a rule is an expensive operation, and doing this a couple of thousand times can degrade performance.

Anyway, I am still curious about the reason for this slowdown, so I would be grateful if someone could enlighten me.


Solution

  • If you have a lot of embedded code inside ( … )*, which, as mentioned by Mike in the comments, can get executed thousands of times because the * means zero or more times, then it is not surprising that things get slow. Preferably, you don't use target-specific code inside your grammar at all. Quite often, it is a far better option to leave the code out of parser or lexer rules and perform such validations after the input has been parsed. You'd do such things in a visitor or listener: it will make parsing/lexing faster and will keep your grammar clean of embedded target code.
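As a rough illustration of moving the check out of the grammar, here is a minimal Java sketch that validates the token text after lexing instead of toggling a flag per character. It assumes the purpose of isOdd was to reject octetstrings with an odd number of hex digits, and that only hex digits appear between the quotes (the elided parts of the original rule are not modelled). The class and method names are hypothetical, not part of ANTLR; in a real project you would call something like this from a listener or visitor with the token's text.

```java
public final class OctetStringCheck {
    private OctetStringCheck() {}

    // True only for ASCII hex digits, matching the [0-9a-fA-F] set in the rule.
    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9')
            || (c >= 'a' && c <= 'f')
            || (c >= 'A' && c <= 'F');
    }

    /**
     * Hypothetical post-lexing check: the text must look like 'HEX'O
     * and contain an even number of hex digits.
     */
    public static boolean isValidOctetString(String tokenText) {
        // Shortest possible form is ''O (three characters).
        if (tokenText == null || tokenText.length() < 3) {
            return false;
        }
        if (tokenText.charAt(0) != '\'' || !tokenText.endsWith("'O")) {
            return false;
        }
        int digits = 0;
        // Walk the characters between the opening quote and the closing 'O.
        for (int i = 1; i < tokenText.length() - 2; i++) {
            if (!isHexDigit(tokenText.charAt(i))) {
                return false;
            }
            digits++;
        }
        return digits % 2 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidOctetString("'00A1C2'O")); // prints "true"
        System.out.println(isValidOctetString("'00A1C'O"));  // prints "false"
    }
}
```

With a check like this, the lexer rule can match the hex digits with a plain ( … )* loop and no action, and the odd/even validation runs once per token rather than once per character.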