I work with a lexer grammar for a language that has binary strings formatted like this octetstring:
'00A1C2'O
The rule for these strings contains the following (simplified for clarity):
boolean isOdd = false;
...
(
[0-9a-fA-F] { isOdd = !isOdd; }
...
)*
This g4 file was written long ago and is functionally correct, but I found by accident that this part of the grammar becomes rather slow when processing binary strings that contain thousands of characters. If I remove the action, there is no slowdown.
I tested this with multiple ANTLR versions and found no real difference.
I have since refactored the grammar and it now performs well, but I'm curious whether this slowdown is normal. With only limited knowledge of ANTLR internals, my guess is that invoking an action inside a rule is an expensive operation, and doing so a few thousand times degrades performance.
Anyway, I am still curious about the reason for this slowdown, so I would be grateful if someone could enlighten me.
If you have a lot of embedded code inside ( … )* †, then it is not surprising that things get slow. Preferably, you don't use target-specific code inside your grammar at all. Quite often, it is far better to leave the code out of parser or lexer rules and perform such validations after the input has been parsed. You'd do that in a visitor or listener: it makes parsing/lexing faster and keeps your grammar free of embedded target code.
† which, as mentioned by Mike in the comments, can get executed thousands of times because the * means zero or more times
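As a rough sketch of checking after lexing instead of inside the rule, the odd/even test could be done once on the matched token text. The class and method names here are hypothetical, and the sketch assumes that the only thing the isOdd action tracked was whether the number of hex digits between the quotes is odd:

```java
public class OctetStringCheck {

    // Hypothetical helper, not part of ANTLR: returns true if the text of an
    // octetstring token such as '00A1C2'O has an odd number of hex digits
    // between its quotes. Assumption: this is all the embedded action checked.
    public static boolean hasOddHexDigitCount(String tokenText) {
        int first = tokenText.indexOf('\'');
        int last = tokenText.lastIndexOf('\'');
        boolean isOdd = false;
        // Look only at the characters strictly between the two quotes, so the
        // trailing type letter is never counted.
        for (int i = first + 1; i < last; i++) {
            char c = tokenText.charAt(i);
            boolean isHexDigit = (c >= '0' && c <= '9')
                    || (c >= 'a' && c <= 'f')
                    || (c >= 'A' && c <= 'F');
            // Anything that is not a hex digit is skipped rather than counted.
            if (isHexDigit) {
                isOdd = !isOdd;
            }
        }
        return isOdd;
    }
}
```

In a listener, a visitor, or a loop over the token stream, this would be called with the token's text (for example the result of Token.getText()), so the grammar rule itself needs no action inside the ( … )* loop.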