I'm quite new to ANTLR and I need an explanation about its behaviour when recognising input string. From what I understood, rules staring by an uppercase letter are lexer rules while the ones starting by lowercase letter are parser rules.
I have the following language and I need to write the grammar accordingly:
So L should accept all the strings composed by:
This is the first grammar I wrote and I don't have problems with it when recognising the input bbbcbcbcbcdaaacacacaaaa
:
start : (alfa 'd' beta);
alfa : ('b'|'c') | ('b'|'c') alfa;
beta : ('a'|'c') | ('a'|'c') beta;
WS : (' '|'\n'|'\t'|'\r')->skip;
But if I change it like the following one, the same string is not recognised anymore (below you can see the debugger result of ANTLRWorks):
start : (alfa 'd' beta);
alfa : ALFA_VALS | ALFA_VALS alfa;
beta : BETA_VALS | BETA_VALS beta;
ALFA_VALS: ('b'|'c');
BETA_VALS: ('a'|'c');
WS : (' '|'\n'|'\t'|'\r')->skip;
Furthermore, if I change ALFA_VALS
and BETA_VALS
to alfa_vals
and beta_vals
respectively, no problems occur.
Could someone provide me an explanation about this behaviour? Because I couldn't find a specific one addressing this problem.
Thank you so much!
The ANTLR lexer matches the longest possible sub-sequence of input, or if it can match the same input using multiple lexer rules, it uses the first rule that matches.
The lexer knows no context and no parser state, and decides based solely on the characters on input.
If you define ALFA_VALS
and BETA_VALS
in lexer this way, the input 'c'
will always be matched as ALFA_VALS
token.