I am trying to generate a parser using ANTLR4.
My input seems quite simple, but let's have a look at my grammar first:
Lexer:
DOLLAR: '$' -> pushMode(VAR_MODE); // as soon as there's an "$" jump into VAR_MODE
TEXT: ~[\\$]+; // everything except "$"
mode VAR_MODE;
// a variable name must start with a letter
VARIABLE: [a-zA-Z][a-zA-Z0-9]+;
// as soon as there's something not a letter or a digit, pop back to default mode
// more makes sure the parsing will be continued after a variable
ENDVAR: ~[a-zA-Z0-9] -> popMode, more;
Parser:
parse:
content+
;
content:
variable
| text
;
variable
: DOLLAR VARIABLE
;
text
: TEXT+
;
A sample string would be:
Hi there $this is some $variable and here $are$two
The parser compiles and works in general. The only thing that does not work is the second of two adjacent variables ($two in $are$two): it is recognized as text, not as a variable. What am I doing wrong? As soon as I add a space between $are and $two, it works again, but I want to get rid of this space.
What is wrong with the grammar? It seems fine to me.
Is there another way of recognizing the variable, maybe without using an extra mode?
[Edit]
Another try
Lexer:
DOLLAR: '$' -> pushMode(VAR_MODE);
TEXT: ~[\\$]+;
mode VAR_MODE;
VARIABLE: [a-zA-Z][a-zA-Z0-9]*;
NEW_DOLLAR: '$' -> type(DOLLAR);
ENDVAR: ~[a-zA-Z0-9] -> popMode, more;
Parser:
parse:
content+
;
content:
variable
| text
;
variable
: DOLLAR VARIABLE
| NEW_DOLLAR VARIABLE
;
text
: TEXT+
;
It does work, even though it seems very wrong. Does anyone have a better idea?
[Edit 2]
A very simple solution would be
lexer grammar SampleLexer;
VARIABLE: '$'[a-zA-Z][a-zA-Z0-9]*;
TEXT: ~[\\$]+;
parser grammar SampleParser;
parse:
content+
;
content:
VARIABLE
| TEXT
;
The disadvantage is that I need to split the string myself to get the pure name. This is the best solution so far, but I would still like to split the '$' and the actual name into separate tokens.
It is because the $ from $two is matched as an ENDVAR, which is "glued" to the TEXT token, making $two a single TEXT token. What you need to do is also match a $ inside your var-mode:
DOLLAR : '$' -> pushMode(VAR_MODE);
TEXT : ~[\\$]+;
mode VAR_MODE;
VAR_DOLLAR : '$' -> type(DOLLAR);
VARIABLE : [a-zA-Z] [a-zA-Z0-9]*;
ENDVAR : ~[a-zA-Z0-9$] -> popMode, more;
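To make the difference concrete, here is a small Python sketch that hand-simulates the mode switching of both lexers. It is not ANTLR itself: the rule tables, the tokenize helper, and the way commands are encoded are all assumptions made for illustration, and only pushMode, popMode, more, and type are modelled (longest match wins, the earlier rule wins ties).

```python
import re

# Each rule: (token name, regex, lexer commands). Hypothetical encoding.
ORIGINAL = {
    "DEFAULT": [
        ("DOLLAR", r"\$", {"push": "VAR_MODE"}),
        ("TEXT", r"[^\\$]+", {}),
    ],
    "VAR_MODE": [
        ("VARIABLE", r"[a-zA-Z][a-zA-Z0-9]+", {}),
        ("ENDVAR", r"[^a-zA-Z0-9]", {"pop": True, "more": True}),
    ],
}

FIXED = {
    "DEFAULT": ORIGINAL["DEFAULT"],
    "VAR_MODE": [
        ("VAR_DOLLAR", r"\$", {"type": "DOLLAR"}),
        ("VARIABLE", r"[a-zA-Z][a-zA-Z0-9]*", {}),
        ("ENDVAR", r"[^a-zA-Z0-9$]", {"pop": True, "more": True}),
    ],
}


def tokenize(modes, text):
    """Return (type, text) pairs, roughly the way a moded lexer would."""
    stack = ["DEFAULT"]  # mode stack; the last entry is the current mode
    tokens = []
    pos = 0
    start = 0  # where the token currently being built begins
    while pos < len(text):
        best = None
        for name, pattern, cmds in modes[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            # strict ">" keeps the earlier rule when two matches tie
            if m and (best is None or m.end() > best[0]):
                best = (m.end(), name, cmds)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        end, name, cmds = best
        if "push" in cmds:
            stack.append(cmds["push"])
        if cmds.get("pop"):
            stack.pop()
        pos = end
        if cmds.get("more"):
            # "more": keep the matched text and let the next rule decide
            # the token type. Simplification: text still pending at the
            # end of the input is dropped.
            continue
        tokens.append((cmds.get("type", name), text[start:pos]))
        start = pos
    return tokens


SAMPLE = "Hi there $this is some $variable and here $are$two"
print(tokenize(ORIGINAL, SAMPLE)[-2:])
# [('VARIABLE', 'are'), ('TEXT', '$two')]
print(tokenize(FIXED, SAMPLE)[-4:])
# [('DOLLAR', '$'), ('VARIABLE', 'are'), ('DOLLAR', '$'), ('VARIABLE', 'two')]
```

With the original rules, the second $ is consumed by ENDVAR, the mode is popped, and the pending $ becomes the start of the next TEXT token. With the extra '$' rule in VAR_MODE, the lexer stays in VAR_MODE and emits a DOLLAR followed by a VARIABLE.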
But, as mentioned by sepp2k in the comments: I'd also just create a single lexer rule that includes the $ and the name.
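If you go with the single rule, getting the bare name is just a matter of dropping the first character of the token text. A minimal Python sketch of that idea follows. The regex is a hand-written stand-in for the two lexer rules, not generated ANTLR code, and variable_names is a hypothetical helper; in a real listener or visitor you would apply the same slice to the VARIABLE token's text.

```python
import re

# Stand-in for the two lexer rules (an assumption for illustration):
#   VARIABLE: '$' [a-zA-Z] [a-zA-Z0-9]*;
#   TEXT:     ~[\\$]+;
TOKEN = re.compile(r"(?P<VARIABLE>\$[a-zA-Z][a-zA-Z0-9]*)|(?P<TEXT>[^\\$]+)")


def variable_names(text):
    """Collect variable names with the leading '$' removed."""
    names = []
    # Note: finditer silently skips characters that neither rule matches
    # (for example a lone '$'); a real lexer would report an error there.
    for m in TOKEN.finditer(text):
        if m.lastgroup == "VARIABLE":
            names.append(m.group()[1:])  # drop the leading '$'
    return names


print(variable_names("Hi there $this is some $variable and here $are$two"))
# ['this', 'variable', 'are', 'two']
```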