parsing antlr4 grammar

Parser recognizing variables


I am trying to generate a parser using ANTLR4.

The content I want to parse seems quite simple, but let's have a look at my grammar first:

Lexer:

DOLLAR: '$' -> pushMode(VAR_MODE); // as soon as there's an "$" jump into VAR_MODE

TEXT: ~[\\$]+; // everything except "\" and "$"


mode VAR_MODE;
// a variable name must start with a letter
VARIABLE: [a-zA-Z][a-zA-Z0-9]+; 
// as soon as there's something not a letter or a digit, pop back to default mode
// more makes sure the parsing will be continued after a variable
ENDVAR: ~[a-zA-Z0-9] -> popMode, more; 

Parser:

parse:
    content+
    ;

content:
      variable
    | text
    ;

variable
    : DOLLAR VARIABLE
    ;

text
    : TEXT+
    ;

A sample string would be: Hi there $this is some $variable and here $are$two

The parser compiles and works in general. The only thing that does not work is $two, the second of the two adjacent variables: it is recognized as text, not as a variable. What am I doing wrong? As soon as I add a space between $are and $two it works again, but I want to get rid of this space.

What is wrong with the grammar? It looks fine to me.

Is there another way of recognizing the variables, maybe without using an extra mode?

[Edit]

Another try

Lexer:

DOLLAR: '$' -> pushMode(VAR_MODE);
TEXT: ~[\\$]+;

mode VAR_MODE;
VARIABLE: [a-zA-Z][a-zA-Z0-9]*;
NEW_DOLLAR: '$' -> type(DOLLAR);
ENDVAR: ~[a-zA-Z0-9] -> popMode, more;

Parser:

parse:
    content+
    ;

content:
      variable
    | text
    ;

variable
    : DOLLAR VARIABLE
    | NEW_DOLLAR VARIABLE
    ;

text
    : TEXT+
    ;

It does work, even though it seems very wrong. Does anyone have a better idea?

[Edit 2]

A very simple solution would be

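// SampleLexer.g4
lexer grammar SampleLexer;

VARIABLE: '$'[a-zA-Z][a-zA-Z0-9]*;
TEXT: ~[\\$]+;

// SampleParser.g4
parser grammar SampleParser;

options { tokenVocab = SampleLexer; } // take the token types from SampleLexer
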
parse:
    content+
    ;

content:
      VARIABLE
    | TEXT
    ;

The disadvantage is that I have to strip the '$' from the token text myself to get the bare name. This is the best solution so far, but I would still like the lexer to split the '$' from the actual name.


Solution

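  • This happens because the $ of $two is matched by ENDVAR. ENDVAR pops back to the default mode and, because of the more command, keeps that $ and "glues" it onto the TEXT token that follows, making $two a single TEXT token. What you need to do is also match a $ inside VAR_MODE and exclude it from ENDVAR: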

    DOLLAR     : '$' -> pushMode(VAR_MODE);
    TEXT       : ~[\\$]+;
    
    mode VAR_MODE;
    
    VAR_DOLLAR : '$' -> type(DOLLAR);
    VARIABLE   : [a-zA-Z] [a-zA-Z0-9]*;
    ENDVAR     : ~[a-zA-Z0-9$] -> popMode, more;
    

    But, as sepp2k mentioned in the comments, I'd also just create a single lexer rule that includes both the $ and the name.
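
    A minimal sketch of the single-rule approach, which could also cover the wish to get the bare name without the $. It assumes the default Java target, because the embedded action calls the Java runtime's Lexer.getText() and Lexer.setText(). The grammar name Sample is made up for illustration, and other targets would need the action rewritten in their own language.

    ```antlr
    // Sample.g4 (hypothetical name): a combined grammar, so no lexer modes are needed.
    // Assumption: Java target; the code between { and } is a Java lexer action.
    grammar Sample;

    parse
        : content+
        ;

    content
        : VARIABLE
        | TEXT
        ;

    // Match '$' and the name as one token, then drop the leading '$'
    // so that the token's text is just the name.
    VARIABLE
        : '$' [a-zA-Z] [a-zA-Z0-9]* { setText(getText().substring(1)); }
        ;

    // Everything except a backslash or '$', as in the question's TEXT rule.
    TEXT
        : ~[\\$]+
        ;
    ```

    With this, the text of the VARIABLE token for $two would be two rather than $two, while the token's start and stop indexes still cover the $ in the input. As with the grammar in Edit 2, a $ that is not followed by a letter matches no rule, so the lexer reports a token recognition error for it.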