
Lexer tokens disappear in a specific mode


I have this lexer config:

WS
    : ((' ' | '\t' | '\r' | '\n')+ | '\\' '\n') -> skip
    ;

T_QUOTED
    : '"'
    ;

T_CONFDIR_MYDIR
    : 'MyDirective' -> pushMode(mydir)
    ;

T_COMMENT
    : '#' .*? '\r'? '\n'
    ;

mode mydir;

T_MYDIRARG
    : ~([\\" ])+ -> popMode
    ;

And this is the input:

MyDirective "LiteralString"

When I try to parse it (with the Python target), I get these errors:

line 4:21 token recognition error at: ' '
line 4:22 token recognition error at: '"'
line 4:23 extraneous input 'LiteralString' expecting '"'
line 5:0 mismatched input '<EOF>' expecting T_MYDIRARG

It seems that once the lexer enters the mydir mode, the tokens defined in the default mode (WS, T_QUOTED) are no longer available.

Why doesn't the lexer recognize the space and the " character, given that they are defined as WS and T_QUOTED?

What would be the expected solution?

Thanks.


Solution

  • After the input MyDirective, the lexer switches to the mydir mode. The next character is a space, which mydir does not recognize.

    A mode can only match the rules defined under it, not rules from other modes (and not those from the default mode either). In other words, in your case, mydir only recognizes T_MYDIRARG tokens, so WS and T_QUOTED cannot match until the mode is popped.

    It looks like what you're after is something like this:

    WS
        : ((' ' | '\t' | '\r' | '\n')+ | '\\' '\n') -> skip
        ;
    
    T_QUOTED_OPEN
        : '"' -> pushMode(mydir)
        ;
    
    T_CONFDIR_MYDIR
        : 'MyDirective'
        ;
    
    T_COMMENT
        : '#' .*? '\r'? '\n'
        ;
    
    mode mydir;
    
    T_QUOTED_CLOSE
        : '"' -> popMode
        ;
    
    T_MYDIRARG
        : ~([\\" ])+
        ;
    

    which will produce the following:

    5 tokens:
      1    T_CONFDIR_MYDIR                'MyDirective'
      2    T_QUOTED_OPEN                  '"'
      3    T_MYDIRARG                     'LiteralString'
      4    T_QUOTED_CLOSE                 '"'
      5    EOF                            '<EOF>'
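
To make the mode scoping described above concrete, here is a minimal Python sketch of a mode-stack tokenizer. It is only an illustration and not how ANTLR is implemented: the `tokenize` function, the rule tables `RULES` and `FIXED`, and the `"skip"`/`"push:..."`/`"pop"` action strings are all hypothetical names, and the regexes are hand-written approximations of the grammar rules. The one behaviour it models is that only the rules of the mode on top of the stack are tried.

```python
import re

# Hypothetical rule tables: one list of (token_name, regex, action) per mode.
# RULES approximates the grammar from the question.
RULES = {
    "DEFAULT": [
        ("WS", r"[ \t\r\n]+|\\\n", "skip"),
        ("T_QUOTED", r'"', None),
        ("T_CONFDIR_MYDIR", r"MyDirective", "push:mydir"),
        ("T_COMMENT", r"#.*?\r?\n", None),
    ],
    "mydir": [
        ("T_MYDIRARG", r'[^\\" ]+', "pop"),
    ],
}

# FIXED approximates the grammar from the answer: the quote enters and leaves the mode.
FIXED = {
    "DEFAULT": [
        ("WS", r"[ \t\r\n]+|\\\n", "skip"),
        ("T_QUOTED_OPEN", r'"', "push:mydir"),
        ("T_CONFDIR_MYDIR", r"MyDirective", None),
        ("T_COMMENT", r"#.*?\r?\n", None),
    ],
    "mydir": [
        ("T_QUOTED_CLOSE", r'"', "pop"),
        ("T_MYDIRARG", r'[^\\" ]+', None),
    ],
}


def tokenize(text, rules):
    """Return (tokens, errors), trying only the rules of the current mode."""
    stack, pos, tokens, errors = ["DEFAULT"], 0, [], []
    while pos < len(text):
        best = None
        # Longest match wins; the earlier rule wins on a tie.
        for name, pattern, action in rules[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), action)
        if best is None:
            # No rule of the current mode matches: report the character and move on.
            errors.append(text[pos])
            pos += 1
            continue
        name, end, action = best
        if action != "skip":
            tokens.append((name, text[pos:end]))
        if action == "pop":
            stack.pop()
        elif action and action.startswith("push:"):
            stack.append(action[len("push:"):])
        pos = end
    return tokens, errors


if __name__ == "__main__":
    for label, table in (("question", RULES), ("answer", FIXED)):
        tokens, errors = tokenize('MyDirective "LiteralString"\n', table)
        print(label, tokens, errors)
```

With `RULES`, the space and the opening quote end up in the error list, because the mydir table contains no rule that can match them. With `FIXED`, the same input tokenizes without errors, since whitespace is consumed in the default mode before the quote pushes mydir.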