php token antlr4 lexer

How can an ANTLR4 lexer consume arbitrary characters into one token and stop where an existing rule matches?


Can an ANTLR4 lexer rule consume arbitrary characters and stop where another existing rule starts to match? I expect it to collect as many characters as possible into a single token.

A reduced version of the grammar:

lexer grammar PhpLexer;

options {
    superClass = PhpLexerBase;
    caseInsensitive = true;
}

T_OPEN_TAG_WITH_ECHO: '<?='  -> pushMode(PHP);
T_OPEN_TAG: PhpOpenTag -> pushMode(PHP);

T_INLINE_HTML: .+?;      // Problem Point

mode PHP;
   T_CLOSE_TAG: '?>';
   T_BAD_CHARACTER: .;

fragment NEWLINE: '\r'? '\n' | '\r';

fragment PhpOpenTag
    : '<?php' ([ \t] | NEWLINE)
    | '<?php' EOF
    ;

Input:

<html><?php echo "Hello, world!"; ?></html>

Got:

T_INLINE_HTML -> "<"
T_INLINE_HTML -> "h"
T_INLINE_HTML -> "t"
T_INLINE_HTML -> "m"
T_INLINE_HTML -> "l"
T_INLINE_HTML -> ">"
T_OPEN_TAG -> "<?php "
……

Expect:

T_INLINE_HTML -> "<html>"
T_OPEN_TAG -> "<?php "
……

Solution

  • Note that T_INLINE_HTML: .+?; behaves the same as T_INLINE_HTML: .; because a non-greedy +? at the end of a lexer rule stops as soon as it can. Both will always match a single char.

    Try something like this:

    T_INLINE_HTML
     : T_INLINE_HTML_ATOM+
     ;
    
    fragment T_INLINE_HTML_ATOM
     : ~'<'               // match a char other than '<'
     | '<' ~'?'           // match a '<' followed by something other than '?'
     | '<?' ~[p=]         // match '<?' followed by something other than 'p' and '='
     | '<?p' ~'h'         // match '<?p' followed by something other than 'h'
     | '<?ph' ~'p'        // match '<?ph' followed by something other than 'p'
     | '<?php' ~[ \t\r\n] // match '<?php' followed by something other than a space char
     ;