Can an ANTLR4 lexer rule consume as many characters as possible and stop where another rule would match? I expect a run of characters to be consumed into a single token.
A reduced version of the grammar:
lexer grammar PhpLexer;
options {
superClass = PhpLexerBase;
caseInsensitive = true;
}
T_OPEN_TAG_WITH_ECHO: '<?=' -> pushMode(PHP);
T_OPEN_TAG: PhpOpenTag -> pushMode(PHP);
T_INLINE_HTML: .+?; // Problem Point
mode PHP;
T_CLOSE_TAG: '?>';
T_BAD_CHARACTER: .;
fragment NEWLINE: '\r'? '\n' | '\r';
fragment PhpOpenTag
: '<?php' ([ \t] | NEWLINE)
| '<?php' EOF
;
Input:
<html><?php echo "Hello, world!"; ?></html>
Got:
T_INLINE_HTML -> "<"
T_INLINE_HTML -> "h"
T_INLINE_HTML -> "t"
T_INLINE_HTML -> "m"
T_INLINE_HTML -> "l"
T_INLINE_HTML -> ">"
T_OPEN_TAG -> "<?php "
……
Expect:
T_INLINE_HTML -> "<html>"
T_OPEN_TAG -> "<?php "
……
Note that T_INLINE_HTML: .+?; produces the same result as T_INLINE_HTML: .; because a non-greedy quantifier at the end of a lexer rule stops as soon as it can, so both will always match a single char.
Try something like this:
T_INLINE_HTML
: T_INLINE_HTML_ATOM+
;
fragment T_INLINE_HTML_ATOM
: ~'<' // match a char other than '<'
| '<' ~'?' // match a '<' followed by something other than '?'
| '<?' ~[p=] // match '<?' followed by something other than 'p' and '='
| '<?p' ~'h' // match '<?p' followed by something other than 'h'
| '<?ph' ~'p' // match '<?ph' followed by something other than 'p'
| '<?php' ~[ \t\r\n] // match '<?php' followed by something other than a space char
;
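
As a rough sanity check of the rule above, the same alternatives can be emulated with Python's re module. This is only a sketch: Python regexes are not the ANTLR lexer, re.IGNORECASE merely stands in for caseInsensitive = true, and the names ATOM_ALTERNATIVES, INLINE_HTML, LAZY_ANY and first_token are made up for this illustration.

```python
import re

# One regex alternative per alternative of T_INLINE_HTML_ATOM.
ATOM_ALTERNATIVES = [
    r"[^<]",              # a char other than '<'
    r"<[^?]",             # '<' followed by something other than '?'
    r"<\?[^p=]",          # '<?' followed by something other than 'p' and '='
    r"<\?p[^h]",          # '<?p' followed by something other than 'h'
    r"<\?ph[^p]",         # '<?ph' followed by something other than 'p'
    r"<\?php[^ \t\r\n]",  # '<?php' followed by something other than a space char
]

# Emulates T_INLINE_HTML: T_INLINE_HTML_ATOM+ (greedy, case-insensitive).
INLINE_HTML = re.compile("(?:" + "|".join(ATOM_ALTERNATIVES) + ")+", re.IGNORECASE)

# Emulates the original T_INLINE_HTML: .+? (lazy, nothing after it).
LAZY_ANY = re.compile(r".+?", re.DOTALL)


def first_token(pattern, text):
    """Return the text the pattern matches at the start of `text`, or ''."""
    match = pattern.match(text)
    return match.group(0) if match else ""


source = '<html><?php echo "Hello, world!"; ?></html>'
print(repr(first_token(LAZY_ANY, source)))     # '<'
print(repr(first_token(INLINE_HTML, source)))  # '<html>'
```

The lazy pattern stops after one character, while the atom-based pattern runs up to the opening tag and leaves '<?php ' for T_OPEN_TAG.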