I am still learning to work with ANTLR4 to reimplement a grammar for a language I am familiar with. I have been using lab.antlr.org to help with debugging. I have the following grammar defined:
grammar lexrules;
program
: statement+
;
statement
: output_statement
;
test_number
: DIGIT DIGIT DIGIT DIGIT
;
step_number
: DIGIT DIGIT
;
statno
: test_number step_number
| SPACE SPACE SPACE SPACE step_number
| SPACE SPACE SPACE SPACE SPACE SPACE
;
fstatno
: 'E' statno ws+
| ws statno ws+
;
output_statement
: fstatno 'OUTPUT' fd using fd formatted_output+ STATEMENT_TERMINATOR
| fstatno 'OUTPUT' fd using fd unformatted_output STATEMENT_TERMINATOR
;
formatted_output
: '(' '(' formatted_output_formats (fd formatted_output_formats)* ')' (data_store (fd data_store)*)? ')'
;
formatted_output_formats
: ( output_tab | output_int | output_space | output_text )
;
output_tab
: 'T' DIGIT+
;
output_int
: 'I' DIGIT+ (PERIOD DIGIT+)?
;
output_space
: DIGIT+ 'X'
;
output_text
: SINGLEQUOTE ( ALPHA_UPPER | ALPHA_LOWER | DIGIT | ws | HYPHEN | COMMA | PERIOD )* SINGLEQUOTE
;
data_store
: label
;
unformatted_output
: text
;
using
: 'USING' ws+ label
;
fd
: ws* COMMA ws*
;
label
: SINGLEQUOTE ( ALPHA_UPPER | ALPHA_LOWER )( ALPHA_UPPER | ALPHA_LOWER | DIGIT )* SINGLEQUOTE
;
text
: ( ALPHA_UPPER | ALPHA_LOWER | DIGIT | ws | HYPHEN | SINGLEQUOTE )+
;
ws
: SPACE
| TAB
;
SINGLEQUOTE: '\'' ;
STATEMENT_TERMINATOR: '$' ;
ALPHA_UPPER: [A-Z] ;
ALPHA_LOWER: [a-z] ;
DIGIT: [0-9] ;
HYPHEN: '-' ;
COMMA: ',' ;
PERIOD: '.' ;
SPACE: ' ' ;
TAB: '\t' ;
NL: [\n\r]+ -> skip ;
And I have the following text input:
100015 OUTPUT, USING 'FOO', 'BAR' $
This will parse just fine, but if I try this input:
100015 OUTPUT, USING 'CRT', 'BAR' $
I will get:
1:25 no viable alternative at input ' 100015 OUTPUT, USING 'CRT'
If I change the rule for output_text to start with anything other than 'C', 'R', or 'T', then it seems to parse. I am a bit lost as to why this is happening.
The 'T', 'I' etc. are not being tokenized as ALPHA_UPPER tokens because you added them as "literal tokens" inside parser rules. So the T in CRT fails to be parsed as part of a label:
100015 OUTPUT, USING 'CRT', 'BAR' $
                        ^
                        |
                        not an ALPHA_UPPER
When you do this:
parser_rule
: 'T' ALPHA_UPPER
;
ALPHA_UPPER
: [A-Z]
;
ANTLR will translate that into the following grammar:
parser_rule
: T__0 ALPHA_UPPER
;
T__0
: 'T'
;
ALPHA_UPPER
: [A-Z]
;
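In other words, ANTLR turns every "literal token" into an implicit lexer rule and places it before all your other lexer rules. The lexer works independently of the parser, and when two rules match the same input it picks the one defined first, so a T can only be tokenized one way: it always becomes a T__0 token, never an ALPHA_UPPER.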
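To make the tie-breaking concrete, here is a minimal Python sketch of that behaviour. It is not ANTLR's implementation: the `tokenize` helper and the `RULES` table are hypothetical, and the rule names simply mirror the small example above. It assumes only the two principles described here: the longest match wins, and on equal length the rule listed first wins.

```python
import re

# Lexer rules in priority order. The literal 'T' comes first, mirroring how
# implicit literal tokens are placed ahead of the named lexer rules.
RULES = [
    ("T__0", r"T"),
    ("SINGLEQUOTE", r"'"),
    ("ALPHA_UPPER", r"[A-Z]"),
]


def tokenize(text):
    """Return token type names using longest match, earliest rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best_name)
        pos += best_len
    return tokens


print(tokenize("'CRT'"))
# ['SINGLEQUOTE', 'ALPHA_UPPER', 'ALPHA_UPPER', 'T__0', 'SINGLEQUOTE']
```

The C and R come out as ALPHA_UPPER, but the T does not, which is why a parser rule that only accepts ALPHA_UPPER cannot match the whole label.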
I recommend not using "literal tokens" and doing something like this instead:
parser_rule
: T alpha_upper
;
alpha_upper
: T
| ALPHA_UPPER
;
T
: 'T'
;
ALPHA_UPPER
: [A-Z]
;
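The effect of widening the parser rule can be sketched in Python by checking a list of token types against the shape of your label rule. This is only an illustration under stated assumptions: `matches_label` is a hypothetical helper, and the token list is written out by hand to reflect a lexer that has a named T rule ahead of ALPHA_UPPER.

```python
# Token types a lexer with a named T rule would produce for the input 'CRT'.
CRT_TOKENS = ["SINGLEQUOTE", "ALPHA_UPPER", "ALPHA_UPPER", "T", "SINGLEQUOTE"]


def matches_label(tokens, letter_types):
    """Check tokens against: SINGLEQUOTE letter (letter | DIGIT)* SINGLEQUOTE."""
    if len(tokens) < 3:
        return False
    if tokens[0] != "SINGLEQUOTE" or tokens[-1] != "SINGLEQUOTE":
        return False
    body = tokens[1:-1]
    if body[0] not in letter_types:
        return False
    return all(t in letter_types or t == "DIGIT" for t in body[1:])


original = {"ALPHA_UPPER", "ALPHA_LOWER"}  # what the label rule accepts today
widened = original | {"T"}                 # alpha_upper : T | ALPHA_UPPER

print(matches_label(CRT_TOKENS, original))  # False
print(matches_label(CRT_TOKENS, widened))   # True
```

In your grammar the same widening would be needed for the other single-letter literals ('E', 'I' and 'X'), since each of them also gets its own token type.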
Note that your keywords OUTPUT and USING will not be matched as ALPHA_UPPER either. So you might want to include those in your text rule:
text
: ( OUTPUT | USING | alpha_upper | ALPHA_LOWER | DIGIT | ws | HYPHEN | SINGLEQUOTE )+
;
OUTPUT : 'OUTPUT';
USING : 'USING';
Finally, be sure to have a parser rule containing the built-in EOF token (end-of-file token). This forces the parser to consume all tokens:
program
: statement+ EOF
;
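Why the EOF matters can be shown with a toy Python model of the program rule. This is a sketch, not generated ANTLR code: `parse_statements`, `parse_program` and the token names STMT and JUNK are made up for illustration.

```python
def parse_statements(tokens):
    """Consume as many 'STMT' tokens as possible; return how many were consumed."""
    pos = 0
    while pos < len(tokens) and tokens[pos] == "STMT":
        pos += 1
    if pos == 0:
        raise SyntaxError("expected at least one statement")
    return pos


def parse_program(tokens, require_eof):
    """Model of 'program : statement+ EOF?' over a list of token names."""
    pos = parse_statements(tokens)
    if require_eof and (pos >= len(tokens) or tokens[pos] != "EOF"):
        found = tokens[pos] if pos < len(tokens) else "<nothing>"
        raise SyntaxError(f"unexpected token {found!r} at index {pos}")
    return pos


tokens = ["STMT", "STMT", "JUNK", "EOF"]

# Without EOF the rule stops after two statements and ignores the rest.
print(parse_program(tokens, require_eof=False))  # 2

# With EOF the leftover token is reported instead of being skipped.
try:
    parse_program(tokens, require_eof=True)
except SyntaxError as err:
    print("rejected:", err)
```

Without the EOF check, the first call returns normally even though the input contains a token no rule accepts. That is the situation the EOF in the program rule guards against.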