antlr antlr4

ANTLR4 no viable alternative at input


I am still learning to work with ANTLR4 to reimplement a grammar for a language I am familiar with. I have been using lab.antlr.org to help with debugging. I have the following grammar defined:

grammar lexrules;

program
    : statement+
    ;

statement
    : output_statement
    ;

test_number
    : DIGIT DIGIT DIGIT DIGIT
    ;

step_number
    : DIGIT DIGIT
    ;

statno
    : test_number step_number
    | SPACE SPACE SPACE SPACE step_number
    | SPACE SPACE SPACE SPACE SPACE SPACE
    ;

fstatno
    : 'E' statno ws+
    | ws statno ws+
    ;

output_statement
    : fstatno 'OUTPUT' fd using fd formatted_output+ STATEMENT_TERMINATOR
    | fstatno 'OUTPUT' fd using fd unformatted_output STATEMENT_TERMINATOR
    ;

formatted_output
    : '(' '(' formatted_output_formats (fd formatted_output_formats)* ')' (data_store (fd data_store)*)?   ')'
    ;

formatted_output_formats
    : ( output_tab | output_int | output_space | output_text )
    ;

output_tab
    : 'T' DIGIT+
    ;

output_int
    : 'I' DIGIT+ (PERIOD DIGIT+)?
    ;

output_space
    : DIGIT+ 'X'
    ;

output_text
    : SINGLEQUOTE ( ALPHA_UPPER | ALPHA_LOWER | DIGIT | ws | HYPHEN | COMMA | PERIOD )* SINGLEQUOTE
    ;

data_store
    : label
    ;

unformatted_output
    : text
    ;

using
    : 'USING' ws+ label
    ;

fd
    : ws* COMMA ws*
    ;

label
    : SINGLEQUOTE ( ALPHA_UPPER | ALPHA_LOWER )( ALPHA_UPPER | ALPHA_LOWER | DIGIT )* SINGLEQUOTE
    ;

text
    : ( ALPHA_UPPER | ALPHA_LOWER | DIGIT | ws | HYPHEN | SINGLEQUOTE )+
    ;

ws
    : SPACE
    | TAB
    ;

SINGLEQUOTE: '\'' ;
STATEMENT_TERMINATOR: '$' ;

ALPHA_UPPER: [A-Z] ;
ALPHA_LOWER: [a-z] ;
DIGIT: [0-9] ;
HYPHEN: '-' ;
COMMA: ',' ;
PERIOD: '.' ;
SPACE: ' ' ;
TAB: '\t' ;

NL: [\n\r]+ -> skip ;

And I have the following text input:

 100015 OUTPUT, USING 'FOO', 'BAR' $

This will parse just fine, but if I try this input:

 100015 OUTPUT, USING 'CRT', 'BAR' $

I will get:

1:25 no viable alternative at input ' 100015 OUTPUT, USING 'CRT'

If I change the rule for output_text to start with anything other than 'C', 'R', or 'T', then it seems to parse. I am a bit lost as to why this is happening.


Solution

  • Letters such as 'T' and 'I' are not being tokenized as ALPHA_UPPER tokens because you used them as "literal tokens" inside parser rules (output_tab and output_int). As a result, the T in CRT is not an ALPHA_UPPER, so 'CRT' cannot be parsed as a label:

    100015 OUTPUT, USING 'CRT', 'BAR' $
                            ^
                            |
                             not an ALPHA_UPPER
    

    When you do this:

    parser_rule
     : 'T' ALPHA_UPPER
     ;
    
    ALPHA_UPPER
     : [A-Z]
     ;
    

    ANTLR will translate that into the following grammar:

    parser_rule
     : T__0 ALPHA_UPPER
     ;
    
    T__0
     : 'T'
     ;
    
    ALPHA_UPPER
     : [A-Z]
     ;
    

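    I.e., it moves all "literal tokens" to the top of the lexer rules. The lexer runs independently of the parser: it picks the longest match and, when several rules match the same text, the rule defined first. So input like T can be tokenized in only one way: ANTLR will always create a T__0 token for it, never an ALPHA_UPPER.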
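
    For the grammar in the question, the single-letter literals are 'E' (in fstatno), 'T' (in output_tab), 'I' (in output_int) and 'X' (in output_space). A rough sketch of the lexer ANTLR effectively works with is shown below. The T__n names and their numbering are only illustrative (an assumption, not taken from generated output), and the remaining lexer rules are omitted:

    // implicit tokens created from the literals used in parser rules
    T__0 : 'E' ;
    T__1 : 'OUTPUT' ;
    T__2 : '(' ;
    T__3 : ')' ;
    T__4 : 'T' ;
    T__5 : 'I' ;
    T__6 : 'X' ;
    T__7 : 'USING' ;

    // defined after the implicit tokens, so it never produces E, T, I or X
    ALPHA_UPPER : [A-Z] ;

    Under that reading, any label containing E, I, T or X fails in the same way, while 'FOO' and 'BAR' parse because none of their letters collide with a literal.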

    I recommend not using "literal tokens" and doing something like this instead:

    parser_rule
     : T alpha_upper
     ;
    
    alpha_upper
     : T
     | ALPHA_UPPER
     ;
    
    T
     : 'T'
     ;
    
    ALPHA_UPPER
     : [A-Z]
     ;
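
    Applied to the grammar in the question, that idea might look like the untested sketch below. The token names E, I, T and X and the helper rule letter are hypothetical names chosen for illustration (alpha_upper mirrors the example above), and only some of the rules that change are shown:

    label
        : SINGLEQUOTE letter ( letter | DIGIT )* SINGLEQUOTE
        ;

    letter
        : alpha_upper
        | ALPHA_LOWER
        ;

    // every token type the lexer can produce for a single uppercase letter
    alpha_upper
        : E
        | I
        | T
        | X
        | ALPHA_UPPER
        ;

    output_tab
        : T DIGIT+
        ;

    // the named tokens must come before ALPHA_UPPER,
    // otherwise ALPHA_UPPER wins the tie and they are never produced
    E : 'E' ;
    I : 'I' ;
    T : 'T' ;
    X : 'X' ;
    ALPHA_UPPER : [A-Z] ;

    The rules fstatno, output_int and output_space would reference E, I and X in the same way, and output_text would need alpha_upper in place of ALPHA_UPPER.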
    

    Note that your keywords OUTPUT and USING are always tokenized as single keyword tokens, so they will not be matched as a sequence of ALPHA_UPPER tokens either. You might therefore want to include them in your text rule:

    text
        : ( OUTPUT | USING | alpha_upper | ALPHA_LOWER | DIGIT | ws | HYPHEN | SINGLEQUOTE )+
        ;
    
    OUTPUT : 'OUTPUT';
    USING : 'USING';
    

    Finally, be sure to have a parser rule containing the built-in EOF token (end-of-file token). This forces the parser to consume all tokens:

    program
        : statement+ EOF
        ;