java, antlr, antlr4

ANTLR4: no viable alternative at input 'stringname'


I am building a programming language with ANTLR4 as a research project, and I have been struggling all day with one problem: two words seem to be treated as a single token after whitespace is removed.

This is my ANTLR grammar:

grammar Grammar;

start: (statement ';')*;

//needs expressions extension
statement
    : variable
    | //class
    | if
    | function
    | loop
    | functionCall
    | show
    ;

variable
    : TYPE ID ('=' VAR_TYPE)?
    | ...
    ;

array 
    : TYPE ID '[]' ('=' '[' VAR_TYPE (',' VAR_TYPE)* ']')?
    ;

//needs expressions extension
function
    : (ACCESS TYPE ID '(' ID* ')' '{' 
        (
            variable
            | if
            | loop
            | functionCall
        ) 'return' VAR_TYPE
      '}')
    | (ACCESS 'void' ID '(' ID* ')' '{' 
        (
            variable
            | if
            | loop
            | functionCall
        )
      '}')
    ;

//needs expressions extension
if: 'if' (ID | VAR_TYPE) COMPARISON (ID | VAR_TYPE) ':'
        (
            '\t' variable
            | '\t' if
            | '\t' loop
            | '\t' functionCall
            | '\t' show
        )*
    ('else if' (ID | VAR_TYPE) COMPARISON (ID | VAR_TYPE) ':'
        (
            '\t' variable
            | '\t' if
            | '\t' loop
            | '\t' functionCall
            | '\t' show
        )*
    )*
    ('else' ':'
        (
            '\t' variable
            | '\t' if
            | '\t' loop
            | '\t' functionCall
            | '\t' show
        )*
    )?
    ;

loop: 'foreach' ID 'in' ID ':'
    (
        '\t' variable
        | '\t' if
        | '\t' loop
        | '\t' functionCall
        | '\t' show
    )*
    ;

functionCall: (ID '.')? ID '()';

//needs expressions extension
show: 'show' '(' (ID | VAR_TYPE)? ('+' (ID | VAR_TYPE))* ')';

ACCESS: 'private' | 'public';
COMPARISON: '>' | '<' | '>=' | '<=' | '==';
TYPE: 'int' | 'float' | 'string';
VAR_TYPE: STRING | INT | BOOL | FLOAT | ID;
ID: [a-zA-Z_][a-zA-Z0-9_]* ;
STRING : '"' .*? '"' ;
INT : [0-9]+ ;
BOOL : 'true' | 'false' ;
FLOAT : [0-9]+ '.' [0-9]+ ;
WS : [ \t\r\n]+ -> skip;

This is what the console prints when the parse tree is built:

line 1:7 no viable alternative at input 'stringname'
line 2:4 no viable alternative at input 'intage'

And here is the input.txt file that I am parsing:

string name;
int age;
bool sex;
string children[];

public string returnPerson() {
    return "Name " + name + "\nAge " + age + "\nSex " + sex + "\n";
}

public bool isMinor() {
    if age > 17:
        return false;
    else:
        return true;
}

public void showChildren() {
    int i = 0;
    foreach child in children:
        show("Children №" + (i + 1) + ": " + child + "\n");
}

I don't know what to do about this. I have whitespace sorted out, but the parser still seems to think the two words are one token. Also, from the output tree I can see that it doesn't get past the first two lines of input.txt.

Please help me fix this problem.


Solution

  • Your lexer will never produce an ID token because of this:

    VAR_TYPE: STRING | INT | BOOL | FLOAT | ID;
    ID: [a-zA-Z_][a-zA-Z0-9_]* ;
    

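    VAR_TYPE matches everything that ID matches, and it is defined first. ANTLR's lexer works like this:

    1. try to match a rule with as many characters as possible
    2. if 2 (or more) rules match the same number of characters, let the one defined first "win"

    Because of rule 2, ID will never get matched. In the input string name, string becomes a TYPE token and name becomes a VAR_TYPE token, so TYPE ID cannot match. The 'stringname' in the error message is only the text of those two tokens joined without the skipped whitespace; they are still two separate tokens.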
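
    The two rules above can be sketched in plain Java. This is a toy model and not ANTLR's actual implementation: the class name MiniLexer, the tokenize method and the regex versions of the rules are assumptions made for illustration, and VAR_TYPE is cut down to its ID and INT alternatives.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of "longest match wins, ties go to the rule defined first".
// Hypothetical helper for illustration; it is not part of ANTLR.
final class MiniLexer {
    // Rules in definition order, mirroring (a simplified part of) the question's lexer.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("TYPE", Pattern.compile("int|float|string"));
        RULES.put("VAR_TYPE", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*|[0-9]+"));
        RULES.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the rule defined earlier is kept.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestRule = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestRule.equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(bestRule + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("string name")); // [TYPE:string, VAR_TYPE:name]
        System.out.println(tokenize("int age"));     // [TYPE:int, VAR_TYPE:age]
    }
}
```

    In this model ID never appears in the output: every identifier that ID could match is matched by VAR_TYPE with the same length, and VAR_TYPE is defined first.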

    VAR_TYPE seems a better candidate for a parser rule:

    var_type : STRING | INT | BOOL | FLOAT | ID;
    

    But there are quite a few other things incorrect with the grammar you posted. If you define '()' in your grammar, then a single '(' token will not be matched. When creating literal tokens inside parser rules, ANTLR creates tokens for them like this:

    functionCall: (ID '.')? ID '()';
    show: 'show' '(' (ID | VAR_TYPE)? ('+' (ID | VAR_TYPE))* ')';
    
    T__0 : '.';
    T__1 : '()';
    T__2 : 'show';
    T__3 : '(';
    T__4 : ')';
    ...
    

    If you now try to parse the input:

    public string returnPerson() {
        return "Name " + name + "\nAge " + age + "\nSex " + sex + "\n";
    }
    

    using the parser rule:

    function
     : ACCESS TYPE ID '(' ...
     ;
    

    it will fail, because () is tokenized as a T__1 token, not as T__3 and T__4 tokens.
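
    The effect of defining '()' as a literal can be reproduced with a small longest-literal splitter. Again this is only a sketch under stated assumptions: LiteralMunch and split are hypothetical names, and the code models nothing more than the "as many characters as possible" rule applied to a list of literals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of how a '()' literal swallows the input "()".
// Hypothetical helper for illustration; it is not part of ANTLR.
final class LiteralMunch {
    // At each position, take the longest literal that matches there.
    static List<String> split(String input, List<String> literals) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String best = "";
            for (String literal : literals) {
                if (input.startsWith(literal, pos) && literal.length() > best.length()) {
                    best = literal;
                }
            }
            if (best.isEmpty()) {
                throw new IllegalArgumentException("no literal matches at index " + pos);
            }
            out.add(best);
            pos += best.length();
        }
        return out;
    }

    public static void main(String[] args) {
        // With '()' defined, the input "()" becomes one token.
        System.out.println(split("()", Arrays.asList("()", "(", ")"))); // [()]
        // Without it, the same input becomes '(' followed by ')'.
        System.out.println(split("()", Arrays.asList("(", ")")));       // [(, )]
    }
}
```

    This is why writing '(' ')' as two separate literals in every parser rule, and never '()', keeps rules such as function and functionCall consistent with each other.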

    EDIT

    Also, BOOL : 'true' | 'false' ; will never get matched because of the 2 match-rules I mentioned earlier (true and false will also be matched as VAR_TYPE tokens).

    Here's a quick edit of your grammar so that it will correctly parse your input:

    grammar Grammar;
    
    start : statement* EOF;
    
    statement
     : variable ';'
     | array ';'
     | if
     | function
     | loop
     | functionCall ';'
     | show ';'
     | 'return' expression ';'
     ;
    
    function
     : ACCESS TYPE ID '(' ID* ')' '{' statement* '}'
     | ACCESS 'void' ID '(' ID* ')' '{' statement* '}'
     ;
    
    variable     : TYPE ID ('=' expression)?;
    array        : TYPE ID '[' ']' ('=' '[' expression (',' expression)* ']')?;
    if           : 'if' expression ':' statement* ('else if' expression ':' statement*)* ('else' ':' statement*)?;
    loop         : 'foreach' ID 'in' expression ':' statement*;
    functionCall : (ID '.')? ID '(' ')';
    show         : 'show' '(' expression ')';
    
    expression
     : '(' expression ')'
     | expression '+' expression
     | expression COMPARISON expression
     | STRING
     | ID
     | INT
     | BOOL
     | FLOAT
     ;
    
    ACCESS     : 'private' | 'public';
    COMPARISON : '>' | '<' | '>=' | '<=' | '==';
    TYPE       : 'int' | 'float' | 'string' | 'bool';
    BOOL       : 'true' | 'false' ;
    ID         : [a-zA-Z_][a-zA-Z0-9_]* ;
    STRING     : '"' (~[\\"] | '\\' .)* '"';
    INT        : [0-9]+;
    FLOAT      : [0-9]+ '.' [0-9]+;
    WS         : [ \t\r\n]+ -> channel(HIDDEN);
    
