Lexing Issue in ANTLR4 Grammar for Fortran 2018: Token Misclassification


I am developing a Fortran 2018 grammar in ANTLR4, following the ISO standard. During lexing, some tokens are misclassified: the single-letter edit descriptors used in FORMAT statements (for example I) come out as NAME. Below is a minimal grammar that demonstrates the problem:

Grammar: FortranTestF18.g4

grammar FortranTestF18;

//LEXER RULES

LINE_COMMENT : '!' .*? '\r'? '\n' -> skip ;

BLOCK_COMMENT: '/*' .*? '*/' -> skip;

WS: [ \t\r\n]+ -> skip;

PROGRAM: 'PROGRAM' | 'Program' | 'program';

END: 'END' | 'End' | 'end';

COMMA: ',';

LPAREN: '(';

RPAREN: ')';

ASTERIK: '*';

NONE: 'NONE' | 'None' | 'none';

IMPLICIT: 'IMPLICIT' | 'Implicit' | 'implicit';

FORMAT: 'FORMAT' | 'Format' | 'format';

PLUS: '+';

// R765 binary-constant -> B ' digit [digit]... ' | B " digit [digit]... "
BINARYCONSTANT: B APOSTROPHE DIGIT+ APOSTROPHE | B QUOTE DIGIT+ QUOTE;

// R766 octal-constant -> O ' digit [digit]... ' | O " digit [digit]... "
OCTALCONSTANT: O APOSTROPHE DIGIT+ APOSTROPHE | O QUOTE DIGIT+ QUOTE;

//R0003 RepChar
APOSTROPHEREPCHAR: APOSTROPHE (~[\u0000-\u001F\u0027])*  APOSTROPHE;

QUOTEREPCHAR: QUOTE (~[\u0000-\u001F\u0022])*  QUOTE;

APOSTROPHE: '\'';

QUOTE: '"';

DOT: '.';

C: 'C';

// R603 name -> letter [alphanumeric-character]...
NAME: LETTER (ALPHANUMERICCHARACTER)*;

// R711 digit-string -> digit [digit]...
DIGITSTRING: DIGIT+; 

MINUS: '-';

B: 'B';

O: 'O';

Z: 'Z';

A: 'A';

F: 'F';

D: 'D';

E: 'E';

I: 'I';

G: 'G';

L: 'L';

DT: 'DT';

EN: 'EN';

ES: 'ES';

EX: 'EX';

T: 'T';

TL: 'TL';

TR: 'TR';

X: 'X';

SS: 'SS';

SP: 'SP';

S: 'S';

BN: 'BN';

BZ: 'BZ';

RU: 'RU';

RD: 'RD';

RZ: 'RZ';

RN: 'RN';

RC: 'RC';

RP: 'RP';

DC: 'DC';

DP: 'DP';

P: 'P';

// R602 UNDERSCORE -> _
UNDERSCORE: '_';

// R601 alphanumeric-character -> letter | digit | underscore
ALPHANUMERICCHARACTER: LETTER | DIGIT | UNDERSCORE;

// R0002 Letter ->
//         A | B | C | D | E | F | G | H | I | J | K | L | M |
//         N | O | P | Q | R | S | T | U | V | W | X | Y | Z
LETTER: 'A'..'Z' | 'a'..'z'; 

// R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
DIGIT: '0'..'9';

//PARSER RULES

programName: NAME;

// R1402 program-stmt -> PROGRAM program-name
programStmt: PROGRAM programName;

typeName: NAME;

// R516 keyword -> name
keyword: NAME;

// R863 implicit-stmt -> IMPLICIT implicit-spec-list | IMPLICIT NONE [( [implicit-name-spec-list] )]
implicitStmt:
        IMPLICIT NONE;

// R709 kind-param -> digit-string | scalar-int-constant-name
kindParam: DIGITSTRING;

// R708 int-literal-constant -> digit-string [_ kind-param]
intLiteralConstant: DIGITSTRING (UNDERSCORE kindParam)?;

// R712 sign -> + | -
sign: PLUS | MINUS;

// R707 signed-int-literal-constant -> [sign] int-literal-constant
signedIntLiteralConstant: sign? intLiteralConstant;

// R1306 r -> int-literal-constant
r: intLiteralConstant;

// R1308 w -> int-literal-constant
w: intLiteralConstant;

// R1309 m -> int-literal-constant
m: intLiteralConstant;

// R1310 d -> int-literal-constant
d: intLiteralConstant;

// R1311 e -> int-literal-constant
e: intLiteralConstant;

// R1312 v -> signed-int-literal-constant
v: signedIntLiteralConstant;

vList: v (COMMA v)*;

// R724 char-literal-constant -> [kind-param _] ' [rep-char]... ' | [kind-param _] " [rep-char]... "
charLiteralConstant: 
        (kindParam UNDERSCORE)? APOSTROPHEREPCHAR
    | (kindParam UNDERSCORE)? QUOTEREPCHAR;

// R1307 data-edit-desc ->
//         I w [. m] | B w [. m] | O w [. m] | Z w [. m] | F w . d |
//         E w . d [E e] | EN w . d [E e] | ES w . d [E e] | EX w . d [E e] |
//         G w [. d [E e]] | L w | A [w] | D w . d |
//         DT [char-literal-constant] [( v-list )]
dataEditDesc:
    I w (DOT m)? |
    B w (DOT m)? |
    O w (DOT m)? |
    Z w (DOT m)? |
    F w DOT d |
    E w DOT d ( E e )? |
    EN w DOT d ( E e )? |
    ES w DOT d ( E e )? |
    EX w DOT d ( E e )? |
    G w (DOT d ( E e )?)? |
    L w |
    A w? |
    D w DOT d |
    DT charLiteralConstant? ( LPAREN vList RPAREN )?;

// R1304 format-item ->
//         [r] data-edit-desc | control-edit-desc | char-string-edit-desc | [r] ( format-items )
formatItem: r? dataEditDesc;

// R1303 format-items -> format-item [[,] format-item]...
formatItems: formatItem (COMMA? formatItem)*;

// R1305 unlimited-format-item -> * ( format-items )
unlimitedFormatItem: ASTERIK LPAREN formatItems RPAREN;

// R1302 format-specification ->
//         ( [format-items] ) | ( [format-items ,] unlimited-format-item )
formatSpecification:
    LPAREN formatItems? RPAREN |  LPAREN (formatItems COMMA)? unlimitedFormatItem  RPAREN;

// R1301 format-stmt -> FORMAT format-specification
formatStmt: FORMAT formatSpecification;

//R506 implicit-part-stmt -> implicit-stmt | parameter-stmt | format-stmt | entry-stmt
implicitPartStmt:
      implicitStmt
    | formatStmt;

//R505 implicit-part -> [implicit-part-stmt]... implicit-stmt
implicitPart: (implicitPartStmt)* implicitStmt;

//R504 specification-part -> [use-stmt]... [import-stmt]... [implicit-part]
// [declaration-construct]...
  specificationPart:
    (implicitPart)?;

// R1403 end-program-stmt -> END [PROGRAM [program-name]]
endProgramStmt: END (PROGRAM programName?)?;

// R1401 main-program ->
//         [program-stmt] [specification-part] [execution-part]
//         [internal-subprogram-part] end-program-stmt
///COMMENT: WHY ? after programStmt
  mainProgram:
      programStmt? specificationPart? endProgramStmt;

//R502 program-unit -> main-program | external-subprogram | module | submodule | block-data
programUnit:
    mainProgram;

//R501 program -> program-unit [program-unit]...    
program: programUnit (programUnit)*;      

Test File: FortranTest.f90

FORMAT(I 12)

Commands:

antlr4 FortranTestF18.g4
javac *.java
grun FortranTestF18 formatStmt -tokens FortranTest.f90

Grun Output:

[@0,0:5='FORMAT',<FORMAT>,1:0]
[@1,6:6='(',<'('>,1:6]
[@2,7:7='I',<NAME>,1:7]
[@3,9:10='12',<DIGITSTRING>,1:9]
[@4,11:11=')',<')'>,1:11]
[@5,12:11='<EOF>',<EOF>,1:12]
line 1:7 no viable alternative at input '(I'

Here the token 'I' is recognized as NAME, but I want it to be matched by the rule I: 'I';. If I move the lexer rule I above NAME, however, identifiers can no longer be named 'I'. How can I solve this problem?
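
For context, this behaviour follows from how the ANTLR4 lexer picks a rule: the longest match wins, and when two rules match the same length, the rule listed first in the grammar wins. The sketch below is plain Python, not ANTLR. The rule table and the `classify` helper are made up for illustration and cover only a few of the rules above, written as regular expressions.

```python
import re

# Hypothetical rule table in the same relative order as FortranTestF18.g4:
# NAME is listed before I.
RULES = [
    ("FORMAT", re.compile(r"FORMAT|Format|format")),
    ("NAME", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("DIGITSTRING", re.compile(r"[0-9]+")),
    ("I", re.compile(r"I")),
]


def classify(text, rules=RULES):
    """Return (rule_name, lexeme) for the token at the start of text."""
    best = None
    for name, pattern in rules:
        m = pattern.match(text)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# Same rules, but with I moved above NAME.
REORDERED = [RULES[0], RULES[3], RULES[1], RULES[2]]

print(classify("I 12"))             # ('NAME', 'I')
print(classify("I 12", REORDERED))  # ('I', 'I')
```

With the original order, a lone I always ties with NAME and loses. With I moved up, a lone I is always the I token, so it can never be an identifier. Rule order alone cannot express "I is an edit descriptor only inside a format specification".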


Solution

  • Here is the solution I implemented:

    I divided the grammar into separate lexer and parser grammars and introduced lexer modes. The split is needed because modes are only allowed in a lexer grammar, not in a combined grammar.

    Parser Grammar: FortranTestF18Parser.g4

    parser grammar FortranTestF18Parser;
    
    options { tokenVocab = FortranTestF18Lexer; }
    
    
    programName: NAME;
    
    // R1402 program-stmt -> PROGRAM program-name
    programStmt: PROGRAM programName;
    
    typeName: NAME;
    
    // R516 keyword -> name
    keyword: NAME;
    
    // R863 implicit-stmt -> IMPLICIT implicit-spec-list | IMPLICIT NONE [( [implicit-name-spec-list] )]
    implicitStmt:
            IMPLICIT NONE;
    
    // R709 kind-param -> digit-string | scalar-int-constant-name
    kindParam: DIGITSTRING;
    
    // R708 int-literal-constant -> digit-string [_ kind-param]
    intLiteralConstant: DIGITSTRING (UNDERSCORE kindParam)?;
    
    // R712 sign -> + | -
    sign: PLUS | MINUS;
    
    // R707 signed-int-literal-constant -> [sign] int-literal-constant
    signedIntLiteralConstant: sign? intLiteralConstant;
    
    // R1306 r -> int-literal-constant
    r: intLiteralConstant;
    
    // R1308 w -> int-literal-constant
    w: intLiteralConstant;
    
    // R1309 m -> int-literal-constant
    m: intLiteralConstant;
    
    // R1310 d -> int-literal-constant
    d: intLiteralConstant;
    
    // R1311 e -> int-literal-constant
    e: intLiteralConstant;
    
    // R1312 v -> signed-int-literal-constant
    v: signedIntLiteralConstant;
    
    vList: v (COMMA v)*;
    
    // R724 char-literal-constant -> [kind-param _] ' [rep-char]... ' | [kind-param _] " [rep-char]... "
    charLiteralConstant: 
            (kindParam UNDERSCORE)? APOSTROPHEREPCHAR
        | (kindParam UNDERSCORE)? QUOTEREPCHAR;
    
    // R1307 data-edit-desc ->
    //         I w [. m] | B w [. m] | O w [. m] | Z w [. m] | F w . d |
    //         E w . d [E e] | EN w . d [E e] | ES w . d [E e] | EX w . d [E e] |
    //         G w [. d [E e]] | L w | A [w] | D w . d |
    //         DT [char-literal-constant] [( v-list )]
    dataEditDesc:
        I w (DOT m)? |
        B w (DOT m)? |
        O w (DOT m)? |
        Z w (DOT m)? |
        F w DOT d |
        E w DOT d ( E e )? |
        EN w DOT d ( E e )? |
        ES w DOT d ( E e )? |
        EX w DOT d ( E e )? |
        G w (DOT d ( E e )?)? |
        L w |
        A w? |
        D w DOT d |
        DT charLiteralConstant? ( LPAREN vList RPAREN )?;
    
    // R1304 format-item ->
    //         [r] data-edit-desc | control-edit-desc | char-string-edit-desc | [r] ( format-items )
    formatItem: r? dataEditDesc;
    
    // R1303 format-items -> format-item [[,] format-item]...
    formatItems: formatItem (COMMA? formatItem)*;
    
    // R1305 unlimited-format-item -> * ( format-items )
    unlimitedFormatItem: ASTERIK LPAREN formatItems RPAREN;
    
    // R1302 format-specification ->
    //         ( [format-items] ) | ( [format-items ,] unlimited-format-item )
    formatSpecification:
      formatItems? |  (formatItems COMMA)? unlimitedFormatItem;
    
    // R1301 format-stmt -> FORMAT format-specification
    formatStmt: FORMATIN formatSpecification RPAREN;
    
    //R506 implicit-part-stmt -> implicit-stmt | parameter-stmt | format-stmt | entry-stmt
    implicitPartStmt:
          implicitStmt
        | formatStmt;
    
    //R505 implicit-part -> [implicit-part-stmt]... implicit-stmt
    implicitPart: (implicitPartStmt)* implicitStmt;
    
    //R504 specification-part -> [use-stmt]... [import-stmt]... [implicit-part]
    // [declaration-construct]...
      specificationPart:
        (implicitPart)?;
    
    // R1403 end-program-stmt -> END [PROGRAM [program-name]]
    endProgramStmt: END (PROGRAM programName?)?;
    
    // R1401 main-program ->
    //         [program-stmt] [specification-part] [execution-part]
    //         [internal-subprogram-part] end-program-stmt
    ///COMMENT: WHY ? after programStmt
      mainProgram:
          programStmt? specificationPart? endProgramStmt;
    
    //R502 program-unit -> main-program | external-subprogram | module | submodule | block-data
    programUnit:
        mainProgram;
    
    //R501 program -> program-unit [program-unit]...    
    program: programUnit (programUnit)*;      
    

    Lexer Grammar: FortranTestF18Lexer.g4

    lexer grammar FortranTestF18Lexer;
    
    // The caseInsensitive option requires ANTLR 4.10 or later.
    options { caseInsensitive=true; }
    
    
    LINE_COMMENT: '!' .*? '\r'? '\n' -> skip;
    
    BLOCK_COMMENT: '/*' .*? '*/' -> skip;
    
    SPACE: [ ] -> skip;
    
    WS: [\t\r\n]+ -> skip;
    
    FORMATIN : FORMAT (SPACE)* LPAREN -> pushMode(FORMAT_MODE);
    
    PROGRAM: 'PROGRAM';
    
    END: 'END';
    
    COMMA: ',';
    
    LPAREN: '(';
    
    RPAREN: ')';
    
    ASTERIK: '*';
    
    PLUS: '+';
    
    MINUS: '-';
    
    NONE: 'NONE';
    
    IMPLICIT: 'IMPLICIT';
    
    FORMAT: 'FORMAT';
    
    DOT: '.';
    
    APOSTROPHE: '\'';
    
    QUOTE: '"';
    
    // R602 UNDERSCORE -> _
    UNDERSCORE: '_';
    
    //R0003 RepChar
    APOSTROPHEREPCHAR: APOSTROPHE (~[\u0000-\u001F])*?  APOSTROPHE;
    
    QUOTEREPCHAR: QUOTE (~[\u0000-\u001F])*?  QUOTE;
    
    // R603 name -> letter [alphanumeric-character]...
    NAME: LETTER (ALPHANUMERICCHARACTER)*;
    
    // R711 digit-string -> digit [digit]...
    DIGITSTRING: DIGIT+; 
    
    // R601 alphanumeric-character -> letter | digit | underscore
    ALPHANUMERICCHARACTER: LETTER | DIGIT | UNDERSCORE;
    
    // R0002 Letter ->
    //         A | B | C | D | E | F | G | H | I | J | K | L | M |
    //         N | O | P | Q | R | S | T | U | V | W | X | Y | Z
    LETTER: 'A'..'Z'; 
    
    // R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    DIGIT: '0'..'9';
    
    
    ///FORMAT MODE
    
    mode FORMAT_MODE;
    
    FORMAT_LPAREN: '(' -> pushMode(FORMAT_MODE), type(LPAREN);
    
    FORMAT_RPAREN: ')' -> popMode, type(RPAREN);
    
    FORMAT_SPACE: ' ' -> skip;
    
    //R0003 RepChar
    FORMAT_APOSTROPHEREPCHAR: FORMAT_APOSTROPHE (~[\u0000-\u001F\u0027])*?  FORMAT_APOSTROPHE -> type(APOSTROPHEREPCHAR);
    
    FORMAT_QUOTEREPCHAR: FORMAT_QUOTE (~[\u0000-\u001F\u0022])*?  FORMAT_QUOTE -> type(QUOTEREPCHAR);
    
    FORMAT_APOSTROPHE: '\'' -> type(APOSTROPHE);
    
    FORMAT_QUOTE: '"' -> type(QUOTE);
    
    FORMAT_PLUS: '+' -> type(PLUS);
    
    FORMAT_MINUS: '-' -> type(MINUS);
    
    FORMAT_COMMA: ',' -> type(COMMA);
    
    FORMAT_ASTERIK: '*' -> type(ASTERIK);
    
    FORMAT_DOT: '.' -> type(DOT);
    
    // R711 digit-string -> digit [digit]...
    FORMAT_DIGITSTRING: FORMAT_DIGIT+  -> type(DIGITSTRING); 
    
    // R602 UNDERSCORE -> _
    FORMAT_UNDERSCORE: '_' -> type(UNDERSCORE);
    
    // R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    FORMAT_DIGIT: '0'..'9' -> type(DIGIT);
    
    I: 'I';
    
    B: 'B';
    
    O: 'O';
    
    Z: 'Z';
    
    F: 'F';
    
    E: 'E';
    
    EN: 'EN';
    
    ES: 'ES';
    
    EX: 'EX';
    
    G: 'G';
    
    L: 'L';
    
    A: 'A';
    
    D: 'D';
    
    DT: 'DT';
    

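    The lexer enters FORMAT_MODE when it sees FORMAT followed by ( and leaves it at the matching ). Nested parentheses push and pop the same mode, so the lexer only returns to the default mode after the outermost ). The edit-descriptor rules (I, B, O, ...) exist only inside FORMAT_MODE, so they are matched there, while NAME still matches an identifier such as I everywhere else. Note that FORMATIN already consumes the opening parenthesis, which is why formatStmt in the parser grammar contains RPAREN but no LPAREN.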
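
The mode switching can be pictured with a small simulation. This is a minimal sketch in plain Python, not ANTLR and not the generated lexer. The rule tables, the `tokenize` function and the "push"/"pop" action strings are hypothetical stand-ins for pushMode(FORMAT_MODE) and popMode, and only a handful of rules are included. As a simplification, it skips all whitespace in both modes.

```python
import re

# Hypothetical, simplified rule tables: (token type, regex, action).
DEFAULT_RULES = [
    ("FORMATIN", re.compile(r"FORMAT *\(", re.I), "push"),
    ("NAME", re.compile(r"[A-Z][A-Z0-9_]*", re.I), None),
    ("DIGITSTRING", re.compile(r"[0-9]+"), None),
    ("LPAREN", re.compile(r"\("), None),
    ("RPAREN", re.compile(r"\)"), None),
]
FORMAT_RULES = [
    ("LPAREN", re.compile(r"\("), "push"),   # nested group: push the mode again
    ("RPAREN", re.compile(r"\)"), "pop"),    # leave one nesting level
    ("COMMA", re.compile(r","), None),
    ("DOT", re.compile(r"\."), None),
    ("DIGITSTRING", re.compile(r"[0-9]+"), None),
    ("I", re.compile(r"I", re.I), None),
    ("F", re.compile(r"F", re.I), None),
]
MODES = {"DEFAULT": DEFAULT_RULES, "FORMAT": FORMAT_RULES}


def tokenize(text):
    """Return a list of (token type, lexeme) pairs using a mode stack."""
    stack = ["DEFAULT"]          # the top of the stack is the current mode
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():  # simplification: skip whitespace in every mode
            pos += 1
            continue
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end(), action)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, end, action = best
        tokens.append((name, text[pos:end]))
        if action == "push":
            stack.append("FORMAT")
        elif action == "pop":
            stack.pop()
        pos = end
    return tokens


print(tokenize("FORMAT(I 12)"))
print(tokenize("I"))
```

In this sketch the first call classifies I as the edit descriptor I, because the lexer is in the format mode at that point, while the second call classifies the same letter as NAME, because the default mode has no I rule at all.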