I am developing a Fortran 2018 grammar in ANTLR4 based on the ISO standard. I am running into a problem in the lexing phase: certain keywords, such as the single-letter edit descriptor I in a FORMAT statement, are classified as NAME instead of as their own token. Below is a minimal grammar demonstrating the problem:
Grammar: FortranTestF18.g4
grammar FortranTestF18;
//LEXER RULES
LINE_COMMENT : '!' .*? '\r'? '\n' -> skip ;
BLOCK_COMMENT: '/*' .*? '*/' -> skip;
WS: [ \t\r\n]+ -> skip;
PROGRAM: 'PROGRAM' | 'Program' | 'program';
END: 'END' | 'End' | 'end';
COMMA: ',';
LPAREN: '(';
RPAREN: ')';
ASTERIK: '*';
NONE: 'NONE' | 'None' | 'none';
IMPLICIT: 'IMPLICIT' | 'Implicit' | 'implicit';
FORMAT: 'FORMAT' | 'Format' | 'format';
PLUS: '+';
// R765 binary-constant -> B ' digit [digit]... ' | B " digit [digit]... "
BINARYCONSTANT: B APOSTROPHE DIGIT+ APOSTROPHE | B QUOTE DIGIT+ QUOTE;
// R766 octal-constant -> O ' digit [digit]... ' | O " digit [digit]... "
OCTALCONSTANT: O APOSTROPHE DIGIT+ APOSTROPHE | O QUOTE DIGIT+ QUOTE;
//R0003 RepChar
APOSTROPHEREPCHAR: APOSTROPHE (~[\u0000-\u001F\u0027])* APOSTROPHE;
QUOTEREPCHAR: QUOTE (~[\u0000-\u001F\u0022])* QUOTE;
APOSTROPHE: '\'';
QUOTE: '"';
DOT: '.';
C: 'C';
// R603 name -> letter [alphanumeric-character]...
NAME: LETTER (ALPHANUMERICCHARACTER)*;
// R711 digit-string -> digit [digit]...
DIGITSTRING: DIGIT+;
MINUS: '-';
B: 'B';
O: 'O';
Z: 'Z';
A: 'A';
F: 'F';
D: 'D';
E: 'E';
I: 'I';
G: 'G';
L: 'L';
DT: 'DT';
EN: 'EN';
ES: 'ES';
EX: 'EX';
T: 'T';
TL: 'TL';
TR: 'TR';
X: 'X';
SS: 'SS';
SP: 'SP';
S: 'S';
BN: 'BN';
BZ: 'BZ';
RU: 'RU';
RD: 'RD';
RZ: 'RZ';
RN: 'RN';
RC: 'RC';
RP: 'RP';
DC: 'DC';
DP: 'DP';
P: 'P';
// R602 UNDERSCORE -> _
UNDERSCORE: '_';
// R601 alphanumeric-character -> letter | digit | underscore
ALPHANUMERICCHARACTER: LETTER | DIGIT | UNDERSCORE;
// R0002 Letter ->
// A | B | C | D | E | F | G | H | I | J | K | L | M |
// N | O | P | Q | R | S | T | U | V | W | X | Y | Z
LETTER: 'A'..'Z' | 'a'..'z';
// R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
DIGIT: '0'..'9';
//PARSER RULES
programName: NAME;
// R1402 program-stmt -> PROGRAM program-name
programStmt: PROGRAM programName;
typeName: NAME;
// R516 keyword -> name
keyword: NAME;
// R863 implicit-stmt -> IMPLICIT implicit-spec-list | IMPLICIT NONE [( [implicit-name-spec-list] )]
implicitStmt:
IMPLICIT NONE;
// R709 kind-param -> digit-string | scalar-int-constant-name
kindParam: DIGITSTRING;
// R708 int-literal-constant -> digit-string [_ kind-param]
intLiteralConstant: DIGITSTRING (UNDERSCORE kindParam)?;
// R712 sign -> + | -
sign: PLUS | MINUS;
// R707 signed-int-literal-constant -> [sign] int-literal-constant
signedIntLiteralConstant: sign? intLiteralConstant;
// R1306 r -> int-literal-constant
r: intLiteralConstant;
// R1308 w -> int-literal-constant
w: intLiteralConstant;
// R1309 m -> int-literal-constant
m: intLiteralConstant;
// R1310 d -> int-literal-constant
d: intLiteralConstant;
// R1311 e -> int-literal-constant
e: intLiteralConstant;
// R1312 v -> signed-int-literal-constant
v: signedIntLiteralConstant;
vList: v (COMMA v)*;
// R724 char-literal-constant -> [kind-param _] ' [rep-char]... ' | [kind-param _] " [rep-char]... "
charLiteralConstant:
(kindParam UNDERSCORE)? APOSTROPHEREPCHAR
| (kindParam UNDERSCORE)? QUOTEREPCHAR;
// R1307 data-edit-desc ->
// I w [. m] | B w [. m] | O w [. m] | Z w [. m] | F w . d |
// E w . d [E e] | EN w . d [E e] | ES w . d [E e] | EX w . d [E e] |
// G w [. d [E e]] | L w | A [w] | D w . d |
// DT [char-literal-constant] [( v-list )]
dataEditDesc:
I w (DOT m)? |
B w (DOT m)? |
O w (DOT m)? |
Z w (DOT m)? |
F w DOT d |
E w DOT d ( E e )? |
EN w DOT d ( E e )? |
ES w DOT d ( E e )? |
EX w DOT d ( E e )? |
G w (DOT d ( E e )?)? |
L w |
A w? |
D w DOT d |
DT charLiteralConstant? ( LPAREN vList RPAREN )?;
// R1304 format-item ->
// [r] data-edit-desc | control-edit-desc | char-string-edit-desc | [r] ( format-items )
formatItem: r? dataEditDesc;
// R1303 format-items -> format-item [[,] format-item]...
formatItems: formatItem (COMMA? formatItem)*;
// R1305 unlimited-format-item -> * ( format-items )
unlimitedFormatItem: ASTERIK LPAREN formatItems RPAREN;
// R1302 format-specification ->
// ( [format-items] ) | ( [format-items ,] unlimited-format-item )
formatSpecification:
LPAREN formatItems? RPAREN | LPAREN (formatItems COMMA)? unlimitedFormatItem RPAREN;
// R1301 format-stmt -> FORMAT format-specification
formatStmt: FORMAT formatSpecification;
//R506 implicit-part-stmt -> implicit-stmt | parameter-stmt | format-stmt | entry-stmt
implicitPartStmt:
implicitStmt
| formatStmt;
//R505 implicit-part -> [implicit-part-stmt]... implicit-stmt
implicitPart: (implicitPartStmt)* implicitStmt;
//R504 specification-part -> [use-stmt]... [import-stmt]... [implicit-part]
// [declaration-construct]...
specificationPart:
(implicitPart)?;
// R1403 end-program-stmt -> END [PROGRAM [program-name]]
endProgramStmt: END (PROGRAM programName?)?;
// R1401 main-program ->
// [program-stmt] [specification-part] [execution-part]
// [internal-subprogram-part] end-program-stmt
///COMMENT: WHY ? after programStmt
mainProgram:
programStmt? specificationPart? endProgramStmt;
//R502 program-unit -> main-program | external-subprogram | module | submodule | block-data
programUnit:
mainProgram;
//R501 program -> program-unit [program-unit]...
program: programUnit (programUnit)*;
Test File: FortranTest.f90
FORMAT(I 12)
Commands:
antlr4 FortranTestF18.g4
javac *.java
grun FortranTestF18 formatStmt -tokens FortranTest.f90
Grun Output:
[@0,0:5='FORMAT',<FORMAT>,1:0]
[@1,6:6='(',<'('>,1:6]
[@2,7:7='I',<NAME>,1:7]
[@3,9:10='12',<DIGITSTRING>,1:9]
[@4,11:11=')',<')'>,1:11]
[@5,12:11='<EOF>',<EOF>,1:12]
line 1:7 no viable alternative at input '(I'
Here the input I is recognized as NAME, but I want it to be recognized as the token defined by I: 'I';. If I move the lexer rule I above NAME, the edit descriptor is recognized, but then an identifier can no longer be named I. How do I solve this problem?
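Both halves of this dilemma follow from how an ANTLR4 lexer chooses between rules: it takes the rule that matches the longest stretch of input, and when several rules match the same length, the one defined first wins. The following is only a minimal sketch of that behavior; the grammar name TieBreakDemo is hypothetical and not part of the grammar above.

```antlr
lexer grammar TieBreakDemo;   // hypothetical name, for illustration only

// Rule 1: the longest match wins.
//   Input "IX": I matches 1 character, NAME matches 2, so the token is NAME.
// Rule 2: on a tie in length, the rule defined first wins.
//   Input "I": I and NAME both match 1 character. Because I is listed
//   first here, the token is I, and no identifier can ever be named "I".
//   With the two rules swapped (as in the grammar above), the token is
//   always NAME and the I rule can never match.
I    : 'I' ;
NAME : [A-Za-z] [A-Za-z0-9_]* ;
WS   : [ \t\r\n]+ -> skip ;
```

Since the lexer decides without knowing what the parser expects, reordering rules alone cannot make I an edit descriptor in one place and an identifier in another.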
Here is the solution I implemented:
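I divided the grammar into separate lexer and parser grammars and introduced lexer modes. The lexer also sets caseInsensitive=true (available in ANTLR 4.10 and later) instead of listing each capitalization of every keyword.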
Parser Grammar: FortranTestF18Parser.g4
parser grammar FortranTestF18Parser;
options { tokenVocab = FortranTestF18Lexer; }
programName: NAME;
// R1402 program-stmt -> PROGRAM program-name
programStmt: PROGRAM programName;
typeName: NAME;
// R516 keyword -> name
keyword: NAME;
// R863 implicit-stmt -> IMPLICIT implicit-spec-list | IMPLICIT NONE [( [implicit-name-spec-list] )]
implicitStmt:
IMPLICIT NONE;
// R709 kind-param -> digit-string | scalar-int-constant-name
kindParam: DIGITSTRING;
// R708 int-literal-constant -> digit-string [_ kind-param]
intLiteralConstant: DIGITSTRING (UNDERSCORE kindParam)?;
// R712 sign -> + | -
sign: PLUS | MINUS;
// R707 signed-int-literal-constant -> [sign] int-literal-constant
signedIntLiteralConstant: sign? intLiteralConstant;
// R1306 r -> int-literal-constant
r: intLiteralConstant;
// R1308 w -> int-literal-constant
w: intLiteralConstant;
// R1309 m -> int-literal-constant
m: intLiteralConstant;
// R1310 d -> int-literal-constant
d: intLiteralConstant;
// R1311 e -> int-literal-constant
e: intLiteralConstant;
// R1312 v -> signed-int-literal-constant
v: signedIntLiteralConstant;
vList: v (COMMA v)*;
// R724 char-literal-constant -> [kind-param _] ' [rep-char]... ' | [kind-param _] " [rep-char]... "
charLiteralConstant:
(kindParam UNDERSCORE)? APOSTROPHEREPCHAR
| (kindParam UNDERSCORE)? QUOTEREPCHAR;
// R1307 data-edit-desc ->
// I w [. m] | B w [. m] | O w [. m] | Z w [. m] | F w . d |
// E w . d [E e] | EN w . d [E e] | ES w . d [E e] | EX w . d [E e] |
// G w [. d [E e]] | L w | A [w] | D w . d |
// DT [char-literal-constant] [( v-list )]
dataEditDesc:
I w (DOT m)? |
B w (DOT m)? |
O w (DOT m)? |
Z w (DOT m)? |
F w DOT d |
E w DOT d ( E e )? |
EN w DOT d ( E e )? |
ES w DOT d ( E e )? |
EX w DOT d ( E e )? |
G w (DOT d ( E e )?)? |
L w |
A w? |
D w DOT d |
DT charLiteralConstant? ( LPAREN vList RPAREN )?;
// R1304 format-item ->
// [r] data-edit-desc | control-edit-desc | char-string-edit-desc | [r] ( format-items )
formatItem: r? dataEditDesc;
// R1303 format-items -> format-item [[,] format-item]...
formatItems: formatItem (COMMA? formatItem)*;
// R1305 unlimited-format-item -> * ( format-items )
unlimitedFormatItem: ASTERIK LPAREN formatItems RPAREN;
// R1302 format-specification ->
// ( [format-items] ) | ( [format-items ,] unlimited-format-item )
formatSpecification:
formatItems? | (formatItems COMMA)? unlimitedFormatItem;
// R1301 format-stmt -> FORMAT format-specification
formatStmt: FORMATIN formatSpecification RPAREN;
//R506 implicit-part-stmt -> implicit-stmt | parameter-stmt | format-stmt | entry-stmt
implicitPartStmt:
implicitStmt
| formatStmt;
//R505 implicit-part -> [implicit-part-stmt]... implicit-stmt
implicitPart: (implicitPartStmt)* implicitStmt;
//R504 specification-part -> [use-stmt]... [import-stmt]... [implicit-part]
// [declaration-construct]...
specificationPart:
(implicitPart)?;
// R1403 end-program-stmt -> END [PROGRAM [program-name]]
endProgramStmt: END (PROGRAM programName?)?;
// R1401 main-program ->
// [program-stmt] [specification-part] [execution-part]
// [internal-subprogram-part] end-program-stmt
///COMMENT: WHY ? after programStmt
mainProgram:
programStmt? specificationPart? endProgramStmt;
//R502 program-unit -> main-program | external-subprogram | module | submodule | block-data
programUnit:
mainProgram;
//R501 program -> program-unit [program-unit]...
program: programUnit (programUnit)*;
Lexer Grammar: FortranTestF18Lexer.g4
lexer grammar FortranTestF18Lexer;
options { caseInsensitive=true; }
LINE_COMMENT: '!' .*? '\r'? '\n' -> skip;
BLOCK_COMMENT: '/*' .*? '*/' -> skip;
SPACE: [ ] -> skip;
WS: [\t\r\n]+ -> skip;
FORMATIN : FORMAT (SPACE)* LPAREN -> pushMode(FORMAT_MODE);
PROGRAM: 'PROGRAM';
END: 'END';
COMMA: ',';
LPAREN: '(';
RPAREN: ')';
ASTERIK: '*';
PLUS: '+';
MINUS: '-';
NONE: 'NONE';
IMPLICIT: 'IMPLICIT';
FORMAT: 'FORMAT';
DOT: '.';
APOSTROPHE: '\'';
QUOTE: '"';
// R602 UNDERSCORE -> _
UNDERSCORE: '_';
//R0003 RepChar
APOSTROPHEREPCHAR: APOSTROPHE (~[\u0000-\u001F])*? APOSTROPHE;
QUOTEREPCHAR: QUOTE (~[\u0000-\u001F])*? QUOTE;
// R603 name -> letter [alphanumeric-character]...
NAME: LETTER (ALPHANUMERICCHARACTER)*;
// R711 digit-string -> digit [digit]...
DIGITSTRING: DIGIT+;
// R601 alphanumeric-character -> letter | digit | underscore
ALPHANUMERICCHARACTER: LETTER | DIGIT | UNDERSCORE;
// R0002 Letter ->
// A | B | C | D | E | F | G | H | I | J | K | L | M |
// N | O | P | Q | R | S | T | U | V | W | X | Y | Z
LETTER: 'A'..'Z';
// R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
DIGIT: '0'..'9';
///FORMAT MODE
mode FORMAT_MODE;
FORMAT_LPAREN: '(' -> pushMode(FORMAT_MODE), type(LPAREN);
FORMAT_RPAREN: ')' -> popMode, type(RPAREN);
FORMAT_SPACE: ' ' -> skip;
//R0003 RepChar
FORMAT_APOSTROPHEREPCHAR: FORMAT_APOSTROPHE (~[\u0000-\u001F\u0027])*? FORMAT_APOSTROPHE -> type(APOSTROPHEREPCHAR);
FORMAT_QUOTEREPCHAR: FORMAT_QUOTE (~[\u0000-\u001F\u0022])*? FORMAT_QUOTE -> type(QUOTEREPCHAR);
FORMAT_APOSTROPHE: '\'' -> type(APOSTROPHE);
FORMAT_QUOTE: '"' -> type(QUOTE);
FORMAT_PLUS: '+' -> type(PLUS);
FORMAT_MINUS: '-' -> type(MINUS);
FORMAT_COMMA: ',' -> type(COMMA);
FORMAT_ASTERIK: '*' -> type(ASTERIK);
FORMAT_DOT: '.' -> type(DOT);
// R711 digit-string -> digit [digit]...
FORMAT_DIGITSTRING: FORMAT_DIGIT+ -> type(DIGITSTRING);
// R602 UNDERSCORE -> _
FORMAT_UNDERSCORE: '_' -> type(UNDERSCORE);
// R0001 Digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
FORMAT_DIGIT: '0'..'9' -> type(DIGIT);
I: 'I';
B: 'B';
O: 'O';
Z: 'Z';
F: 'F';
E: 'E';
EN: 'EN';
ES: 'ES';
EX: 'EX';
G: 'G';
L: 'L';
A: 'A';
D: 'D';
DT: 'DT';
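By introducing a lexer mode for FORMAT statements, entering it when FORMAT ( is encountered and leaving it at the matching ), I was able to classify tokens correctly within the format context and resolve the misclassification issue. Because FORMATIN consumes both the keyword and the opening parenthesis, formatStmt in the parser grammar now starts with FORMATIN and matches only the closing RPAREN itself.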
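To make the mode switching easier to follow, here is a reduced sketch of the same idea with a token trace in the comments. The names ModeStackDemo, FMT, ASTERISK, DIGITS and the FMT_ prefixes are assumptions made for this illustration, and the sketch omits the caseInsensitive option, so it only handles upper-case input.

```antlr
lexer grammar ModeStackDemo;   // hypothetical name, reduced from the lexer above

tokens { LPAREN, RPAREN }      // token types assigned via type(...)

// Default mode: 'FORMAT' followed by '(' enters the format context.
FORMATIN : 'FORMAT' [ ]* '(' -> pushMode(FMT) ;
NAME     : [A-Za-z] [A-Za-z0-9_]* ;
WS       : [ \t\r\n]+ -> skip ;

mode FMT;
// A nested '(' pushes FMT again, so every ')' pops exactly one level.
FMT_LPAREN : '(' -> pushMode(FMT), type(LPAREN) ;
FMT_RPAREN : ')' -> popMode, type(RPAREN) ;
ASTERISK   : '*' ;
I          : 'I' ;             // no NAME rule in this mode, so 'I' is unambiguous
DIGITS     : [0-9]+ ;
FMT_WS     : [ ]+ -> skip ;

// Trace for the input  FORMAT(*(I3))
//   FORMAT(  -> FORMATIN   nesting depth 1 (now in FMT)
//   *        -> ASTERISK
//   (        -> LPAREN     nesting depth 2
//   I        -> I
//   3        -> DIGITS
//   )        -> RPAREN     nesting depth 1
//   )        -> RPAREN     nesting depth 0 (back in the default mode)
```

If the nested ( did not push the mode again, the first ) would already return the lexer to the default mode, and anything between the inner ) and the outer ) would be lexed with the default-mode rules.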