python antlr4

Trouble resolving an ambiguity with a predicate in the Python version of ANTLR4, while the Java version seems to work


NOTE: given the answer, my way of using the predicate was wrong. Hence, the title is misleading.

I'm writing a parser in Python for an ancient language. The language contains syntax such as

want <decimal>
need <hex>

It is ambiguous whether a run of digits should be read as decimal or hex. Given the context of "want" and "need", I can predict the integer format. Of course, the language is a bit more involved than this example. I first wrote the .g4 file with predicates in Java, since that let me develop the grammar and validate its correctness with antlr4-parse. However, after I ported the working .g4 file to Python predicates, I got a parse error.

test.g4

grammar test;

@parser::members {
def setExpectDec(self, value):
    self.context.expectDec = value

def setExpectHex(self, value):
    self.context.expectHex = value

}


testFile: (command)* EOF;

command: 
    want
    | need;

want: 
    'want' length;
length: 
    {self.setExpectDec(True)} numberExpr {self.setExpectDec(False)};
need:
    'need' size;
size: 
    {self.setExpectHex(True)} numberExpr {self.setExpectHex(False)};

numberExpr:
    {self.context.expectHex}? NUMBER_HEX
    | {self.context.expectDec}? NUMBER_DEC;

NUMBER_HEX: 
    [0-9a-f]+;
NUMBER_DEC: 
    [0-9]+;
NEWLINE: 
    '\r'? '\n' -> skip;
WS: 
    [ \t]+ -> skip; // skip spaces and tabs

I ran this command to generate .py code:

antlr4 -Dlanguage=Python3 test.g4 -visitor -no-listener

test.py

import sys
from antlr4 import FileStream, InputStream, CommonTokenStream
from testLexer import testLexer
from testParser import testParser

class ParserContext:
    def __init__(self):
        # A flag to indicate if the next number should be interpreted as decimal.
        self.expectDec = False
        # A flag to indicate if the next number should be interpreted as hexadecimal.
        self.expectHex = False

class TestReader:
    def parse_and_eval(self, testStream) -> None:
        context = ParserContext()
        lexer = testLexer(testStream)
        lexer.context = context
        tokenStream = CommonTokenStream(lexer)
        parser = testParser(tokenStream)
        parser.context = context
        tree = parser.testFile()


def main():
    if len(sys.argv) > 1:
        testStream = FileStream(sys.argv[1])
    else:
        testStream = InputStream(sys.stdin.readline())

    testReader = TestReader()
    testReader.parse_and_eval(testStream)

if __name__ == '__main__':
    main()

input.txt

want 10
need a

Error:

line 1:5 no viable alternative at input '10'

Stepping into the generated parser script with the debugger, I can see that the flag values are set correctly in this function:

<__main__.ParserContext object at 0x7fa8ff0023d0>
special variables
expectDec = True
expectHex = False

    def numberExpr_sempred(self, localctx:NumberExprContext, predIndex:int):
            if predIndex == 0:
                return self.context.expectHex
         

            if predIndex == 1:
                return self.context.expectDec
         

Once I step out of this function, I get the exception shown below. It seems to be thrown from some native code that is not visible to the Python debugger. Should I expect numberExpr_sempred() to be called a second time with predIndex = 1, so that it returns True? Unfortunately, I don't have a Java development environment set up to debug the Java code, although the generated Java code has a very similar structure.

Exception has occurred: NoViableAltException
None
  File "sdm_atb_testParser.py", line 393, in numberExpr
    la_ = self._interp.adaptivePredict(self._input,2,self._ctx)
  File "sdm_atb_testParser.py", line 266, in length
    self.numberExpr()
  File "sdm_atb_testParser.py", line 225, in want
    self.length()
  File "sdm_atb_testParser.py", line 174, in command
    self.want()
  File "sdm_atb_testParser.py", line 120, in testFile
    self.command()
  File "sdm_atb_test.py", line 21, in parse_and_eval
    tree = parser.testFile()
  File "sdm_atb_test.py", line 32, in main
    testReader.parse_and_eval(testStream)
  File "sdm_atb_test.py", line 35, in <module>
    main()
antlr4.error.Errors.NoViableAltException: None

I expect input.txt to parse successfully, without a syntax error.


Solution

  • When writing the rules:

    NUMBER_HEX: 
        [0-9a-f]+;
    
    NUMBER_DEC: 
        [0-9]+;
    

    the lexer will never produce a NUMBER_DEC token. ANTLR's lexer works like this:

    1. find the lexer rule that consumes the most characters
    2. if step 1 yields two (or more) rules that match the same number of characters, let the rule defined first "win"

    Input like 1234 is matched by both NUMBER_HEX and NUMBER_DEC, and NUMBER_HEX is defined first, so it will "win".
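The two-step selection above can be imitated in a few lines of plain Python. This is only a rough sketch built on the `re` module, not ANTLR's actual implementation. The helper `next_token` and the two rule lists are hypothetical names made up for the illustration.

```python
import re

def next_token(rules, text, pos=0):
    """Pick a token roughly the way the two steps above describe.

    `rules` is an ordered list of (name, regex) pairs; the list order
    stands in for the order of the lexer rules in the grammar.
    Hypothetical helper, not part of the ANTLR runtime.
    """
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if not m or m.end() == pos:
            continue
        # Only a strictly longer match replaces the current best, so on
        # a tie the rule that was defined first is kept.
        if best is None or len(m.group()) > len(best[1]):
            best = (name, m.group())
    return best

# Rule order as in the question: NUMBER_HEX before NUMBER_DEC.
ORIGINAL_ORDER = [("NUMBER_HEX", r"[0-9a-f]+"), ("NUMBER_DEC", r"[0-9]+")]
# Same rules with NUMBER_DEC first.
SWAPPED_ORDER = list(reversed(ORIGINAL_ORDER))

print(next_token(ORIGINAL_ORDER, "10"))  # ('NUMBER_HEX', '10')
print(next_token(SWAPPED_ORDER, "10"))   # ('NUMBER_DEC', '10')
print(next_token(SWAPPED_ORDER, "1f"))   # ('NUMBER_HEX', '1f')
```

With the original order, plain digits always come out as NUMBER_HEX, which is why the NUMBER_DEC alternative in numberExpr can never match. With the order swapped, digits-only input becomes NUMBER_DEC, while anything containing a-f is still NUMBER_HEX because it is the longer match.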

    You'd need to place the rules like this:

    NUMBER_DEC: 
        [0-9]+;
    
    NUMBER_HEX: 
        [0-9a-f]+;
    

    and then add a parser rule like this:

    hex
     : NUMBER_HEX
     | NUMBER_DEC
     ;
    

    and use it in your other parser rules:

    testFile
     : command* EOF
     ;
    
    command
     : want
     | need
     ;
    
    want
     : 'want' NUMBER_DEC
     ;
    
    need
     : 'need' hex
     ;
    
    hex
     : NUMBER_HEX
     | NUMBER_DEC
     ;
    
    NUMBER_DEC : [0-9]+;
    NUMBER_HEX : [0-9a-f]+;
    NEWLINE    : '\r'? '\n' -> skip;
    WS         : [ \t]+ -> skip;
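
Note that with this grammar the token type alone no longer tells you the base: `need 10` produces a NUMBER_DEC token inside `hex`, yet it should presumably be read as 0x10. The base has to come from the enclosing rule instead. A minimal sketch of that conversion follows, assuming a hypothetical helper `number_value` (not part of ANTLR or of the original post) that a visitor could call with the command keyword and the token text:

```python
def number_value(keyword: str, text: str) -> int:
    """Convert token text to an int, choosing the base from the command.

    Hypothetical helper: 'want' is assumed to take a decimal number and
    'need' a hexadecimal one, as described in the question.
    """
    bases = {"want": 10, "need": 16}
    if keyword not in bases:
        raise ValueError(f"unknown command: {keyword!r}")
    return int(text, bases[keyword])

print(number_value("want", "10"))  # 10
print(number_value("need", "10"))  # 16
print(number_value("need", "a"))   # 10
```

In a generated visitor, the keyword would be implied by which method is running (for example the one for `want` versus the one for `need`), so the lookup table could be replaced by a fixed base per method.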