pythonparsinglexersly

Sly parser: condition rule not recalled


Hello I am building my first programming language with sly module in python. I was following some tutorials and I have a question about how the parser behaves. This is the code:

from sly import Lexer
from sly import Parser

class BasicLexer(Lexer):
    tokens = { NAME, NUMBER, STRING, IF, THEN, ELSE, FOR, FUN, TO, ARROW, EQEQ }
    ignore = '\t '

    literals = { '=', '+', '-', '/', '*', '(', ')', ',', ';' }

    # Define tokens
    IF = r'IF'
    THEN = r'THEN'
    ELSE = r'ELSE'
    FOR = r'FOR'
    FUN = r'FUN'
    TO = r'TO'
    ARROW = r'->'
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    STRING = r'\".*?\"'

    EQEQ = r'=='

    @_(r'\d+')
    def NUMBER(self, t):
        t.value = int(t.value)
        return t

    @_(r'#.*')
    def COMMENT(self, t):
        pass

    @_(r'\n+')
    def newline(self,t ):
        self.lineno = t.value.count('\n')

class BasicParser(Parser):
    tokens = BasicLexer.tokens

    precedence = (
        ('left', '+', '-'),
        ('left', '*', '/'),
        ('right', 'UMINUS'),
        )

    def __init__(self):
        self.env = { }
    @_('')
    def statement(self, p):
        pass

    @_('FOR var_assign TO expr THEN statement')
    def statement(self, p):
        return ('for_loop', ('for_loop_setup', p.var_assign, p.expr), p.statement)

    @_('IF condition THEN statement ELSE statement')
    def statement(self, p):
        return ('if_stmt', p.condition, ('branch', p.statement0, p.statement1))

    @_('FUN NAME "(" ")" ARROW statement')
    def statement(self, p):
        return ('fun_def', p.NAME, p.statement)

    @_('NAME "(" ")"')
    def statement(self, p):
        return ('fun_call', p.NAME)

    @_('expr EQEQ expr')
    def condition(self, p):
        return ('condition_eqeq', p.expr0, p.expr1)

    @_('var_assign')
    def statement(self, p):
        return p.var_assign

    @_('NAME "=" expr')
    def var_assign(self, p):
        return ('var_assign', p.NAME, p.expr)

    @_('NAME "=" STRING')
    def var_assign(self, p):
        return ('var_assign', p.NAME, p.STRING)

    @_('expr')
    def statement(self, p):
        return (p.expr)

    @_('expr "+" expr')
    def expr(self, p):
        return ('add', p.expr0, p.expr1)

    @_('expr "-" expr')
    def expr(self, p):
        return ('sub', p.expr0, p.expr1)

    @_('expr "*" expr')
    def expr(self, p):
        return ('mul', p.expr0, p.expr1)

    @_('expr "/" expr')
    def expr(self, p):
        return ('div', p.expr0, p.expr1)

    @_('"-" expr %prec UMINUS')
    def expr(self, p):
        return ('neg', p.expr)

    @_('NAME')
    def expr(self, p):
        return ('var', p.NAME)

    @_('NUMBER')
    def expr(self, p):
        return ('num', p.NUMBER)

if __name__ == '__main__':
    lexer = BasicLexer()
    parser = BasicParser()
    env = {}
    while True:
        try:
            text = input('basic > ')
        except EOFError:
            break
        if text:
            tree = parser.parse(lexer.tokenize(text))
            print(tree)

From what I understand, the condition rule is non terminal and the BNF representation is: condition : expr EQEQ expr. If I give as input "a == b" I get the error sly: Syntax error at line 1, token=EQEQ ('var', 'b'). I know this is the wanted behavior as there's no meaning in giving this input outside a statement, but I want to understand why the condition method is not recalled. Thanks!


Solution

  • Although we call this technique bottom-up parsing, it's really still top-down in the sense that the grammar is applied starting with what will be the root of the parse, which is the start symbol, on this case statement.

    What makes the parse "bottom-up" is the construction of the parse tree. Unlike top-down parsing, in which the next tree node is immediately predicted (and added to the parse tree) before the child nodes are parsed, the bottom-up parser first parses the child nodes and only once all the children have been reduced does it add the parent node. Thus, the root node is the last thing added to the tree; the tree has been built from the bottom up. [Note 1]

    This gives the parser a lot more flexibility because it doesn't need to know for certain which production to predict. It can predict a number of possibilities and explore them on parallel. (This can be done in linear time using a state machine.)

    In effect, condition only becomes relevant when it might be the next thing to parse. In your grammar, that only happens after an IF token, so it doesn't apply to your sample input.

    Here's a quick summary of the parsing process. (If you haven't been exposed to finite state machines, this might not be so easy to follow.) It's useful to know that:

    With your grammar, the state machine's starting state is the itemset →•:

    START      → • statement $
    statement  → •                 [ $ ]
    statement  → • FOR var_assign TO expr THEN statement
    statement  → • IF condition THEN statement ELSE statement
    statement  → • FUN NAME "(" ")" ARROW statement
    statement  → • NAME "(" ")"
    statement  → • expr
    statement  → • var_assign
    expr       → • expr "+" expr
    expr       → • expr "-" expr
    expr       → • expr "*" expr
    expr       → • expr "/" expr
    expr       → • "-" expr
    expr       → • NAME
    expr       → • NUMBER
    var_assign → • NAME "=" expr
    var_assign → • NAME "=" STRING
    

    Note that no production for condition is in this itemset, because that cannot be at the beginning of a sentence generated from the start symbol.

    The first input token is a, which is a NAME. That disallows the reduction of statement → •, so the next state consists of the items with NAME following the marker:

    statement  → NAME • "(" ")"
    expr       → NAME •            [ "+", "-", "*", "/", ELSE, $ ]
    var_assign → NAME • "=" expr
    var_assign → NAME • "=" STRING
    

    Now the next token is EQEQ. At this point, the state machine is stuck. No shift action is possible and the only reduce action does not have == in its lookahead set. So a syntax error is produced.


    Notes

    1. Confusingly, mathematical trees, unlike the trees outside your window, have the root at the top and the leaves on the bottom. Also, it's possible that the tree isn't really built; that's up to you. But there is always an implicit tree defined by the order the action functions are executed: children first, finally the parent.