Hello I am building my first programming language with sly module in python. I was following some tutorials and I have a question about how the parser behaves. This is the code:
from sly import Lexer
from sly import Parser
class BasicLexer(Lexer):
tokens = { NAME, NUMBER, STRING, IF, THEN, ELSE, FOR, FUN, TO, ARROW, EQEQ }
ignore = '\t '
literals = { '=', '+', '-', '/', '*', '(', ')', ',', ';' }
# Define tokens
IF = r'IF'
THEN = r'THEN'
ELSE = r'ELSE'
FOR = r'FOR'
FUN = r'FUN'
TO = r'TO'
ARROW = r'->'
NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
STRING = r'\".*?\"'
EQEQ = r'=='
@_(r'\d+')
def NUMBER(self, t):
t.value = int(t.value)
return t
@_(r'#.*')
def COMMENT(self, t):
pass
@_(r'\n+')
def newline(self,t ):
self.lineno = t.value.count('\n')
class BasicParser(Parser):
tokens = BasicLexer.tokens
precedence = (
('left', '+', '-'),
('left', '*', '/'),
('right', 'UMINUS'),
)
def __init__(self):
self.env = { }
@_('')
def statement(self, p):
pass
@_('FOR var_assign TO expr THEN statement')
def statement(self, p):
return ('for_loop', ('for_loop_setup', p.var_assign, p.expr), p.statement)
@_('IF condition THEN statement ELSE statement')
def statement(self, p):
return ('if_stmt', p.condition, ('branch', p.statement0, p.statement1))
@_('FUN NAME "(" ")" ARROW statement')
def statement(self, p):
return ('fun_def', p.NAME, p.statement)
@_('NAME "(" ")"')
def statement(self, p):
return ('fun_call', p.NAME)
@_('expr EQEQ expr')
def condition(self, p):
return ('condition_eqeq', p.expr0, p.expr1)
@_('var_assign')
def statement(self, p):
return p.var_assign
@_('NAME "=" expr')
def var_assign(self, p):
return ('var_assign', p.NAME, p.expr)
@_('NAME "=" STRING')
def var_assign(self, p):
return ('var_assign', p.NAME, p.STRING)
@_('expr')
def statement(self, p):
return (p.expr)
@_('expr "+" expr')
def expr(self, p):
return ('add', p.expr0, p.expr1)
@_('expr "-" expr')
def expr(self, p):
return ('sub', p.expr0, p.expr1)
@_('expr "*" expr')
def expr(self, p):
return ('mul', p.expr0, p.expr1)
@_('expr "/" expr')
def expr(self, p):
return ('div', p.expr0, p.expr1)
@_('"-" expr %prec UMINUS')
def expr(self, p):
return ('neg', p.expr)
@_('NAME')
def expr(self, p):
return ('var', p.NAME)
@_('NUMBER')
def expr(self, p):
return ('num', p.NUMBER)
if __name__ == '__main__':
lexer = BasicLexer()
parser = BasicParser()
env = {}
while True:
try:
text = input('basic > ')
except EOFError:
break
if text:
tree = parser.parse(lexer.tokenize(text))
print(tree)
From what I understand, the condition rule is non terminal and the BNF representation is:
condition : expr EQEQ expr
.
If I give as input "a == b" I get the error sly: Syntax error at line 1, token=EQEQ ('var', 'b')
. I know this is the wanted behavior as there's no meaning in giving this input outside a statement, but I want to understand why the condition method is not recalled. Thanks!
Although we call this technique bottom-up parsing, it's really still top-down in the sense that the grammar is applied starting with what will be the root of the parse, which is the start symbol, on this case statement
.
What makes the parse "bottom-up" is the construction of the parse tree. Unlike top-down parsing, in which the next tree node is immediately predicted (and added to the parse tree) before the child nodes are parsed, the bottom-up parser first parses the child nodes and only once all the children have been reduced does it add the parent node. Thus, the root node is the last thing added to the tree; the tree has been built from the bottom up. [Note 1]
This gives the parser a lot more flexibility because it doesn't need to know for certain which production to predict. It can predict a number of possibilities and explore them on parallel. (This can be done in linear time using a state machine.)
In effect, condition
only becomes relevant when it might be the next thing to parse. In your grammar, that only happens after an IF
token, so it doesn't apply to your sample input.
Here's a quick summary of the parsing process. (If you haven't been exposed to finite state machines, this might not be so easy to follow.) It's useful to know that:
[...]
.$
), with the position mark at the beginning. This itemset is expanded as per the step above.With your grammar, the state machine's starting state is the itemset →•:
START → • statement $
statement → • [ $ ]
statement → • FOR var_assign TO expr THEN statement
statement → • IF condition THEN statement ELSE statement
statement → • FUN NAME "(" ")" ARROW statement
statement → • NAME "(" ")"
statement → • expr
statement → • var_assign
expr → • expr "+" expr
expr → • expr "-" expr
expr → • expr "*" expr
expr → • expr "/" expr
expr → • "-" expr
expr → • NAME
expr → • NUMBER
var_assign → • NAME "=" expr
var_assign → • NAME "=" STRING
Note that no production for condition
is in this itemset, because that cannot be at the beginning of a sentence generated from the start symbol.
The first input token is a
, which is a NAME
. That disallows the reduction of statement → •
, so the next state consists of the items with NAME
following the marker:
statement → NAME • "(" ")"
expr → NAME • [ "+", "-", "*", "/", ELSE, $ ]
var_assign → NAME • "=" expr
var_assign → NAME • "=" STRING
Now the next token is EQEQ
. At this point, the state machine is stuck. No shift action is possible and the only reduce action does not have ==
in its lookahead set. So a syntax error is produced.