I am trying to build a CAS in Python and am currently stuck on implementing the parser I will use to go from expression string to an Anytree tree I can eventually manipulate and simplify. The issue seems to be that, when parsing, yacc doesn't implement my GROUP
node with it's proper children nodes as defined in my parser grammar spec. I've tried messing with precedences and associativities, switching up the order of the grammar rules but nothing seems to make it parent the nodes properly. What's even stranger is that in debug/verbose mode, it makes a node for the expression when it pattern-matches to it, but it (for some reason) fails to parent it to the GROUP
node when it recognises an LPAREN expression RPAREN
token
Here's my code:
import ply.yacc as yacc
from anytree import Node, RenderTree
import ply.lex as lex
#init token names
tokens = (
'INTEGER',
'PLUS',
'MINUS',
'TIMES',
'DIVIDE',
'LPAREN',
'RPAREN',
)
#init regex rules for said tokens
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
#init regex rule/function for integers
def t_INTEGER(t):
r'\d+'
t.value = int(t.value)
return t
#ignoring whitespace
t_ignore = '\t'
#handling unknown characters (%s inserts whatever is past the second %)
def t_error(t):
print("Illegal character '%s'" %t.value[0] )
t.lexer.skip(1)
#building the lexer
lexer = lex.lex()
#parser grammar spec
#binary operators
def p_expression_binop(p):
'''expression : expression PLUS expression
| expression MINUS expression
| expression TIMES expression
| expression DIVIDE expression '''
p[0] = Node(p[2],children = [Node(p[1]),Node(p[3])])
#brackets/grouping
def p_expression_group(p):
'expression : LPAREN expression RPAREN'
p[0] = Node(p[1], children = [Node(p[2])])
#integers
def p_expression_number(p):
'expression : INTEGER'
p[0] = Node(p[1])
def p_error(p):
print("Input syntax error ")
treeParser = yacc.yacc()
while True:
try:
s = input("Calculate this >")
except EOFError:
break
if not s: break
ParsTree = treeParser.parse(s)
print(RenderTree(ParsTree))
Sample input: (2)+(2)
Sample Output:
Calculate this >(2)+(2)
Node('/+')
├── Node("/+/Node('/GROUP')")
└── Node("/+/Node('/GROUP')")
As you can see, it only creates GROUP
nodes and does not make any child integer nodes under said GROUP
nodes
Edit: Made the code self-contained and adding sample input and output to better explain the problem
The first argument to Node
is expected to be a string. (In fact, it's expected to be a string which labels the node, but I'll get back to that in a minute.) If it's not a string, it's converted to one.
So consider your action:
p[0] = Node(p[2],children = [Node(p[1]),Node(p[3])])
Here, p[2]
is a token, whose value is a string (+
, in the example). But p[1]
and p[3]
are both expression
, and the value of an expression is a Node
, not a string. So anytree
converts it to a string, such as 'Node("\GROUP")'
.
Actually, when I tried the code you pasted, the result was 'Node("/(")'
, because the action for p_expression_group
is:
p[0] = Node(p[1], children = [Node(p[2])])
I guess that in the code you ran, unlike the code you pasted into your question, that rule read:
p[0] = Node("GROUP", children = [Node(p[2])])
Or perhaps you had a different lexical action for \(
. As a side-note, please always verify that the same output you paste into a question is really the output of the code you paste, and not of some other version of the same code, even if that's very similar, in order to avoid confusions.
So the solution is to not call Node
on things which are already Nodes
. I changed the two relevant actions as follows:
#binary operators
def p_expression_binop(p):
'''expression : expression PLUS expression
| expression MINUS expression
| expression TIMES expression
| expression DIVIDE expression '''
p[0] = Node(p[2],children = [p[1], p[3]]) # Removed Node calls
#brackets/grouping
def p_expression_group(p):
'expression : LPAREN expression RPAREN'
p[0] = Node(p[1], children = [p[2]]) # Removed Node call
And then got the "expected" result:
$ python3 parser.py<<<'(2)+(2)'
Generating LALR tables
WARNING: 16 shift/reduce conflicts
Calculate this >Node('/2')
Node('/2')
Node('/+')
├── Node('/+/(')
│ └── Node('/+/(/2')
└── Node('/+/(')
└── Node('/+/(/2')
You can see here that anytree
has its own idea of how to present the label of a Node: it produces a "path" of labels from the parent, grandparent, etc. of the node, using /
as a path separator. That definitely obscures your use of the label, which is effectively a operator name or a value. There's no real need for an AST node to have a unique human-readable label, which seems to be the anytree model. If you want to continue using anytree
, you should consider adding other attributes, perhaps a nodetype
attribute and a value
or operator
attribute, so that you can use the label field as a label (in case you find labels convenient for something).
On the other hand, you might in the long run find that anytree
is not a good match for you parsing problem. (Tree with parent nodes suffer from a number of oddities. For example, you can only put a node into one place, so you can't share it between different trees, or put more than one instance in the same tree. All of these use cases will require an explicit node copy.)