algorithmparsingbnfinfix-notationrecursive-descent

Recursive descent: infix to postfix conversion repeating operators


We recently learned about converting infix to postfix using stacks during our programming course at uni. And I have meaning to write my parser for a while, so I decided to it using recursive descent. I am following along with this: Parsing Expressions by Recursive Descent Theodore Norvell . Here is the grammar he uses:

E --> P {B P}
P --> v | "(" E ")" | U P
B --> "+" | "-" | "*" | "/" | "^"
U --> "-" | "+"

I have attempted at implementing this in C and it works. However If I give it the following input with operators trailing after one another like this:

---+-1-(-(2-1)+3)*-2

It outputs this:

---+-1.00 -2.00 1.00  - 3.00  +  - -2.00 *

it appears to be wrong for the following:

Another peculiar result I am getting is with2+(2^4*(7+2^6)) to which I got:

2.00 2.00 4.00 ^ 7.00 2.00  + 6.00 ^* +

when I expected getting:

 2.00 2.00 4.00 ^ 7.00 2.00 6.00 ^ + * +

I am not sure but do I perhaps need a precedence climbing parser- also suggested in the linked article . However the main question is how would i simplify a trailing pair of operations```---+``? Any help will be really appreciated. Thanks a lot in advance. still a beginner in all this.

Here is the code:

#include <stdio.h>
#include <stdlib.h>

void expr();
void term();
void match(int t);
void error();
void parseNumber();
//E --> P {B P}
//P --> v | "(" E ")" | U P
//B --> "+" | "-" | "*" | "/" | "^"
//U --> "-" | "+"
//
// Erecognizer is
//    E()
//    expect( end )
// 
// E is
//     P
//     while next is a binary operator
//        consume
//        P

char lookahead;

int main() {
  lookahead = getchar();
  expr();
return 0;
}
// E is
//     P
//     while next is a binary operator
//        consume
//        P
void expr() {
  term();
 /* optimized by inlining rest() and removing recursive calls */
  while (1)  {
    if (lookahead == '+') {
      match('+');
      term();
      printf(" + ");
    } else if (lookahead == '-') {
      match('-');
      term();
      printf(" - ");
    }else if (lookahead == '*') {
      match('*');
      term();
      putchar('*');
    } else if (lookahead == '/') {
      match('/');
      term();
      putchar('/');
    } else if (lookahead == '^') {
      match('^');
      term();
      putchar('^');
    }  
    else break;
  }
}

// P is
//     if next is a v
//          consume
//     else if next = "("
//          consume
//          E
//          expect( ")" )
//     else if next is a unary operator
//          consume
//          P
//     else
//          error

void term() {
  if (isdigit(lookahead)) {
      parseNumber();
     // printf("lookahead at %c",lookahead);
  } else if(lookahead =='('){
      match('(');
      expr();
      match(')');
  }
  else if (lookahead =='-' ||lookahead =='+') {
      char sign = lookahead;
      match(lookahead);
      (sign=='+'?putchar('+'):putchar('-'));
      term();
  }else {
      error();
      }
}

void match(int t) {
  if (lookahead == t)
    lookahead = getchar();
  else error();

}
void parseNumber() {
  double number = 0;
  // TODO consume spaces
  if (lookahead == '\0'|| lookahead=='\n') return;
  while (lookahead >= '0' && lookahead <= '9') {
    number = number * 10 + lookahead - '0';
    match(lookahead);
  }
  if (lookahead == '.') {
    match(lookahead);
    double weight = 1;
    while (lookahead >= '0' && lookahead <= '9') {
      weight /= 10;
      number = number + (lookahead - '0') * weight;
      match(lookahead);
    }
  }
  printf("%.2f ", number);
  //printf("\ncurrent look ahead at after exiting parseNumber %c\n",lookahead);
}
void error() {
  printf("Syntax error at lookahead %c\n",lookahead);
  exit(1);
}

Solution

  • The article you cite pretty clearly states that the recursive descent algorithm presented is not a parser: (emphasis added)

    Let's look at a recursive descent recognizer based on this grammar. I call this algorithm a recognizer rather than a parser because all it does is to recognize whether the input is in the language of the grammar or not. It does not produce an abstract syntax tree, or any other form of output that represents the contents of the input.

    That's absolutely correct; the grammar is only suitable for use in a recognizer. What is not mentioned is that if you attempt to modify the algorithm to produce some form of output (other than a simple "yes" or "no", indicating whether or not the expression is in the target language), you will get a structurally incorrect answer.

    That's because it's not really true that:

    We can transform G to an equivalent non-left-recursive grammar G1…

    Or at least, you need to be very careful about what is meant by "equivalent". The new grammar is equivalent in that it recognises the same language. But it does not parse expressions in the same way, and furthermore the left-recursion elimination algorithm eliminates information from the grammar which was necessary to produce a correct parse. (In this case, the necessary information -- the precedence and associativity of each operator -- had already been eliminated from the grammar, presumably for simplification. But even if the grammar had been accurate to start with, left-recursion elimination would have deleted the distinction between left-associative and right-associative operators.)

    Somewhat later in this presentation, under the heading The classic solution, Norvell describes a recursive descent parser which does correctly parse expressions. [Note 1] That's probably the one you wanted to code.

    By the way, your output is not Reverse Polish Notation (and nor is it unambiguous without parentheses) because you output unary operators before their operands. RPN always puts operators after their operands -- which is what makes it unambiguous without parentheses -- and requires that every operand unambiguously specify the number of operands it requires. That usually means writing unary and binary negation differently, so that it is possible to tell them apart, although another option would be to just output an extra 0 operand and let the RPN evaluator treat them as binary operators.

    But RPN is not really a very useful output from a parser. A common output from a parser is an Abstract Syntax Tree, which is a graph structure which describes the syntactic structure of the parsed text. Another common output is so-called "Three Address Code", which is virtual machine code for an imaginary machine with an infinite (or at least very large) number of registers. (Not all the VM opcodes have three addresses, but many of them do, including all the binary arithmetic operators, which name two source registers and a destination register.) And, of course, for a calculator you can just evaluate as you go instead of producing any structured representation.

    Notes:

    1. Perhaps it would be better to say that the grammar G2 would correctly parse expressions if Norvell had chosen a less idiosyncratic precedence order. We normally put the unary negation operator between multiplication and exponentiation, not between addition and multiplication. As long as you only implement multiplication and exact division, Norvell's precedence choice doesn't matter, but if you implement floor division or modulo (that is, the Python semantics for // and %), you'll find that the low precedence of unary negation results in unexpected evaluations. The mistake is possible because negation distributes over multiplication and exact division. But (-3) // 2 is not the same as -(3 // 2), and the expected result of -3 // 2 is the first one, while Norvell's precedence ordering produces the second one.

      I should add that integer division in C is truncating division, not floor division, and C's % operator is remainder, not modulo, so the problem is not evident with C. On the other hand, C lacks an exponentiation operator so you could go with the simpler solution of giving unary negation higher precedence than any binary operator, which is what C in fact does.