algorithm parsing bnf infix-notation recursive-descent

Recursive descent Parser: infix to RPN [2]

this the continuation of me trying to make a recursive descent parser--LL(1)-- that takes in infix expressions and outputs RPN. Here is a link to my first question to which @rici did an amazing job of answering and i hope i do his answer justice with this revised implementation. My new grammer is as follows(without support for unary operators):

expr -> term (+|-) term | term
term -> exponent (*|/) exponent | exponent
exponent -> factor ^ factor | factor
factor -> number | ( expr )

in his answer @rici points out with respect to Norwell's grammar:

We normally put the unary negation operator between multiplication and exponentiation

and i have tried to in-cooperate it here:

expr -> term (+|-) term | term
term -> exponent1 (*|/) exponent1 | exponent1
exponent1 -> (+|-) exponent | exponent 
exponent -> factor ^ factor | factor
factor -> number | ( expr )

Coding the first grammar made it such that uary(+/-) numbers cannot be accepted and only binary -/+ operators were the one to be accepted. And the solution works well for the number of problems that i have tried (it could be wrong and hope to learn more). However on closer inspection the second one fails and I am forced reside back to the same "hack" i used in my first. As @rici points out:

By the way, your output is not Reverse Polish Notation (and nor is it unambiguous without parentheses) because you output unary operators before their operands.

to be fair he does point out adding the extra 0 operand which is fine and i think it is going to work. However say if i do 13/-5 this whose equivalent infix would be 13/0-5 and its RPN 13 0 / 5 -. Or perhaps i am misunderstanding his point.

And finally to put the nail in the coffin @rici also points out:

left-recursion elimination would have deleted the distinction between left-associative and right-associative operators

and hence that would mean that it is pretty much impossible to determine the associativity of any of the operators, whereby all are the same and none are different. Moreover that would imply that trying to support many right and left associative operators is going to be very difficult if not impossible for simple LL(1) parsers.

Here is my C code implementation of the grammar:


#include <stdio.h>
#include <stdlib.h>

void error();
void factor();
void expr();
void term();
void exponent1();
void exponent();
void parseNumber();
void match(int t);

char lookahead;
int position=0;
int main() {
  lookahead = getchar();
  expr();
return 0;
}

void error() {
  printf("\nSyntax error at lookahead %c pos: %d\n",lookahead,position);
  exit(1);
}


void factor() {
 if (isdigit(lookahead)) {
      parseNumber();
     // printf("lookahead at %c",lookahead);
  } else if(lookahead =='('){
      match('(');
      expr();
      match(')');
  }else {
      error();
      }
}

void expr(){
    term();
    while(1){
        if(!lookahead||lookahead =='\n') break;
        if(lookahead=='+'|| lookahead=='-'){
            char token  = lookahead;
            match(lookahead);
            term();
            printf(" %c ", token);
        }else {
            break;
        }
    }
}

void term(){
    exponent1();
    while(1){
        if(!lookahead||lookahead =='\n') break;
        if(lookahead=='/'|| lookahead=='*'){
            char token = lookahead;
            match(lookahead);
            exponent1();
            printf(" %c ", token);
        }else {
            break;
        }
    }
}

void exponent1(){
    if(lookahead=='-'||lookahead=='+'){
        char token  = lookahead;
        match(lookahead);
        //having the printf here:
        printf("%c", token);
        //passes this:
        //  2+6*2--5/3      := 2.00 6.00 2.00  *  + 5.00 3.00  /  -
        // -1+((-2-1)+3)*-2 := -1.00 -2.00 1.00  - 3.00  + -2.00  *  + (not actual RPN @rici mentions)
        //but fails at:
        // -(3/2)           := -3.00 2.00  /
        // -3/2             := -3.00 2.00  /
        exponent();
        // but having the printf here
        //printf("%c ", token);
        // fails this -1+((-2-1)+3)*-2 := 1.00 - 2.00 - 1.00  - 3.00  + 2.00 -  *  +
        // since it is supposed to be
        //  1.00 - -2.00 1.00 - 3.00 + -2.00 * +
        //  but satisfies this:
        // -(3/2) := 3.00 2.00  / -
        // (-3/2) := 3.00 - 2.00  /
    }else {
        exponent();
        //error();
    }
}

void exponent(){
    factor();
        while(1){
        if(!lookahead||lookahead =='\n') break;
        if(lookahead=='^'){
            char token  = lookahead;
            match('^');
            factor();
            printf(" ^ ");
        }else {
            break;
        }
    }
}

void parseNumber() {
  double number = 0;
  if (lookahead == '\0'|| lookahead=='\n') return;
  while (lookahead >= '0' && lookahead <= '9') {
    number = number * 10 + lookahead - '0';
    match(lookahead);
  }
  if (lookahead == '.') {
    match(lookahead);
    double weight = 1;
    while (lookahead >= '0' && lookahead <= '9') {
      weight /= 10;
      number = number + (lookahead - '0') * weight;
      match(lookahead);
    }
  }
  printf("%.2f ", number);
  //printf("\ncurrent look ahead at after exiting parseNumber %c\n",lookahead);
}
void match(int t) {
  if (lookahead == t){
    lookahead = getchar();
    position++;
    }
  else error();

}

So does that mean I should give up on LL(1) parsers and perhaps look at LR parsers instead? Or can increasing the lookahead number help and if there are many paths then it could perhaps narrow things down decreasing the lookhead of the lookahead. For instance:

-(5 ;; looks weird

-( 5 ;; could be - ( exp )

--5 ;; could be many things

-- 5 ;; ought to be the -- operator and output say #

EDITs:

I think having a larger lookahead is going to be difficult to coordinate. So perhaps have something like the shunting yard algorithm where i like peek into the next operator and based on the precedence of the operator the alogrthim is going to determine function call to do. Something like using actual stack of the actual running program. So a pop would be a return and a push would be a function call. Not sure how i could coordinate that with recursive descent.
perhaps the precedence of the peek should determine the lookahead length?

Solution

Increasing lookahead doesn't help.
Here is the usual LALR(1) grammar for arithmetical expressions, including exponentiation:
```
expr -> sum
sum -> sum (+|-) product | product
product -> product (*|/) prefix | prefix
prefix -> (+|-) prefix | exponent 
exponent -> atom ^ exponent | atom
atom -> number | ( expr )
```
You can find examples of that model of constructing a grammar all over the internet, although you will also find many examples where the same non-terminal is used throughout and the resulting ambiguities are dealt with using precedence declarations.

Note the structural difference between exponent and the other binary operators. exponent is right-recursive (because exponentiation is right-associative); the other ones are are left-recursive (because the other binary operators are left-associative).
When I said that you could fix the ambiguity of the prefix operator characters by adding an explicit 0, I didn't mean that you should edit your input to insert the 0. That won't work, because (as you note) it gets the precedence of the unary operators wrong. What I meant was that the recursive descent RPN converter should look something like this:
```
void parsePrefix(void) {
  if (lookahead=='-'||lookahead=='+') {
    char token = lookahead;
    match(lookahead);
    fputs("0 ", stdout);
    parsePrefix();
    printf("%c ", token);
  }
  else {
    parseExponent();
  }
}
```
That outputs the 0 precisely where it needs to go.

Warning: The following paragraph is unadulterated opinion which does not conform to StackOverflow's policy. If that will offend you, please don't read it. (And perhaps you should just skip the rest of this answer, in that case.)

IMHO, this is really a hack, but so is the use of RPN. If you were building an AST, you would simply build a unaryOperator AST node with a single operand. There would be no issue of ambiguity since there would be no need to interpret the token again during evaluation. For whatever reason, students who go through the usual compiler theory classes seem to come out of them believing that ASTs are somehow a sophisticated technique which should be avoided until necessary, that left-recursion must be avoided at all costs, and that there is moral value in coding your own LL parser instead of just using a standard available LALR parser generator. I disagree with all of those things. In particular, I recommend that you start by creating an AST, because it will make almost everything else easier. Also, if you want to learn about parsing, start by using a standard tool and focus on writing a clear, self-documenting grammar, and using the information about the syntactic structure of the input you are trying to parse. Later on you can learn how the parser generator works, if you really find that interesting. Similarly, I would never teach trigonometry by starting with the accurate evaluation of the Taylor expansion of the sin() function. That does not provide the student any insights whatsoever about how to use the trigonometric functions (for example, to rotate by an angle), which is surely the most important part of trigonometry. Once the student has a firm understanding of trigonometry, how to use it, and particularly what the demands of precise calculation are in typical problem domains, then they might want to look at Taylor expansions and other calculation techniques. But most will be content to just call sin(), and I think that's just perfect.
If you really want to use a recursive descent parser, go for it. They can certainly be made to work.

What will happen as you code your grammar into an executable program is that you will slowly start to diverge from a representation of the grammar which could be used in other programs, like syntax colorizer and static analysers. In part, that will be because the grammar you are using has lost important aspects of the syntax, including associativity, and these features are instead coded directly into your parser code. The ultimate result is often that only the parser code is maintained, and the theoretical grammar is left to rot. When the code itself is the "grammar", it is no longer usable as practical documentation of your language's syntax.

But I'm not saying that it can't be done. It most certainly can be done, and there are lots and lots of parsers in practical use which do it that way.
The shunting yard algorithm (and operator precedence parsing in general) is a bottom-up technique, like LR parsing, and neither of them require a recursive parser. If for some reason you really want to use recursion, you could use a Pratt parser instead, but there is a huge practical advantage to bottom-up parsing over recursive descent: it eliminates the possibility of uncontrollable stack overflow. So it's hard to recommend the use of recursive descent in production parsers unless the input text is strictly controlled to avoid the possible attacks through stack overflow. That might not apply to a compiler which is not used with unverified inputs. But is that true of your compiler? Have you never downloaded a source tarball from a foreign site and then typed ./configure && make all? :-)