c++ parsing lemon

lemon + re2c not getting correct rule resolution


This is my lemon parser grammar

%nonassoc IMPLICATION.
%nonassoc PERIOD.
%nonassoc NEWLINE.
%nonassoc END.
%nonassoc STRING.

program ::= in END.
in ::= .
in ::= in rule NEWLINE.
in ::= in rule.
rule ::= STRING(A) IMPLICATION STRING(B) PERIOD. {cout<<A->token<<endl; cout<<B->token<<endl;}

My input string is

p<-body1.
q<-body3.

I am expecting the output to be

p
body1
q
body3

but instead I am getting the output as

q
q
\n (Empty line here)
\n (Empty line here)

I am certain I am passing the tokens in the right order. I have verified this, since the parser throws a syntax error when given wrong input.

Here is the code that I use to pass tokens to parser

do
{
    token = lexer.scan(); // returns an int with the type of token 
    Token* t = new Token(lexer.getTokenValue().c_str());

    lpmlnParse(pParser, token, t);
}while(token != PARSE_TOKEN_END);

I am at a loss as to what is going wrong. Can someone point me in the right direction?


Solution

  • This is still a guess, because there is no indication how the scanner works, or what the value of lexer.getTokenValue() is, or how the Token constructor uses its argument.

    But let us imagine that the lexer object includes a private std::string member which is assigned to the matched text after every token is scanned:

    struct lexer {
      // ...
      int scan() {
        int toke;
        const char* start = current_;
        /* re2c stuff */
        tstring_.assign(start, current_ - start);
        return toke;
      }
      const std::string& getTokenValue() const {
        return tstring_;
      }
      std::string tstring_;
      const char* current_;
    };
    

    And suppose that a Token includes a const char* member (instead of a std::string):

    struct Token {
      explicit Token(const char* s) : str_(s) {}
      const char* str_;
    };
    

    That would at least explain the observed behaviour.

    Each successive call to lexer.scan() overwrites the contents of tstring_. (In the general case, std::string::assign might reallocate the internal character array, but since modern C++ libraries use the short-string optimization and all the tokens in the example code are short strings, that is not going to happen here.)

    Since neither std::string::c_str nor the Token constructor makes a copy of the characters, the end result is that the newly-created Token has a pointer to a mutable internal buffer which will be overwritten (or worse, freed) as the scan progresses.

    Consequently, the string value of the Token will be different when it is observed in the reduction action than it was when the Token was first created.

    That is still not enough to explain why q is printed by the rule which presumably reduced p<-body1.

    Unlike bison, the lemon parser does not attempt to optimize lookaheads. bison-generated parsers will perform reductions before the lookahead token is requested if the lookahead token is not required to decide whether to reduce or shift. In contrast, lemon-generated parsers only reduce when the lookahead token is available. In this case, the reduction of the production rule ::= STRING(A) IMPLICATION STRING(B) PERIOD. does not depend on the token following the PERIOD, but the lemon-parser will still wait for the next token.

    From the grammar, one might expect that the next token was NEWLINE, but in that case the output should show two newline characters (or four blank lines, since the semantic action also prints a newline). Since that is not the case, we might speculate that the lexer is skipping newline characters rather than returning a NEWLINE token. If that were the case, the grammar would still work because the NEWLINE token is optional (both "in rule" and "in rule NEWLINE" are valid right-hand sides for in). Then the lookahead token would be the following STRING token, which would be a q. And the lookahead token after q<-body3. would be an END, rather than a NEWLINE, so it is plausible that the corresponding token string would be empty, rather than a newline.
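The chain of guesses above can be checked with a small simulation. This is only a sketch under the same assumptions: every token stores a pointer into one reused buffer, the lexer skips newlines, and the reduction of a rule runs only once the next token has arrived. The names Kind, PtrToken and simulate are made up for illustration and are not part of lemon or re2c. A plain char array stands in for the lexer's string so that reading through the stale pointer is well defined here.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical token kinds standing in for the generated PARSE_TOKEN_* values.
enum Kind { STRING, IMPLICATION, PERIOD, END };

// A token that stores only a pointer to its text, as assumed above.
struct PtrToken {
  const char* str_;
};

// Returns the lines the rule action would print, in order.
std::vector<std::string> simulate() {
  struct Lexeme { Kind kind; const char* text; };
  // Newlines are assumed to be skipped by the lexer, so none appear here.
  const Lexeme input[] = {
      {STRING, "p"}, {IMPLICATION, "<-"}, {STRING, "body1"}, {PERIOD, "."},
      {STRING, "q"}, {IMPLICATION, "<-"}, {STRING, "body3"}, {PERIOD, "."},
      {END, ""}};

  char buffer[16];              // one buffer reused for every token's text
  std::vector<PtrToken> stack;  // stand-in for the parser stack
  std::vector<std::string> printed;
  bool pending = false;         // a complete rule is waiting for its lookahead

  for (const Lexeme& lx : input) {
    std::strcpy(buffer, lx.text);  // scanning overwrites the previous text
    if (pending) {
      // The reduction runs only now, with the lookahead already scanned.
      assert(stack.size() == 4);         // A, IMPLICATION, B, PERIOD
      printed.push_back(stack[0].str_);  // A->token
      printed.push_back(stack[2].str_);  // B->token
      stack.clear();
      pending = false;
    }
    if (lx.kind == END) break;
    stack.push_back(PtrToken{buffer});  // every token points at the same buffer
    if (lx.kind == PERIOD) pending = true;
  }
  return printed;
}
```

Under these assumptions simulate() returns "q", "q" and then two empty strings, which matches the output reported in the question.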

    Evidently, if all of the speculation above is valid, the solution would be to make a copy of the token string, for example by replacing const char* str_; with std::string str_; in the Token object. And in that case, it would be reasonable to replace the const char* constructor with a const std::string& constructor, or even a simple std::string constructor, thus avoiding the necessity to use std::string::c_str().
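A minimal sketch of that fix, assuming the Token and lexer look roughly as guessed above. OwningToken, FakeLexer and collect_tokens are hypothetical names used only for this illustration. FakeLexer imitates a lexer that reuses one std::string member for every scanned token.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Token that owns a copy of its text instead of pointing into the lexer.
struct OwningToken {
  explicit OwningToken(std::string s) : str_(std::move(s)) {}
  std::string str_;
};

// Hypothetical stand-in for the real lexer: tstring_ is overwritten per scan.
struct FakeLexer {
  explicit FakeLexer(std::vector<std::string> texts)
      : texts_(std::move(texts)) {}
  bool scan() {
    if (next_ >= texts_.size()) return false;
    tstring_ = texts_[next_++];  // overwrites the previous token text
    return true;
  }
  const std::string& getTokenValue() const { return tstring_; }

  std::vector<std::string> texts_;
  std::size_t next_ = 0;
  std::string tstring_;
};

// Scans the example input and keeps every token alive past later scans.
std::vector<OwningToken> collect_tokens() {
  FakeLexer lexer({"p", "<-", "body1", ".", "q", "<-", "body3", "."});
  std::vector<OwningToken> tokens;
  while (lexer.scan()) {
    // The std::string parameter is copied from the lexer's buffer here,
    // so later scans cannot change this token's text.
    tokens.emplace_back(lexer.getTokenValue());
  }
  assert(tokens.size() == 8);
  return tokens;
}
```

With owning tokens, the first and third tokens still read p and body1 after the whole input has been scanned, so the reduction action sees the expected text no matter when lemon runs it. In the original loop the call would then become new Token(lexer.getTokenValue()), without c_str().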