This is my lemon parser grammar:
%nonassoc IMPLICATION.
%nonassoc PERIOD.
%nonassoc NEWLINE.
%nonassoc END.
%nonassoc STRING.
program ::= in END.
in ::= .
in ::= in rule NEWLINE.
in ::= in rule.
rule ::= STRING(A) IMPLICATION STRING(B) PERIOD. {cout<<A->token<<endl; cout<<B->token<<endl;}
My input string is
p<-body1.
q<-body3.
I am expecting the output to be
p
body1
q
body3
but instead I am getting the output as
q
q
\n (Empty line here)
\n (Empty line here)
I am certain I am passing the tokens in the right order. I have verified this, because the parser throws a syntax/parser error when given wrong input.
Here is the code that I use to pass tokens to the parser:
do
{
    token = lexer.scan(); // returns an int with the type of token
    Token* t = new Token(lexer.getTokenValue().c_str());
    lpmlnParse(pParser, token, t);
} while (token != PARSE_TOKEN_END);
I am at a loss as to what is going wrong. Can someone point me in the right direction?
This is only a guess, because there is no indication of how the scanner works, what the value of lexer.getTokenValue() is, or how the Token constructor uses its argument.
But let us imagine that the lexer object includes an internal std::string member which is assigned the matched text after every token is scanned:
struct lexer {
    // ...
    int scan() {
        int toke;
        const char* start = current_;
        /* re2c stuff */
        tstring_.assign(start, current_ - start);
        return toke;
    }
    const std::string& getTokenValue() const {
        return tstring_;
    }
    std::string tstring_;
    const char* current_;
};
And suppose that a Token includes a const char* member (instead of a std::string):
struct Token {
    explicit Token(const char* s) : str_(s) {}
    const char* str_;  // points at the caller's characters; no copy is made
};
That would at least explain the observed behaviour.
Each successive call to lexer.scan() overwrites the contents of tstring_. (In the general case, std::string::assign might reallocate the internal character array, but since modern C++ libraries use the short-string optimization and all the tokens in the example are short strings, that is not going to happen here.)
Since neither std::string::c_str nor the Token constructor makes a copy of the characters, the end result is that the newly-created Token has a pointer to a mutable internal buffer which will be overwritten (or worse, deleted) as the scan progresses. Consequently, the string value of the Token will be different when it is observed in the reduction action than it was when the Token was first created.
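The aliasing described above can be reproduced in isolation. The following is a minimal sketch, not the asker's code: DemoLexer, ShallowToken and first_token_after_second_scan are hypothetical names, and a fixed char array stands in for the std::string's internal buffer so that the demonstration has well-defined behaviour.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for the lexer: one buffer is reused for every token.
// A char array replaces std::string here so that reading through the old
// pointer is well-defined.
struct DemoLexer {
    char buffer[32] = {};
    void scan(const char* text) {
        // Overwrite the previous token's characters in place.
        std::strncpy(buffer, text, sizeof buffer - 1);
        buffer[sizeof buffer - 1] = '\0';
    }
    const char* getTokenValue() const { return buffer; }
};

// Stores only the pointer it is given, like the Token guessed at above.
struct ShallowToken {
    explicit ShallowToken(const char* s) : str_(s) {}
    const char* str_;
};

// Returns what the first token appears to contain after a second scan.
std::string first_token_after_second_scan() {
    DemoLexer lexer;
    lexer.scan("p");
    ShallowToken first(lexer.getTokenValue());  // points into lexer.buffer
    lexer.scan("<-");                           // overwrites lexer.buffer
    return first.str_;                          // now reads the newer text
}
```

Calling first_token_after_second_scan() yields "<-" rather than "p", because the token never owned its characters.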
That is still not enough to explain why q is printed by the rule which presumably reduced p<-body1.
Unlike bison, the lemon parser does not attempt to optimize lookaheads. bison-generated parsers will perform reductions before the lookahead token is requested if the lookahead token is not required to decide whether to reduce or shift. In contrast, lemon-generated parsers only reduce when the lookahead token is available. In this case, the reduction of the production rule ::= STRING(A) IMPLICATION STRING(B) PERIOD. does not depend on the token following the PERIOD, but the lemon parser will still wait for the next token.
From the grammar, one might expect the next token to be NEWLINE, but in that case the output should show two newline characters (or four blank lines, since the semantic action also prints a newline). Since that is not the case, we might speculate that the lexer is skipping newline characters rather than returning a NEWLINE token. If that were the case, the grammar would still work because the NEWLINE token is optional (both in rule and in rule NEWLINE are valid right-hand sides). Then the lookahead token would be the following STRING token, which would be q. And the lookahead token after q<-body3. would be END, rather than NEWLINE, so it is plausible that the corresponding token string would be empty, rather than a newline.
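To check that this chain of guesses actually produces the reported output, here is a small hand-rolled simulation. It is not lemon and not the asker's lexer: AliasToken and simulate are hypothetical names, the token list assumes that newlines are skipped and that END carries an empty string, and the "parser" simply delays each reduction until the next token has arrived.

```cpp
#include <string>
#include <vector>

// Hypothetical token: refers to the lexer's shared text instead of owning a copy.
struct AliasToken {
    const std::string* text;
};

// Returns the lines that the rule's semantic action would print.
std::vector<std::string> simulate() {
    // Token texts in arrival order. Assumptions: the lexer skips newlines,
    // and the END token's text is empty.
    const std::vector<std::string> input = {
        "p", "<-", "body1", ".", "q", "<-", "body3", ".", ""};

    std::string shared;             // plays the role of tstring_
    std::vector<AliasToken> stack;  // symbols of the rule being matched
    std::vector<std::string> printed;

    for (const std::string& lexeme : input) {
        shared = lexeme;  // each scan overwrites the shared text
        if (stack.size() == 4) {
            // The rule was complete one token ago, but the reduction only
            // runs now that the lookahead token has been delivered.
            printed.push_back(*stack[0].text);  // A
            printed.push_back(*stack[2].text);  // B
            stack.clear();
        }
        stack.push_back(AliasToken{&shared});
    }
    return printed;
}
```

Under these assumptions simulate() returns q, q, and two empty strings, which matches the output in the question: every token reports whatever text the lexer scanned most recently.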
Evidently, if all of the speculation above is valid, the solution would be to make a copy of the token string, for example by replacing const char* str_; with std::string str_; in the Token object. In that case, it would be reasonable to replace the const char* constructor with a const std::string& constructor, or even a simple std::string constructor, thus avoiding the need to call std::string::c_str().
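A minimal sketch of that change, assuming the Token layout guessed at earlier (the member name str_ and the helper token_text_after_source_changes are illustrative, not taken from the asker's code):

```cpp
#include <string>
#include <utility>

// Owns its text, so later scans cannot change it.
struct Token {
    explicit Token(std::string s) : str_(std::move(s)) {}
    std::string str_;
};

// Hypothetical check: the token keeps its value after the source string changes.
std::string token_text_after_source_changes() {
    std::string lexer_text = "p";  // stands in for lexer.getTokenValue()
    Token t(lexer_text);           // copies the characters into the token
    lexer_text = "q";              // the next scan overwrites the lexer's string
    return t.str_;                 // still the original text
}
```

With a constructor like this, the call site in the question would become new Token(lexer.getTokenValue()), and the semantic action would read the token's own copy of the text.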