c++parsinggrammarlemon

Lemon Parser skips things (Or my misunderstanding)


Updated with more info

I'm having a problem parsing a simple array of elements with Lemon. Can someone enlighten me??

I'm trying to parse this string "[0 0 612 792][100 200]" with mygrammar definition and the parser always skips the first array element and duplicate the last... any idea??

The grammar file is

%token_type { char* }

%include {
    #include <assert.h>
}

%parse_accept { printf("The parser has completed successfully.\n"); }
%syntax_error { fprintf(stderr, "Syntax Error\n"); }
%parse_failure { fprintf(stderr, "Parse failure\n"); }

%start_symbol program

array_value ::= INT_LITERAL(A). {printf("Av: %s\n", A);} 

array_value_list ::=.
array_value_list ::= array_value_list array_value.

array_declaration ::= LBRACKET array_value_list RBRACKET.

array_list ::= array_declaration.
array_list ::= array_list array_declaration.

program ::= array_list END_TOKEN.

I use re2c to get the tokens and the code that call the parser for each token is

  while(token = scan(&scanner, buff_end)) { 
    // Send strings to the parser with NAME tokens  

    if(token == INT_LITERAL) {
      name_length = scanner.cur - scanner.top;
      strncpy(name_str, scanner.top, name_length);
      name_str[name_length] = '\0';
      //printf("Token:Pre: %s\tName: %s\n", tokenStr(token),name_str);
      Parse(parser, token, name_str);
    }
    else {
      //printf("Token: %s\n", tokenStr(token));
      Parse(parser, token, 0);
    }
    // Execute Parse for the last time
    if(token == END_TOKEN) {
      Parse(parser, 0, NULL);
      break;
    }
  }

For the input string "[ 0 -100 612 792][100 200]" the output is:

Av: -100
Av: 612
Av: 792
Av: 792
Av: 200
Av: 200

As you can notice, the first element doesn't appears, and the last one is duplicated.

The grammar out from lemon is:

State 0:
          array_declaration ::= * LBRACKET array_value_list RBRACKET
          array_list ::= * array_declaration
          array_list ::= * array_list array_declaration
          program ::= * array_list END_TOKEN

                      LBRACKET shift        3      
             array_declaration shift        1        /* because array_declaration==array_list */
                    array_list shift        1      
                       program accept

State 1:
          array_declaration ::= * LBRACKET array_value_list RBRACKET
          array_list ::= array_list * array_declaration
          program ::= array_list * END_TOKEN

                      LBRACKET shift        3      
                     END_TOKEN shift        4      
             array_declaration shift-reduce 5      array_list ::= array_list array_declaration

State 2:
          array_value ::= * INT_LITERAL
          array_value_list ::= array_value_list * array_value
          array_declaration ::= LBRACKET array_value_list * RBRACKET

                   INT_LITERAL shift-reduce 0      array_value ::= INT_LITERAL
                      RBRACKET shift-reduce 3      array_declaration ::= LBRACKET array_value_list RBRACKET
                   array_value shift-reduce 2      array_value_list ::= array_value_list array_value

State 3:
      (1) array_value_list ::= *
          array_value_list ::= * array_value_list array_value
          array_declaration ::= LBRACKET * array_value_list RBRACKET

              array_value_list shift        2      
                     {default} reduce       1      array_value_list ::=

State 4:
      (6) program ::= array_list END_TOKEN *

                             $ reduce       6      program ::= array_list END_TOKEN

----------------------------------------------------
Symbols:
    0: $:
    1: INT_LITERAL
    2: LBRACKET
    3: RBRACKET
    4: END_TOKEN
    5: error:
    6: array_value: INT_LITERAL
    7: array_value_list: <lambda> INT_LITERAL
    8: array_declaration: LBRACKET
    9: array_list: LBRACKET
   10: program: LBRACKET

And the output trace for the sample string is:

T__Input 'LBRACKET'
T__Shift 'LBRACKET', go to state 3
T__Return. Stack=[LBRACKET]
T__Input 'INT_LITERAL'
T__Reduce [array_value_list ::=], go to state 3.
T__Shift 'array_value_list', go to state 2
T__Shift 'INT_LITERAL'
T__Return. Stack=[LBRACKET array_value_list INT_LITERAL]
T__Input 'INT_LITERAL'
T__Reduce [array_value ::= INT_LITERAL], go to state 2.
T__Shift 'array_value'
T__Reduce [array_value_list ::= array_value_list array_value], go to state 3.
T__Shift 'array_value_list', go to state 2
T__Shift 'INT_LITERAL'
T__Return. Stack=[LBRACKET array_value_list INT_LITERAL]
T__Input 'INT_LITERAL'
T__Reduce [array_value ::= INT_LITERAL], go to state 2.
T__Shift 'array_value'
T__Reduce [array_value_list ::= array_value_list array_value], go to state 3.
T__Shift 'array_value_list', go to state 2
T__Shift 'INT_LITERAL'
T__Return. Stack=[LBRACKET array_value_list INT_LITERAL]
T__Input 'INT_LITERAL'
T__Reduce [array_value ::= INT_LITERAL], go to state 2.
T__Shift 'array_value'
T__Reduce [array_value_list ::= array_value_list array_value], go to state 3.
T__Shift 'array_value_list', go to state 2
T__Shift 'INT_LITERAL'
T__Return. Stack=[LBRACKET array_value_list INT_LITERAL]
T__Input 'RBRACKET'
T__Reduce [array_value ::= INT_LITERAL], go to state 2.
T__Shift 'array_value'
T__Reduce [array_value_list ::= array_value_list array_value], go to state 3.
T__Shift 'array_value_list', go to state 2
T__Shift 'RBRACKET'
T__Return. Stack=[LBRACKET array_value_list RBRACKET]
T__Input 'LBRACKET'
T__Reduce [array_declaration ::= LBRACKET array_value_list RBRACKET], go to state 0.
T__Shift 'array_declaration', go to state 1
T__Shift 'LBRACKET', go to state 3
T__Return. Stack=[array_declaration LBRACKET]
T__Input 'INT_LITERAL'
T__Reduce [array_value_list ::=], go to state 3.
T__Shift 'array_value_list', go to state 2
T__Shift 'INT_LITERAL'
T__Return. Stack=[array_declaration LBRACKET array_value_list INT_LITERAL]
T__Input 'INT_LITERAL'
T__Reduce [array_value ::= INT_LITERAL], go to state 2.
T__Shift 'array_value'
T__Reduce [array_value_list ::= array_value_list array_value], go to state 3.
T__Shift 'array_value_list', go to state 2
T__Shift 'INT_LITERAL'
T__Return. Stack=[array_declaration LBRACKET array_value_list INT_LITERAL]
T__Input 'RBRACKET'
T__Reduce [array_value ::= INT_LITERAL], go to state 2.
T__Shift 'array_value'
T__Reduce [array_value_list ::= array_value_list array_value], go to state 3.
T__Shift 'array_value_list', go to state 2
T__Shift 'RBRACKET'
T__Return. Stack=[array_declaration LBRACKET array_value_list RBRACKET]
T__Input 'END_TOKEN'
T__Reduce [array_declaration ::= LBRACKET array_value_list RBRACKET], go to state 1.
T__Shift 'array_declaration'
T__Reduce [array_list ::= array_list array_declaration], go to state 0.
T__Shift 'array_list', go to state 1
T__Shift 'END_TOKEN', go to state 4
T__Return. Stack=[array_list END_TOKEN]
T__Input '$'
T__Reduce [program ::= array_list END_TOKEN], go to state 0.
T__Accept!
T__Return. Stack=]

I'm stuck on this error, and I'm pretty sure that is a concept error that I don't understand. Any help is appreciated.

Thanks


Solution

  • You don't show the definition of name_str in your scanner code, but it seems likely that it is an array of char. If that's the case, you are risking a buffer overrun since you never check to make sure that name_length is less than the buffer size; furthermore, you might as well use memcpy instead of strncpy because you already know there is no NUL character in the string your copying. But neither of those things are actually the problem: the problem is that you are copying every token into the same buffer.

    What you are passing to the parser is the address of a NUL-terminated string. The parser does not copy the string; it just stores the address as the semantic value of the token. In other words, the parser assumes that it owns the passed in token string, at least until the parse is complete.

    But in fact the character buffer (name_str) is owned by the scanner, and once it has pushed the token into the parser, it assumes that it is free to do what it will with the character buffer. Which is to overwrite the buffer with the next token.

    Unlike bison, lemon does not reduce immediately when the lookahead wouldn't matter. bison would have reduced INT_LITERAL to array_value as soon as it saw the literal, because the reduction doesn't depend on the lookahead. But lemon always has a lookahead token, so it doesn't reduce INT_LITERAL to array_value until it has received the next token. Unfortunately, if the next token is also an INT_LITERAL, the next token will have overwritten the character buffer before the reduction happens, so the reduction's action will print out the string in the next token. If the next token is a ], then the character buffer will not be overwritten by the scanner, so in that case the current token will be printed, although it was already printed by the previous token's reduction.

    In general, the scanner has no way of knowing for how long the parser will require the token values. There are only two sensible ownership policies for that problem environment:

    The first policy is cleaner, since it doesn't require the two components to agree on a resource management protocol. But the parser generator is not likely to include any mechanism which would allow the implementation of that policy, since it would require some action to be taken at the top of the parser function.

    So you need to use the second policy; the scanner must make a copy of the token string in newly-allocated memory and pass this memory, along with its ownership, to the parser so that the parser must (eventually) free the copy. That's the same protocol which is used by most functioning bison parsers, for the same reason, and the various errors which ensue when the copy is not made are probably the most common obscure bison errors.

    Of course, in this simple case, you could avoid the memory management issues by having the scanner convert the string to an integer.