compiler-constructionbisonflex-lexerlanguage-translation

Bison/flex syntax issue


I can't write the correct grammar for parsing this yaml:

- Name: Qwerty
  Values:
    - Name: qq
    - Name: pp
- Name: Chirik
  Values:
    - Name: zzz
- Name:  Wasd
  Values:
    - Name: yyy
    - Name: aaa
    - Name: jjj

This is part of my bison file:

%%
prog: 
   | enum_decl value_list
   ;

enum_decl:
   TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE  { strcpy (enums_tbl[enums_cnt++], $3); vals_cnt = 0;  }

value_list:
   | value_list TKN_DASH TKN_NAME TKN_IDENTIFIER { strcpy(enum_vals[enums_cnt].enum_val[vals_cnt++], $4);  }
   ;
%%

flex:

%%
"-" { return TKN_DASH; }
"Name:" { return TKN_NAME; }
"Values:" { return TKN_VALUE; }


[0-9]+                       {    yylval.integer = atoi (yytext);
                                   return TKN_INTEGER; }

"/*"                          {    RemoveComment (); }

[a-zA-Z0-9]*[a-zA-Z][_a-zA-Z0-9]*  {    strcpy (yylval.string, yytext);
                                   return TKN_IDENTIFIER; }

[ \t] { /* printf("LEX: SPACE parsed and skipped\n"); */ } ;

%%

The problem is that the syntax parser grabs the name (- Name: Chirik) and considers it part of the value_list.

I can't figure out how to fix this.

regards, max


Solution

  • The issue is that the grammar incorrectly describes the pattern you try to match. Your grammar states:

    program
        enum_decl value_list
        
    enum_decl
        TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE
    
    value_list
        |
        value_list TKN_DASH TKN_NAME TKN_IDENTIFIER
        
    

    Converting your text into terminals results in:

    TKN_DASH TKN_NAME TKN_IDENTIFIER
    TKN_VALUE
    TKN_DASH TKN_NAME TKN_IDENTIFIER
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_VALUE
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_VALUE
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    

    The parser will try to match its first production rule: enum_decl value_list. By first trying to build enum_decl, which will consume the tokens: TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE

    Notice that this will result in the following sequence of tokens that are leftover after evaluating enum_decl:

    TKN_DASH TKN_NAME TKN_IDENTIFIER
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_VALUE
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_VALUE
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    TKN_DASH TKN_NAME TKN_IDENTIFIER 
    

    As enum_decl does not consume more tokens, the parser continues constructing the value_list nonterminal, which will consume multiples of: TKN_DASH TKN_NAME TKN_IDENTIFIER Notice that this will match with the next 3 lines, which includes the string: "- Name: Chirik". This explains why you see this underneath the value_list nonterminal.

    This evaluation shows that the grammar is actually describing a different pattern than you tried to implement.

    As I don't have any information about what you originally tried to achieve, I am guessing that you tried to match the pattern: (NAME VALUE NAME*)* If you put this in a context-free grammar you would get:

    prog:
        stmts
        ;
    
    stmts:
        | stmt stmts
        ;
        
    stmt:
        TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE value_list
    
    value_list:
       | value_list TKN_DASH TKN_NAME TKN_IDENTIFIER
       ;
    

    This has 2 shift/reduce conflicts, but they are solvable by using the GLR version of bison (https://www.gnu.org/software/bison/manual/html_node/GLR-Parsers.html), or by further modifying the grammar. Either way, this insight should help you further in parsing the file.

    I have tested this with bison using the GLR parser output, and it works (I simplified the tokens to make the screenshot fit):

    AST