compiler-construction bison flex-lexer language-translation

Bison/flex syntax issue

I can't write the correct grammar for parsing this yaml:

- Name: Qwerty
  Values:
    - Name: qq
    - Name: pp
- Name: Chirik
  Values:
    - Name: zzz
- Name:  Wasd
  Values:
    - Name: yyy
    - Name: aaa
    - Name: jjj

This is part of my bison file:

%%
prog: 
   | enum_decl value_list
   ;

enum_decl:
   TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE  { strcpy (enums_tbl[enums_cnt++], $3); vals_cnt = 0;  }

value_list:
   | value_list TKN_DASH TKN_NAME TKN_IDENTIFIER { strcpy(enum_vals[enums_cnt].enum_val[vals_cnt++], $4);  }
   ;
%%

flex:

%%
"-" { return TKN_DASH; }
"Name:" { return TKN_NAME; }
"Values:" { return TKN_VALUE; }


[0-9]+                       {    yylval.integer = atoi (yytext);
                                   return TKN_INTEGER; }

"/*"                          {    RemoveComment (); }

[a-zA-Z0-9]*[a-zA-Z][_a-zA-Z0-9]*  {    strcpy (yylval.string, yytext);
                                   return TKN_IDENTIFIER; }

[ \t] { /* printf("LEX: SPACE parsed and skipped\n"); */ } ;

%%

The problem is that the syntax parser grabs the name (- Name: Chirik) and considers it part of the value_list.

I can't figure out how to fix this.

regards, max

Solution

The issue is that the grammar incorrectly describes the pattern you try to match. Your grammar states:

program
    enum_decl value_list
    
enum_decl
    TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE

value_list
    |
    value_list TKN_DASH TKN_NAME TKN_IDENTIFIER

Converting your text into terminals results in:

TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_DASH TKN_NAME TKN_IDENTIFIER

The parser will try to match its first production rule: enum_decl value_list. By first trying to build enum_decl, which will consume the tokens: TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE

Notice that this will result in the following sequence of tokens that are leftover after evaluating enum_decl:

TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_DASH TKN_NAME TKN_IDENTIFIER 
TKN_DASH TKN_NAME TKN_IDENTIFIER

As enum_decl does not consume more tokens, the parser continues constructing the value_list nonterminal, which will consume multiples of: TKN_DASH TKN_NAME TKN_IDENTIFIER Notice that this will match with the next 3 lines, which includes the string: "- Name: Chirik". This explains why you see this underneath the value_list nonterminal.

This evaluation shows that the grammar is actually describing a different pattern than you tried to implement.

As I don't have any information about what you originally tried to achieve, I am guessing that you tried to match the pattern: (NAME VALUE NAME*)* If you put this in a context-free grammar you would get:

prog:
    stmts
    ;

stmts:
    | stmt stmts
    ;
    
stmt:
    TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE value_list

value_list:
   | value_list TKN_DASH TKN_NAME TKN_IDENTIFIER
   ;

This has 2 shift/reduce conflicts, but they are solvable by using the GLR version of bison (https://www.gnu.org/software/bison/manual/html_node/GLR-Parsers.html), or by further modifying the grammar. Either way, this insight should help you further in parsing the file.

I have tested this with bison using the GLR parser output, and it works (I simplified the tokens to make the screenshot fit):