I can't write the correct grammar for parsing this yaml:
- Name: Qwerty
Values:
- Name: qq
- Name: pp
- Name: Chirik
Values:
- Name: zzz
- Name: Wasd
Values:
- Name: yyy
- Name: aaa
- Name: jjj
This is part of my bison file:
%%
prog:
| enum_decl value_list
;
enum_decl:
TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE { strcpy (enums_tbl[enums_cnt++], $3); vals_cnt = 0; }
value_list:
| value_list TKN_DASH TKN_NAME TKN_IDENTIFIER { strcpy(enum_vals[enums_cnt].enum_val[vals_cnt++], $4); }
;
%%
flex:
%%
"-" { return TKN_DASH; }
"Name:" { return TKN_NAME; }
"Values:" { return TKN_VALUE; }
[0-9]+ { yylval.integer = atoi (yytext);
return TKN_INTEGER; }
"/*" { RemoveComment (); }
[a-zA-Z0-9]*[a-zA-Z][_a-zA-Z0-9]* { strcpy (yylval.string, yytext);
return TKN_IDENTIFIER; }
[ \t] { /* printf("LEX: SPACE parsed and skipped\n"); */ } ;
%%
The problem is that the syntax parser grabs the name (- Name: Chirik) and considers it part of the value_list.
I can't figure out how to fix this.
regards, max
The issue is that the grammar incorrectly describes the pattern you try to match. Your grammar states:
program
enum_decl value_list
enum_decl
TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE
value_list
|
value_list TKN_DASH TKN_NAME TKN_IDENTIFIER
Converting your text into terminals results in:
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
The parser will try to match its first production rule: enum_decl value_list. By first trying to build enum_decl,
which will consume the tokens: TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE
Notice that this will result in the following sequence of tokens that are leftover after evaluating enum_decl:
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_VALUE
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
TKN_DASH TKN_NAME TKN_IDENTIFIER
As enum_decl does not consume more tokens, the parser continues constructing the value_list nonterminal, which will consume multiples of: TKN_DASH TKN_NAME TKN_IDENTIFIER
Notice that this will match with the next 3 lines, which includes the string: "- Name: Chirik".
This explains why you see this underneath the value_list nonterminal.
This evaluation shows that the grammar is actually describing a different pattern than you tried to implement.
As I don't have any information about what you originally tried to achieve, I am guessing that you tried to match the pattern: (NAME VALUE NAME*)* If you put this in a context-free grammar you would get:
prog:
stmts
;
stmts:
| stmt stmts
;
stmt:
TKN_DASH TKN_NAME TKN_IDENTIFIER TKN_VALUE value_list
value_list:
| value_list TKN_DASH TKN_NAME TKN_IDENTIFIER
;
This has 2 shift/reduce conflicts, but they are solvable by using the GLR version of bison (https://www.gnu.org/software/bison/manual/html_node/GLR-Parsers.html), or by further modifying the grammar. Either way, this insight should help you further in parsing the file.
I have tested this with bison using the GLR parser output, and it works (I simplified the tokens to make the screenshot fit):