I am working on a project to add support for Cypher clauses on Postgres psql. I am trying to improve the parser performance, resolving conflicts between the rules.I have created a minimal example to illustrate a frequent issue in the implementation.This example is below the description.
A clause consists of commands mixed with options. The options are commands that may or may not be in the clause. In the example below, when executing the program, we can trigger the rule COMMAND id_opt B str_opt
running the clause COMMAND country A "Canada"
. Similarly, we can trigger the rule COMMAND num_opt ab_opt str_opt
running the clause COMMAND 1 A "Canada"
or COMMAND 1 B "Canada"
. The first clause returns a syntax error because of the conflict.
The problem is since id_opt
, str_opt
, and num_opt
are options and can be empty, the clause COMMAND A
can trigger both rules, resulting in a conflict and returning the following warning when compiling the project:
gram.y: warning: 1 reduce/reduce conflict [-Wconflicts-rr]
Creating a unique rule with all options, as in the following example, solves the warning. But I didn't want the id_opt
in the same clause as num_opt
. In this fictitious language, the clause COMMAND 1 name A "Canada"
does not exist. Also, id_opt
only goes with A
. Given this scenario, should I merge all options and handle invalid options later, or should I keep the conflict and avoid invalid option combinations?
command:
COMMAND num_opt id_opt ab_opt str_opt { printf("Clause parsed successfully.\n"); }
;
For a more specific example, I am working on the cypher.y file from AgeSQL repository. This problem occurs in the return_clause
rule. The motive that I am showing a minimal example of is the cypher.y file rules have almost a thousand lines. The minimal example follows below:
gram.l file:
%{
#include "gram.tab.h"
%}
%%
[ \t\n] /* ignore whitespace */
"COMMAND" { return COMMAND; }
"A" { return A; }
"B" { return B; }
[0-9]+ { return NUMBER; }
[a-zA-Z][a-zA-Z0-9_.*]* { return IDENTIFIER; }
("\"")[^"]*("\"")|("\'")[^']*("\'") { return STRING; }
%%
int yywrap(void) {
return 1;
}
gram.y file:
%{
#include <stdio.h>
#include <stdlib.h>
int yylex(void);
void yyerror(const char*);
char u;
%}
%token COMMAND A B IDENTIFIER STRING NUMBER
%%
command:
COMMAND id_opt A str_opt { printf("Clause A parsed successfully.\n"); }
| COMMAND num_opt ab_opt str_opt { printf("Clause B parsed successfully.\n"); }
;
id_opt:
/* empty */
| IDENTIFIER;
;
str_opt:
/* empty */
| STRING
;
num_opt:
/* empty */
| NUMBER
;
ab_opt:
A
| B
;
%%
void yyerror(const char *s) {
fprintf(stderr, "Parse error: %s\n", s);
exit(1);
}
int main(void) {
yyparse();
printf("Parsed variable: %c\n", u);
return 0;
}
Makefile:
gram: gram.tab.c lex.yy.c
gcc -o gram gram.tab.c lex.yy.c
gram.tab.c: gram.y
bison -d gram.y
lex.yy.c: gram.l
flex gram.l
The basic problem you have is that your grammar is ambiguous. An input like COMMAND A
can be parsed by either the clause-A or the clause-B rule. That's because the clause-B has an ab_opt
, which can be either an A
or a B
or nothing, so when the command has an A
and does not have either then NUMBER
or IDENTIFIER
before the A
that would force one clause or the other, there's no way to tell which it should be.
There are a number of ways you can rearrange things to get rid of this ambiguity -- either by splitting or combining rules. For example, you could split the clause-B rule to separate the case where it has an A:
command:
COMMAND id_opt A str_opt { printf("Clause A parsed successfully.\n"); }
| COMMAND num_opt b_opt str_opt { printf("Clause B parsed successfully.\n"); }
| COMMAND NUMBER A str_opt { printf("Clause B with an A parsed successfully.\n"); }
;
Note that this 3rd clause only applies when there is both a NUMBER and an A -- if there's just an A we assume it should be clause A
alternately, combine all the clauses together per Ken W's suggestion, though that will then accept an input like COMMAND IDENTIFIER B
which previously would have been a syntax error. You can check for this case after the parser and give a clearer message than just "syntax error"
command:
COMMAND id_num_opt ab_opt str_opt {
if (isID($2) && isB($3)) {
printf("Can't combine an ID with B in command\n");
} else {
printf("Clause parsed successfully.\n"); } }
;