How can I disable all BNFC built-in rules, like Ident, Integer, or the spaces being used to separate tokens?
I find them useless and annoying, since they interfere with the parsers I'm trying to write.
I already tried to redefine them, but the lexer still generates the rules for them. I could manually delete them from the generated files, but I'm completely against modifying machine-generated code.
Long version on why they are annoying.
I'm just starting to learn how to use BNFC. The first thing I tried was to port some earlier work of mine from Alex to BNFC. In particular, I want to match only "good" Roman numerals. I thought it would be quite simple: a Roman numeral can be seen as a sequence like
<thousand-part> <hundred-part> <tens-part> <unit-part>
where the parts cannot all be empty. So a numeral either has a non-empty thousand-part and can have anything in the rest, or it has an empty thousand-part and then at least one of the hundred-, tens-, or unit-parts must be non-empty. The same reasoning can be repeated at each level until the base case of units.
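As a rough illustration of that decomposition (this is not my Alex or BNFC code), here is a minimal Python sketch. The names optional_part and is_numeral are made up for the example, and the alternatives are copied from the token definitions further down.

```python
import re

# Alternatives copied from the token definitions in this question
# (TokThousands, TokHundreds, TokTens, TokUnits).
THOUSANDS = ["MMM", "MM", "M"]
HUNDREDS = ["CM", "DCCC", "DCC", "DC", "D", "CD", "CCC", "CC", "C"]
TENS = ["IC", "XC", "LXXX", "LXX", "LX", "L", "IL", "XL", "XXX", "XX", "X"]
UNITS = ["IX", "VIII", "VII", "VI", "V", "IV", "III", "II", "I"]


def optional_part(alternatives):
    """Build a regex fragment matching one of the alternatives, or nothing."""
    return "(?:" + "|".join(alternatives) + ")?"


# <thousand-part> <hundred-part> <tens-part> <unit-part>, each part optional.
NUMERAL = re.compile(
    "".join(optional_part(part) for part in (THOUSANDS, HUNDREDS, TENS, UNITS))
)


def is_numeral(text):
    """True if text is a sequence of the four parts that are not all empty."""
    return text != "" and NUMERAL.fullmatch(text) is not None
```

The explicit check for the empty string plays the role of the "not empty" (NE) nonterminals: every part may be missing, but not all of them at once.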
So I came up with the following BNFC grammar, which is more or less a direct translation of what I did in Alex:
N1. Numeral ::= TokThousands HundredNumber ;
N2. Numeral ::= HundredNumberNE ; --NE = Not Empty
N3. HundredNumber ::= ;
N4. HundredNumber ::= HundredNumberNE ;
N5. HundredNumberNE ::= TokHundreds TensNumber ;
N6. HundredNumberNE ::= TensNumberNE ;
N7. TensNumber ::= ;
N8. TensNumber ::= TensNumberNE ;
N9. TensNumberNE ::= TokTens UnitNumber ;
N10. TensNumberNE ::= UnitNumberNE ;
N11. UnitNumber ::= ;
N12. UnitNumber ::= UnitNumberNE ;
N13. UnitNumberNE ::= TokUnits ;
token TokThousands ({"MMM"} | {"MM"} | {"M"}) ; -- No x{m,n} in BNFC regexes?
token TokHundreds ({"CM"} | {"DCCC"} | {"DCC"} | {"DC"} | {"D"} | {"CD"} | {"CCC"} | {"CC"} | {"C"}) ;
token TokTens ({"IC"} | {"XC"} | {"LXXX"} | {"LXX"} | {"LX"} | {"L"} | {"IL"} | {"XL"} | {"XXX"} | {"XX"} | {"X"}) ;
token TokUnits ({"IX"} | {"VIII"} | {"VII"} | {"VI"} | {"V"} | {"IV"} | {"III"} | {"II"} | {"I"}) ;
Now, the problem is that when I build this parser and give it an input like:
MMI
Or, in general, a numeral that has more than one non-empty *-part, the parser gives an error. None of my tokens matches the whole of MMI, but the built-in Ident rule does, so the lexer uses Ident. Since that rule doesn't appear in the grammar, a parse error is raised, although the input string is perfectly fine by the grammar I defined. It's the bogus Ident rule that's in the way.
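To make the effect concrete, here is a small Python sketch of a longest-match lexer. It is not the lexer BNFC generates: the function lex, the ASCII-only approximation of Ident, and the tie-breaking in favour of the earlier rule are all assumptions of this sketch.

```python
import re

# Rules in priority order: my own tokens first, an Ident-like rule last.
# The Ident pattern is an ASCII approximation (a letter followed by
# letters, digits, underscores or primes).
RULES = [
    ("TokThousands", re.compile(r"MMM|MM|M")),
    ("TokHundreds", re.compile(r"CM|DCCC|DCC|DC|D|CD|CCC|CC|C")),
    ("TokTens", re.compile(r"IC|XC|LXXX|LXX|LX|L|IL|XL|XXX|XX|X")),
    ("TokUnits", re.compile(r"IX|VIII|VII|VI|V|IV|III|II|I")),
    ("Ident", re.compile(r"[A-Za-z][A-Za-z0-9_']*")),
]


def lex(text, rules=RULES):
    """Split text into (rule name, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace only separates tokens
            pos += 1
            continue
        best = None  # (rule name, end position) of the longest match so far
        for name, pattern in rules:
            match = pattern.match(text, pos)
            # Strictly longer matches win; on a tie the earlier rule is kept.
            if match and match.end() > pos and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


with_ident = lex("MMI")                # [("Ident", "MMI")]
without_ident = lex("MMI", RULES[:-1])  # [("TokThousands", "MM"), ("TokUnits", "I")]
with_spaces = lex("MM I")              # [("TokThousands", "MM"), ("TokUnits", "I")]
```

With the Ident-like rule present, MMI comes out as a single identifier, because that match is longer than MM. Dropping the rule, or separating the parts with spaces, gives the token sequence the grammar expects.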
Note: I verified that if I separate the different parts with spaces, the input is parsed correctly, but later on I want to use spaces to separate whole numbers, not their tokens.
According to BNFC's documentation:
These types are hard-coded and cannot be value types of rules
This means that there is no way to disable the built-in rules without modifying the generated code. The only option would be to write a script that automatically deletes the bogus rules from the generated files, and to always use a Makefile to build the lexer and parser, so that this step is never forgotten.
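Such a script could look roughly like the Python sketch below. Everything in it is hypothetical: strip_rules is a name made up for the example, and SAMPLE is an invented stand-in, not BNFC's real output format. The actual patterns would have to be worked out by inspecting the generated files.

```python
import re


def strip_rules(generated_text, unwanted_patterns):
    """Return generated_text without the lines matching any unwanted pattern."""
    compiled = [re.compile(pattern) for pattern in unwanted_patterns]
    kept = [
        line
        for line in generated_text.splitlines(keepends=True)
        if not any(pattern.search(line) for pattern in compiled)
    ]
    return "".join(kept)


# Made-up stand-in for a generated lexer file; NOT BNFC's real output format.
SAMPLE = (
    "rule TokThousands\n"
    "rule Ident\n"
    "rule TokUnits\n"
)

cleaned = strip_rules(SAMPLE, [r"\bIdent\b"])
```

In a real build, the Makefile would run this on the generated lexer file right after BNFC and before the lexer generator. Rules that span several lines would need more than a per-line filter.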
It seems like the authors deliberately decided to reduce the flexibility of BNFC by imposing their definition of what an integer literal is, what an identifier should look like, how tokens should be separated, etc. They could have provided default rules with an option to disable them, but they decided that if you don't agree with their definitions, then you shouldn't use their tool at all.