I am implementing a grammar using the TatSu python library. My grammar is working OK but there is one rule that is eating quite a bit of time. On a block of around 3000 lines (part of a bigger grammar), if I take this full rule, it takes about 42s to parse the entire block. If I reduce this rule to just a few tokens, the runtime drops from 42s to 33s (~20% improvement).
The rule is shown below and it should match a series of event separated by the '/'.
events
=
'/'%{event}
;
event
=
'D' | 'U' | 'Z' | 'P' | 'L' | 'H' | 'x' | 'X' | 'T' | 'V' | 'l' | 'h' | 't' | 'v' | 'N' | 'A' | 'B' | 'F' | '?' | 'G' | 'R' | 'Q' | 'M'
| 'ForceDown' | 'ForceUp' | 'ForceOff' | 'ForcePrior' | 'CompareLow' | 'CompareHigh' | 'CompareUnknown' | 'CompareOff'
| 'CompareValid' | 'CompareLowWindow' | 'CompareHighWindow' | 'CompareOffWindow' | 'CompareValidWindow' | 'ForceUnknown'
| 'LogicLow' | 'LogicHigh' | 'LogicZ' | 'Unknown' | 'ExpectHigh' | 'ExpectLow' | 'ExpectOff' | 'Marker'
;
If I change event to the following, I get faster parsing.
event
=
/[DUZPLHXT]/
;
So is it possible to improve the rule above in some way to get faster processing? Thanks for any ideas.
As you noted, for a rule with many options which are just tokens it's much more efficient to use patterns (regular expressions).
But runtime ultimately depends on how some rules call upon each other.
A simple optimization you can try is to add a cut (˜
) expression so each event
is tried at most once (although a cut should be implicit in the %
expression).
event
=
(
'D' | 'U' | 'Z' | 'P' | 'L' | 'H' | 'x' | 'X' | 'T' | 'V' | 'l' | 'h' | 't' | 'v' | 'N' | 'A' | 'B' | 'F' | '?' | 'G' | 'R' | 'Q' | 'M'
| 'ForceDown' | 'ForceUp' | 'ForceOff' | 'ForcePrior' | 'CompareLow' | 'CompareHigh' | 'CompareUnknown' | 'CompareOff'
| 'CompareValid' | 'CompareLowWindow' | 'CompareHighWindow' | 'CompareOffWindow' | 'CompareValidWindow' | 'ForceUnknown'
| 'LogicLow' | 'LogicHigh' | 'LogicZ' | 'Unknown' | 'ExpectHigh' | 'ExpectLow' | 'ExpectOff' | 'Marker'
) ~
;
That said, because the rule is so much of the lexical kind, I'd opt for the regular expression.
event
=
/(?x)
'D' | 'U' | 'Z' | 'P' | 'L' | 'H' | 'x' | 'X' | 'T' | 'V' | 'l' | 'h' | 't' | 'v' | 'N' | 'A' | 'B' | 'F' | '?' | 'G' | 'R' | 'Q' | 'M'
| 'ForceDown' | 'ForceUp' | 'ForceOff' | 'ForcePrior' | 'CompareLow' | 'CompareHigh' | 'CompareUnknown' | 'CompareOff'
| 'CompareValid' | 'CompareLowWindow' | 'CompareHighWindow' | 'CompareOffWindow' | 'CompareValidWindow' | 'ForceUnknown'
| 'LogicLow' | 'LogicHigh' | 'LogicZ' | 'Unknown' | 'ExpectHigh' | 'ExpectLow' | 'ExpectOff' | 'Marker'
/
;