grammartatsu

How to optimize this grammar rule?


I am implementing a grammar using the TatSu python library. My grammar is working OK but there is one rule that is eating quite a bit of time. On a block of around 3000 lines (part of a bigger grammar), if I take this full rule, it takes about 42s to parse the entire block. If I reduce this rule to just a few tokens, the runtime drops from 42s to 33s (~20% improvement).

The rule is shown below and it should match a series of event separated by the '/'.

events
    =
    '/'%{event}
    ;
event
    =
    'D' | 'U' | 'Z' | 'P' | 'L' | 'H' | 'x' | 'X' | 'T' | 'V' | 'l' | 'h' | 't' | 'v' | 'N' | 'A' | 'B' | 'F' | '?' | 'G' | 'R' | 'Q' | 'M'
    | 'ForceDown' | 'ForceUp' | 'ForceOff' | 'ForcePrior' | 'CompareLow' | 'CompareHigh' | 'CompareUnknown' | 'CompareOff'
    | 'CompareValid' | 'CompareLowWindow' | 'CompareHighWindow' | 'CompareOffWindow' | 'CompareValidWindow' | 'ForceUnknown'
    | 'LogicLow' | 'LogicHigh' | 'LogicZ' | 'Unknown' | 'ExpectHigh' | 'ExpectLow' | 'ExpectOff' | 'Marker'
    ;

If I change event to the following, I get faster parsing.

event
  =
  /[DUZPLHXT]/
  ;

So is it possible to improve the rule above in some way to get faster processing? Thanks for any ideas.


Solution

  • As you noted, for a rule with many options which are just tokens it's much more efficient to use patterns (regular expressions).

    But runtime ultimately depends on how some rules call upon each other.

    A simple optimization you can try is to add a cut (˜) expression so each event is tried at most once (although a cut should be implicit in the % expression).

    event
        =
        (
        'D' | 'U' | 'Z' | 'P' | 'L' | 'H' | 'x' | 'X' | 'T' | 'V' | 'l' | 'h' | 't' | 'v' | 'N' | 'A' | 'B' | 'F' | '?' | 'G' | 'R' | 'Q' | 'M'
        | 'ForceDown' | 'ForceUp' | 'ForceOff' | 'ForcePrior' | 'CompareLow' | 'CompareHigh' | 'CompareUnknown' | 'CompareOff'
        | 'CompareValid' | 'CompareLowWindow' | 'CompareHighWindow' | 'CompareOffWindow' | 'CompareValidWindow' | 'ForceUnknown'
        | 'LogicLow' | 'LogicHigh' | 'LogicZ' | 'Unknown' | 'ExpectHigh' | 'ExpectLow' | 'ExpectOff' | 'Marker'
        ) ~
        ;
    

    That said, because the rule is so much of the lexical kind, I'd opt for the regular expression.

    event
        =
        /(?x)
        'D' | 'U' | 'Z' | 'P' | 'L' | 'H' | 'x' | 'X' | 'T' | 'V' | 'l' | 'h' | 't' | 'v' | 'N' | 'A' | 'B' | 'F' | '?' | 'G' | 'R' | 'Q' | 'M'
        | 'ForceDown' | 'ForceUp' | 'ForceOff' | 'ForcePrior' | 'CompareLow' | 'CompareHigh' | 'CompareUnknown' | 'CompareOff'
        | 'CompareValid' | 'CompareLowWindow' | 'CompareHighWindow' | 'CompareOffWindow' | 'CompareValidWindow' | 'ForceUnknown'
        | 'LogicLow' | 'LogicHigh' | 'LogicZ' | 'Unknown' | 'ExpectHigh' | 'ExpectLow' | 'ExpectOff' | 'Marker'
        /
        ;