pythonpyparsing

How to specify a pyparsing expression that has two parts, the length of each may varies but their sum is fixed?


I need to specify an expression that is defined as ambn, such that:

Here's how the current code looks like, simplified:

from pyparsing import Char, Combine

def ab(t):
  first = Char('a')[0, t - 1]
  second = Char('b')[1, t]
  
  expression = first + second
  expression.add_condition(
    lambda result: len(result) == t,
    message = f'Total length must be {t}'
  )
    
  return Combine(expression)

The expression, however, consumes all it could find and calls the condition function on that result without backtracking. For example:

grammar = (ab(4) | Char('b'))[1, ...].set_name('grammar')
grammar.parse_string('abbbb', parse_all = True)
ParseException: Expected grammar, found 'abbbb'  (at char 0), (line:1, col:1)

The result I want is ['abbb', 'b'], which would be achieved if the expression in question was specified as:

expression = Or([
  Char('a')[m] + Char('b')[t - m]
  for m in range(0, t)
])

...but that looks unnecessarily verbose.

Is there a better way?


Solution

  • I rewrote your second approach using the '|' operator, to create a MatchFirst expression instead of an Or:

    a = pp.Char('a')
    b = pp.Char('b')
    
    def ab(t):
        expr = b[t]
        for i in range(1, t):
            expr |= a[i] + b[t-i]
        return pp.Combine(expr)
    

    It also defines the first term using b[t] and uses a range that starts at 1 instead of the default of 0.

    Then I tested using run_tests:

    ab5 = ab(5)
    ab5.run_tests("""\
        bbbbb
        abbbb
        aabbb
        aaabb
        aaaab
        """)
    
    ab4b = ab(4) + b
    ab4b.run_tests("""\
        bbbbb
        abbbb
        aabbb
        aaabb
        """)
    

    And both expressions gave the desired results.

    Your first approach using a condition fails for the exact reason you state - the repetition of 'b's doesn't know when to stop, so it reads beyond and into the 'b' which is intended to be parsed as the second term. (unlike regex, pyparsing doesn't do any backtracking).

    So instead of using a condition, I changed the b term of ab() to be a Forward, and added a parse action to the a term to insert an expression into the b term containing the correct number of b's.

    def ab(t):
        a_s = a[0, t-1]
        b_s = pp.Forward()
        expr = a_s + b_s
    
        def dynamic_b_term(a_chars):
            b_s << b[t - len(a_chars)]
    
        a_s.add_parse_action(dynamic_b_term)
    
        return pp.Combine(expr)
    

    This also passes the run_tests for ab(5) and ab(4) + b.