I want to write an expression grammar that matches strings like these:
words at the start ONE|ANOTHER wordAtTheEnd
---------^-------- ----^----- --^--
A: alphas B: choice C: alphas
The issue, however, is that part A can contain the keyword "ONE" or "ANOTHER" from part B, so only the last occurrence of the choice keywords should match part B. Here is an example. The string
ZERO ONE or TWO are numbers ANOTHER letsendhere
should be parsed into the fields
A: "ZERO ONE or TWO are numbers"
B: "ANOTHER"
C: "letsendhere"
With pyparsing, I tried the stopOn keyword argument of the OneOrMore expression:
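import pyparsing as pp

choice = pp.Or([pp.Keyword("ONE"), pp.Keyword("ANOTHER")])('B')
# stopOn ends the repetition as soon as `choice` matches at the current position
start = pp.OneOrMore(pp.Word(pp.alphas), stopOn=choice)('A')
end = pp.Word(pp.alphas)('C')
expr = (start + choice) + end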
But this does not work. For the sample string I get the ParseException:
Expected end of text (at char 12), (line:1, col:13)
"ZERO ONE or >!<TWO are numbers ANOTHER text"
This makes sense, because stopOn stops on the first occurrence of choice, not the last occurrence. How can I write a grammar that stops on the last occurrence instead? Maybe I need to resort to a context-sensitive grammar?
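Sometimes you have to try to "be the parser". What is it about the "last occurrence of X" that distinguishes it from the other X's? One way to say this is "an X that is not followed by any more X's". In pyparsing terms, that is the expression followed by a negative lookahead: SkipTo(expr) succeeds only if another match of expr appears later in the input, and ~FollowedBy(...) negates that check without consuming anything. With pyparsing, you could write a helper function like this: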
from pyparsing import FollowedBy, SkipTo

def last_occurrence_of(expr):
    return expr + ~FollowedBy(SkipTo(expr))
Here it is in use as a stopOn argument to OneOrMore:
from pyparsing import OneOrMore, Word, alphas, nums

integer = Word(nums)
word = Word(alphas)
list_of_words_and_ints = OneOrMore(integer | word, stopOn=last_occurrence_of(integer)) + integer
print(list_of_words_and_ints.parseString("sldkfj 123 sdlkjff 123 lklj lkj 2344 234 lkj lkjj"))
prints:
['sldkfj', '123', 'sdlkjff', '123', 'lklj', 'lkj', '2344', '234']
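Applied to the grammar from the question, the same helper might look like the sketch below. This is an untested-by-the-asker adaptation, not part of the original answer's code: it assumes the keywords are "ONE" and "ANOTHER" as described in the question, keeps the result names A, B and C, and joins the words of part A with single spaces (the variable name part_a is only for illustration).

```python
import pyparsing as pp

def last_occurrence_of(expr):
    # expr, provided that no further match of expr appears later in the input
    return expr + ~pp.FollowedBy(pp.SkipTo(expr))

word = pp.Word(pp.alphas)
choice = pp.Keyword("ONE") | pp.Keyword("ANOTHER")

# Keep consuming words until we are standing on the *last* ONE/ANOTHER.
start = pp.OneOrMore(word, stopOn=last_occurrence_of(choice))
expr = start("A") + choice("B") + word("C")

sample = "ZERO ONE or TWO are numbers ANOTHER letsendhere"
result = expr.parseString(sample, parseAll=True)

# Part A comes back as a list of words; join them to get a single string.
part_a = " ".join(result["A"])
print(part_a)       # ZERO ONE or TWO are numbers
print(result["B"])  # ANOTHER
print(result["C"])  # letsendhere
```

Note that the earlier "ONE" is consumed as an ordinary word of part A, because the lookahead still finds "ANOTHER" further along in the input.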