python regex match multiline

Python Regex get Unique Multiline Matches


Since the background would be far too complicated to explain, I am writing pseudocode. I am only interested in the Python regex pattern, and I hope one of you can help me.

I have the following input text (lots of lines with \n as line separator, condensed to '.'):

.
.
1 Order 
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
.
.
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck
.
.
Order Summary
.
.

I only want to match the text between 'Order' and 'Truck' for each order individually. I would then iterate over the results further along in the program.

My regex (I am splitting it into "start, content, end" for better readability):

pattern = \d\s*Order + [.|\s|\S]* + Truck

When I execute this match, I get one result, beginning at 1 Order and stopping at the second Truck:

1 Order 
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
.
.
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck

I want (in this case) exactly two matches which only include one order's contents:

1 Order 
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck

I hope it's clear what I am looking for. Any help is greatly appreciated.
Thanks in advance, stay safe, stay healthy!

Things you might suggest:


Solution

  • The solution is deceptively simple - use the non-greedy operator ?.

    To begin with, a character class [] matches any single character listed in it, so to match a or b the regex is [ab], not [a|b]: inside a class, | is a literal pipe and . is a literal period. So the content part of your pattern would be [.\s\S].
    Also, \s and \S match whitespace and non-whitespace characters respectively, so between them they cover every character and the period (.) is redundant here.

    So the final content part should look like this: [\s\S]*

    Now for the actual solution:

    The non-greedy ? operator, placed after a quantifier such as +, * or ?, tells the regex to match as few characters as possible. With a plain *, you're using the default greedy version of zero-or-more, telling the regex to match as many as possible (which carries the match past the first Truck, where you want it to stop, all the way to the last one).

    So we add the non-greedy operator right after the *, and the final regex looks like this:

    \d\s*Order[\s\S]*?Truck
    

    Bonus Advice:

    The character class [\s\S] is a neat way to tell the regex to match every character (because every character is either whitespace or not whitespace). A tidier way to get the same effect is the re.DOTALL flag. It does what it says: it tells the regex that . (the dot) should match all characters, including newlines.

    If this is the code you were using:

    re.findall(r'\d\s*Order[\s\S]*?Truck', input_text)
    

    Here's the equivalent code using re.DOTALL (including the solution to the question):

    re.findall(r'\d\s*Order.*?Truck', input_text, re.DOTALL)
    

    As you can see, the .*? will now match everything (including newlines) from Order to Truck.
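
To see the points above side by side, here is a minimal runnable sketch. The sample text is an assumption modeled on the question (the filler lines are made up), and the variable names are only illustrative. It compares the greedy pattern, the non-greedy pattern, and the re.DOTALL variant, and also shows that | inside a character class is a literal pipe.

```python
import re

# Sample input modeled on the question; the filler lines are assumed.
input_text = """\
header line
1 Order
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
filler
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck
filler
Order Summary
"""

# Greedy: [\s\S]* runs to the LAST "Truck", so both orders land in one match.
greedy = re.findall(r"\d\s*Order[\s\S]*Truck", input_text)

# Non-greedy: [\s\S]*? stops at the FIRST "Truck" after each "N Order".
lazy = re.findall(r"\d\s*Order[\s\S]*?Truck", input_text)

# Same idea with re.DOTALL, which lets "." match newlines as well.
dotall = re.findall(r"\d\s*Order.*?Truck", input_text, re.DOTALL)

# Inside a character class, "|" is a literal pipe, not alternation.
pipe_hits = re.findall(r"[a|b]", "a|b")

print(len(greedy), len(lazy), lazy == dotall, pipe_hits)

# Iterate over the per-order blocks, as the question intends to do.
for block in lazy:
    print(block.splitlines()[0])
```

With this sample, the greedy pattern yields a single match spanning both orders, while the non-greedy and re.DOTALL patterns each yield two matches, one per order. If match positions are needed as well, re.finditer takes the same arguments and returns match objects instead of strings.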