Since the background would be far too complicated to explain, I am writing pseudocode. I am only interested in the Python regex pattern, and I hope one of you can help me.
I have the following input text (lots of lines with \n as the line separator, condensed to '.'):
.
.
1 Order
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
.
.
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck
.
.
Order Summary
.
.
I only want to match the text between 'Order' and 'Truck' for each order individually; I would then iterate over the results further along in the program.
My regex (I am splitting it into "start + content + end" for better readability):
pattern = \d\s*Order + [.|\s|\S]* + Truck
When I execute this match, I get one result, beginning at 1 Order and stopping at the second Truck:
1 Order
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
.
.
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck
I want (in this case) exactly two matches which only include one order's contents:
1 Order
order1 stuff
order1 stuff
etc
ShippingMethod: Truck
2 Order
order2 stuff
order2 stuff
etc
ShippingMethod: Truck
I hope it's clear what I am looking for. Any help is greatly appreciated.
Thanks in advance, stay safe, stay healthy!
Things you might suggest:
The solution is deceptively simple - use the non-greedy operator ?.
To begin with, a character class [] matches any ONE of the characters inside it, so to match 'a' or 'b' the regex is [ab] and not [a|b] (inside a class, | is a literal pipe character, not alternation). So the content part of your pattern should be [.\s\S].
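Also, \s and \S match all whitespace and all non-whitespace characters respectively, so together they already cover every character. The period (which inside a character class is just a literal '.') is redundant here.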
So the final content part should look like this: [\s\S]*
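A minimal sketch of the two character-class points above, using only the standard re module. The sample strings and variable names are made up for illustration:

```python
import re

# Inside [...], '|' is a literal pipe, not alternation,
# so [a|b] also matches the pipe character itself.
with_pipe = re.findall(r'[a|b]', 'a|b')
without_pipe = re.findall(r'[ab]', 'a|b')
print(with_pipe)     # ['a', '|', 'b']
print(without_pipe)  # ['a', 'b']

# [\s\S] matches any single character, newlines included,
# so [\s\S]* can span several lines.
any_char = re.fullmatch(r'[\s\S]*', 'line 1\nline 2')
print(any_char is not None)  # True
```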
The non-greedy ? modifier, placed after a normal quantifier like +, * or ?, tells the regex to match as few repetitions as possible. With a bare *, you're using the default greedy version of zero-or-more, telling the regex to match as many as possible (which is why the match runs past the first Truck, the one you want it to stop at, all the way to the last one!)
So we add the non-greedy ? right after the *, and the final regex looks like this:
\d\s*Order[\s\S]*?Truck
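To see the difference between the greedy and non-greedy versions, here is a small sketch. The input_text below is a stand-in modeled on the question (the order contents and the 'header'/'filler' lines are placeholders, not real data):

```python
import re

# Placeholder input shaped like the text in the question.
input_text = (
    "header\n"
    "1 Order\norder1 stuff\norder1 stuff\netc\nShippingMethod: Truck\n"
    "filler\n"
    "2 Order\norder2 stuff\norder2 stuff\netc\nShippingMethod: Truck\n"
    "Order Summary\n"
)

# Greedy: [\s\S]* grabs as much as possible, so the single match
# runs from '1 Order' to the LAST 'Truck'.
greedy = re.findall(r'\d\s*Order[\s\S]*Truck', input_text)

# Non-greedy: [\s\S]*? stops at the first 'Truck' after each start.
lazy = re.findall(r'\d\s*Order[\s\S]*?Truck', input_text)

print(len(greedy))  # 1
print(len(lazy))    # 2
print(lazy[0])
```

The trailing 'Order Summary' line is not picked up in either case, because the pattern requires a digit before 'Order'.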
The character class [\s\S] is a neat way to tell the regex to match EVERY CHARACTER (because every character is either whitespace or not whitespace). But it turns out there's a cleaner way to get the same effect: the re.DOTALL flag. It does what it says - it tells the regex that . (the DOT) should match ALL characters, including newlines.
If this is the code you were using:
re.findall(r'\d\s*Order[\s\S]*?Truck', input_text)
then here is the equivalent code using re.DOTALL, which is also the complete solution to the question:
re.findall(r'\d\s*Order.*?Truck', input_text, re.DOTALL)
As you can see, the .*? will now match everything (including newlines) from Order up to the first Truck that follows it.
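Since the question mentions iterating over the results afterwards, here is a short sketch of that step. It assumes a made-up input_text, and the compiled pattern name is just for illustration. It also checks that the re.DOTALL version and the [\s\S] version give the same matches, and that without the flag the dot does not cross line breaks:

```python
import re

# Placeholder input shaped like the text in the question.
input_text = (
    "1 Order\nstuff\nShippingMethod: Truck\n"
    ".\n"
    "2 Order\nstuff\nShippingMethod: Truck\n"
    "Order Summary\n"
)

pattern = re.compile(r'\d\s*Order.*?Truck', re.DOTALL)

# Without re.DOTALL, '.' stops at newlines, so nothing matches here.
without_flag = re.findall(r'\d\s*Order.*?Truck', input_text)

with_flag = pattern.findall(input_text)
same_as_class = re.findall(r'\d\s*Order[\s\S]*?Truck', input_text)

print(without_flag)                # []
print(with_flag == same_as_class)  # True

# finditer yields match objects, handy when positions are needed too.
for match in pattern.finditer(input_text):
    print(match.start(), repr(match.group()))
```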