I have a string of words, here broken into lines to better visualize the repeating pattern:
Saint John eats less of those apples.
Saint Paul eats more of those berries.
Saint Luke eats those oranges.
From this string I need to extract all names, all fruits, and all the quantifiers. The result should be:
Array
(
    [0] => Array
        (
            [0] => Saint John eats less of those apples.
            [1] => John
            [2] => less
            [3] => apples
        )
    [1] => Array
        (
            [0] => Saint Paul eats more of those berries.
            [1] => Paul
            [2] => more
            [3] => berries
        )
    [2] => Array
        (
            [0] => Saint Luke eats those oranges.
            [1] => Luke
            [2] =>
            [3] => oranges
        )
)
I have gotten as far as:
preg_match_all("|Saint (.+?) eats (.+?) of those (.+?).|", $string, $matches);
But this of course doesn't find the last (partial) match. How can I rephrase my regular expression to find it?
Notes
In the real string, there is more non-repeating text before, between, and after the repeating pattern. E.g.:
The apples have worms. That is why Saint John eats less of those apples. Unfortunately Saint John dislikes berries. Unlike Saint Paul. Saint Paul eats more of those berries. When John and Paul are gone, Saint Luke eats those oranges. Afterward, he is still hungry.
Unlike this related question, I don't want to optionally match all of the missing part, but only part of the missing part!
You may use this regex in PHP with a non-capturing optional group (use the m modifier so that ^ matches at the start of each line):
^Saint\h+(\w+)\h+eats(?:\h+(\w+)\h+of)?\h+those\h+(\w+)
RegEx Details:
^Saint : Match Saint at the start
\h+ : Match 1+ horizontal whitespace
(\w+) : 1st capture group to match 1+ word characters
\h+ : Match 1+ horizontal whitespace
eats : Match eats
(?: : Start non-capture group
    \h+ : Match 1+ horizontal whitespace
    (\w+) : 2nd capture group to match 1+ word characters
    \h+ : Match 1+ horizontal whitespace
    of : Match of
)? : End non-capture group. ? makes this group optional
\h+ : Match 1+ horizontal whitespace
those : Match those
\h+ : Match 1+ horizontal whitespace
(\w+) : 3rd capture group to match 1+ word characters
PHP Code Demo (Thanks to @sin)
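As a minimal sketch of how the pattern above might be applied with preg_match_all, the snippet below runs it against the sample from the question. The variable names and the PREG_SET_ORDER flag are choices made here for illustration and are not taken from the linked demo. It also includes a variant that swaps ^ for \b, on the assumption that the real input is running text (as described in the question's notes), where the sentences do not start on their own lines.

```php
<?php
// Sample from the question, one sentence per line.
$lines = "Saint John eats less of those apples.\n"
       . "Saint Paul eats more of those berries.\n"
       . "Saint Luke eats those oranges.";

// The m modifier lets ^ match at the start of every line.
$anchored = '/^Saint\h+(\w+)\h+eats(?:\h+(\w+)\h+of)?\h+those\h+(\w+)/m';

// PREG_SET_ORDER groups the captures per match, like the desired output.
preg_match_all($anchored, $lines, $byLine, PREG_SET_ORDER);
print_r($byLine);

// Assumption: for running text, a word boundary replaces the ^ anchor.
$text = "The apples have worms. That is why Saint John eats less of those apples. "
      . "Unfortunately Saint John dislikes berries. Unlike Saint Paul. "
      . "Saint Paul eats more of those berries. When John and Paul are gone, "
      . "Saint Luke eats those oranges. Afterward, he is still hungry.";

$unanchored = '/\bSaint\h+(\w+)\h+eats(?:\h+(\w+)\h+of)?\h+those\h+(\w+)/';

preg_match_all($unanchored, $text, $inText, PREG_SET_ORDER);
print_r($inText);
```

In both cases the third match has an empty string in group 2, because the optional group did not take part in the match. Note that the full match at index 0 stops at the fruit and does not include the trailing period, since the pattern ends at (\w+).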
---
Here is another regex solution using the branch reset feature, which is supported by PCRE-style engines (PHP, Perl, etc.) and by the regex module in Python:
^Saint\h+(\w+)\h+eats(?|\h+(\w+)\h+of|())\h+those\h+(\w+)
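One observable difference between the two patterns in PHP can be sketched as follows, under the assumption that the PREG_UNMATCHED_AS_NULL flag (available since PHP 7.2) is passed. With the optional non-capturing group, group 2 is reported as null when the quantifier is missing, whereas the empty () alternative of the branch reset always takes part in the match and yields an empty string. The variable names are illustrative only.

```php
<?php
// Sample from the question, one sentence per line.
$sample = "Saint John eats less of those apples.\n"
        . "Saint Paul eats more of those berries.\n"
        . "Saint Luke eats those oranges.";

// Optional non-capturing group: group 2 may not participate at all.
$optionalGroup = '/^Saint\h+(\w+)\h+eats(?:\h+(\w+)\h+of)?\h+those\h+(\w+)/m';

// Branch reset: both alternatives define group 2, so it always participates.
$branchReset = '/^Saint\h+(\w+)\h+eats(?|\h+(\w+)\h+of|())\h+those\h+(\w+)/m';

// Report non-participating groups as null instead of an empty string.
$flags = PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL;

preg_match_all($optionalGroup, $sample, $optional, $flags);
preg_match_all($branchReset, $sample, $reset, $flags);

var_dump($optional[2][2]); // quantifier slot for "Saint Luke", optional group
var_dump($reset[2][2]);    // quantifier slot for "Saint Luke", branch reset
```

Without PREG_UNMATCHED_AS_NULL, both patterns produce an empty string in that slot, so the choice between them is mostly a matter of taste for this input.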