I need to fashion a regex with the following requirements:
Given sample text:
SEARCH_TERM_#1 find this text SEARCH-TERM_#2_more text_SEARCH-TERM_#3
SEARCH_TERM_#1 find this text SEARCH-TERM_#3
I want to extract the string that appears in the "find this text" area.
The regex should collect the data after SEARCH_TERM_#1, up to but not including SEARCH_TERM_#2 or SEARCH-TERM_#3, whichever comes first. In other words, the right-side border of the match should be whichever of #2 and #3 it finds first.
I've tried (?>SEARCH_TERM_#2|SEARCH_TERM_#3), (?=(?>SEARCH_TERM_#2|SEARCH_TERM_#3)), and (?>(?=SEARCH_TERM_#2)|(?=SEARCH_TERM_#3)). They ALL include the second search term in the collected data and stop before the third, while I want the collected data to stop before #2 or #3, whichever comes first.
This regular expression will:
- find the first SEARCH_TERM_#1
- capture the text after SEARCH_TERM_#1
- stop before, and not include, SEARCH_TERM_#2 or SEARCH_TERM_#3 (whichever comes first)

^.*?SEARCH_TERM_\#1((?:(?!SEARCH-TERM_\#2|SEARCH-TERM_\#3).)*)
- ^ : match the beginning of the string; this forces the search to start at the beginning.
- .*? : lazily match all characters up to the next expression. Note that this term should be used in conjunction with the s option, which allows the dot to match newline characters.
- SEARCH_TERM_\#1 : the first search term.
- ( : start the capture group; this set of parentheses puts the matched text into capture group 1.
- (?: : start a non-capture group. This is the real magic: it allows the contained expression to keep matching until it stumbles on either SEARCH-TERM_\#2 or SEARCH-TERM_\#3.
- (?! : start the negative lookahead. Think of the regex engine as moving a cursor through the input string. The lookahead simply looks at the characters after the cursor without moving the cursor. "Negative" means that if the contained expression matches at that point, the match is denied; if it does not match, the match is allowed.
- SEARCH-TERM_\#2|SEARCH-TERM_\#3 : look for either value. The | is an "or" operator.
- ) : close the negative lookahead.
- . : match any single character. The expression only gets to this spot if the preceding negative lookahead didn't find its search terms.
- ) : close the non-capture group. At this point either the search has stopped because it encountered the #2 or #3 end condition, or the non-capture group matched a single character.
- * : repeat the non-capture group greedily, so it keeps consuming characters. Greedy is safe here because the end condition is contained inside the group.
- ) : close the capture group.
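The remark about greediness can be checked with a small sketch. Assuming Python (the question names no language), the snippet below compares the lookahead-inside-the-group approach with a greedy `.*` followed by a lookahead placed outside the repeated part, and with a lazy `.*?` variant. The variable names are illustrative, and the greedy pattern is only a guess at why lookahead-based attempts like the ones in the question ran past the second term; the full attempted expressions were not shown.

```python
import re

# First sample line from the question.
text = "SEARCH_TERM_#1 find this text SEARCH-TERM_#2_more text_SEARCH-TERM_#3"
END = r"SEARCH-TERM_#2|SEARCH-TERM_#3"

# End condition checked before every character: greedy * is safe.
tempered = re.search(rf"SEARCH_TERM_#1((?:(?!{END}).)*)", text).group(1)

# End condition after a greedy .*: the engine backtracks only as far as
# the LAST terminator, so the second term ends up inside the capture.
greedy = re.search(rf"SEARCH_TERM_#1(.*)(?={END})", text).group(1)

# Lazy .*? with the same lookahead stops at the FIRST terminator.
lazy = re.search(rf"SEARCH_TERM_#1(.*?)(?={END})", text).group(1)

print(repr(tempered))
print(repr(greedy))
print(repr(lazy))
```

The tempered and lazy forms capture the same text here; the greedy form is the one that swallows SEARCH-TERM_#2.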
You didn't specify a language so I'm including this PHP example only to show how it works.
Input Text
skip this text SEARCH_TERM_#1 find this text SEARCH-TERM_#2 more text to ignore SEARCH_TERM_#3
Code
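<?php
// Input text from above.
$sourcestring = "skip this text SEARCH_TERM_#1 find this text SEARCH-TERM_#2 more text to ignore SEARCH_TERM_#3";
// Flags: i = case-insensitive, m = multiline anchors, s = dot matches newlines.
preg_match('/^.*?SEARCH_TERM_\#1((?:(?!SEARCH-TERM_\#2|SEARCH-TERM_\#3).)*)/ims', $sourcestring, $matches);
// $matches[0] is the whole match; $matches[1] is capture group 1.
echo "<pre>" . print_r($matches, true);
?>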
Matches
$matches Array:
(
[0] => skip this text SEARCH_TERM_#1 find this text
[1] => find this text
)
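For comparison, here is a rough Python equivalent run against the two sample lines from the question. This is only a sketch: the `extract` helper is a made-up name, `re.DOTALL` stands in for the s option, and the capture is stripped of surrounding spaces for readability (the raw capture keeps them).

```python
import re

# Same expression as above; re.DOTALL lets the dot match newlines.
PATTERN = re.compile(
    r"^.*?SEARCH_TERM_\#1((?:(?!SEARCH-TERM_\#2|SEARCH-TERM_\#3).)*)",
    re.DOTALL,
)

samples = [
    "SEARCH_TERM_#1 find this text SEARCH-TERM_#2_more text_SEARCH-TERM_#3",
    "SEARCH_TERM_#1 find this text SEARCH-TERM_#3",
]


def extract(text):
    """Return the text between term #1 and whichever of #2/#3 comes first."""
    match = PATTERN.search(text)
    return match.group(1).strip() if match else None


results = [extract(sample) for sample in samples]
print(results)
```

Both lines yield the same capture, since the match stops at whichever end term appears first.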
Or to use your real world example included in the comments:
Regex: ^.*?style="background-image: url\(((?:(?!&cfs=1|\)).)*)
Input text: <a href=http://i.like.kittens.com style="background-image: url(http://I.like.kittens.com?Name=Boots&cfs=1)">
Matches:
[0] => <a href=http://i.like.kittens.com style="background-image: url(http://I.like.kittens.com?Name=Boots
[1] => http://I.like.kittens.com?Name=Boots
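As a quick check of both end conditions, the sketch below (Python, chosen only for illustration) runs that expression on the input above and on a variant without `&cfs=1`, where the closing parenthesis becomes the border instead. The second input string is an assumption added for the demonstration.

```python
import re

PATTERN = re.compile(
    r'^.*?style="background-image: url\(((?:(?!&cfs=1|\)).)*)',
    re.DOTALL,
)

with_cfs = ('<a href=http://i.like.kittens.com '
            'style="background-image: url(http://I.like.kittens.com?Name=Boots&cfs=1)">')
# Hypothetical variant: no &cfs=1, so the ")" ends the capture.
without_cfs = ('<a href=http://i.like.kittens.com '
               'style="background-image: url(http://I.like.kittens.com?Name=Boots)">')

url_a = PATTERN.search(with_cfs).group(1)
url_b = PATTERN.search(without_cfs).group(1)
print(url_a)
print(url_b)
```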
This looks vaguely like a common problem in parsing HTML with regex. If your input text is HTML, you should investigate using an HTML parsing tool rather than a regular expression.
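As one possible illustration of the parser route, here is a minimal sketch using Python's standard-library `html.parser`. The `StyleUrlCollector` class is a hypothetical helper, not an existing API. The parser locates the style attribute, and a small regex is then applied only to that attribute's value rather than to the whole document.

```python
import re
from html.parser import HTMLParser


class StyleUrlCollector(HTMLParser):
    """Hypothetical helper: collect url(...) values from style attributes."""

    # Capture up to "&cfs=1" or ")", whichever comes first.
    URL = re.compile(r"url\(((?:(?!&cfs=1|\)).)*)")

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by the parser.
        for name, value in attrs:
            if name == "style" and value:
                self.urls.extend(self.URL.findall(value))


collector = StyleUrlCollector()
collector.feed('<a href=http://i.like.kittens.com '
               'style="background-image: url(http://I.like.kittens.com?Name=Boots&cfs=1)">')
collector.close()
print(collector.urls)
```

The parser takes care of tag and attribute boundaries, so the regex no longer has to anchor on `style="..."` itself.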