I am using the following regex:
((FFD8FF).+?((FFD9)(?:(?!FFD8).)*))
I need to do the following with regex: find FFD8FF, then find the last FFD9 that comes before the next FFD8FF, stop at that last FFD9, and not include any content after it.
What I've got does what I need, except that it finds and keeps any junk after the last FFD9. How can I get it to jump back to the last FFD9?
Here's the string that I'm searching with this expression:
asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9
Thanks a lot for your help.
More info:
I have a list of start and end values I need to search for (FFD8FF and FFD9 are just one pair). Because of this, I'm using re.compile to build the expression dynamically in a for loop that goes through the different values. I have the following code, but it is returning 0 matches:
regExp = re.compile(r"FFD8FF(?:[^F]|F(?!FD8FF))*FFD9")
matchObj = re.findall(regExp, contents)
In the above code, I'm just trying to use the plain regex without even getting the values from the list (that would look like this):
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1])
Any other ideas why there aren't any matches?
EDIT:
I figured out that I forgot to include flags. The ignore-case and multiline flags are now included. I now have
regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1],re.M|re.I)
However, now I'm getting a memory error. Is there any way to make this more efficient? I am using the expression to search hundreds of thousands of lines (using the findall call above).
An easy way is to use this:
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9
Explanation:
FFD8FF
(?: # this group describes the content allowed between the "anchors"
[^F] # any character that is not an "F"
| # OR
F(?!FD8FF) # an "F" not followed by "FD8FF"
)* # repeat (greedy)
FFD9 # the last FFD9 before the next FFD8FF
Even though a greedy quantifier is used for the group, the regex engine will backtrack until "FFD9" can match, which gives the last "FFD9" substring before the next FFD8FF.
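As a quick check, the pattern can be run against the sample string from the question. This is only a minimal sketch; the variable names (junk, first_block, second_block, sample) are illustrative and simply split the question's string into its parts.

```python
import re

# The question's sample string, split into its parts for readability.
junk = 'asdfasdfasasda'
first_block = 'FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9'
second_block = 'FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
sample = junk + first_block + 'asdfasdf' + second_block

pattern = re.compile(r'FFD8FF(?:[^F]|F(?!FD8FF))*FFD9')

# The pattern has no capture groups, so findall returns the whole matches.
matches = pattern.findall(sample)
for match in matches:
    print(match)
```

Each match should start at an FFD8FF and end at the last FFD9 before the next FFD8FF (or before the end of the string), without the junk that follows.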
If you want to ensure that another FFD8FF follows the match, you can add a lookahead at the end of the pattern (note that . does not match newlines unless the re.S flag is set):
FFD8FF(?:[^F]|F(?!FD8FF))*FFD9(?=.*?FFD8FF)
You can optimize this pattern by emulating an atomic group, which limits backtracking and allows a quantifier to be used inside the group:
FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\1)*FFD9
This trick relies on the fact that the content of a lookahead is atomic once its closing parenthesis has been reached. So if you put a capture group inside a lookahead and place the corresponding backreference right after it, you obtain an "atom" (an indivisible substring). When the regex engine needs to backtrack, it backtracks atom by atom instead of character by character, which can be much faster.
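To see that the atomic emulation finds the same text as the simpler pattern, the two can be compared on a small input. This is a rough sketch with made-up test data (the names simple, atomic and data are illustrative). Because the second pattern contains a capture group, re.findall would return that group's content rather than the whole match, so finditer with group(0) is used instead.

```python
import re

simple = re.compile(r'FFD8FF(?:[^F]|F(?!FD8FF))*FFD9')
atomic = re.compile(r'FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\1)*FFD9')

# Made-up input: two start markers, the first followed by two end markers.
data = 'xxFFD8FFabcFFD9defFFD9junkFFD8FFghiFFD9tail'

# group(0) is always the whole match, regardless of capture groups.
simple_matches = [m.group(0) for m in simple.finditer(data)]
atomic_matches = [m.group(0) for m in atomic.finditer(data)]

print(simple_matches == atomic_matches)
print(atomic_matches)
```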
If you need a capture group before this construct, don't forget to update the number of the backreference, for example:
(FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\2)*FFD9)
(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)
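In Python, one way to avoid renumbering is to give the inner group a name and refer to it with (?P=name). The following is a sketch of that idea on made-up data; the group name atom is an arbitrary choice for illustration.

```python
import re

# Same structure as the second example above, but the backreference is
# written by name, so adding outer groups does not require renumbering it.
p = re.compile(
    r'(FFD8FF((?:(?=(?P<atom>[^F]+|F(?!FD8FF)))(?P=atom))*)FFD9)'
)

data = 'xxFFD8FFabcFFD9defFFD9junkFFD8FFghiFFD9tail'
m = p.search(data)

whole = m.group(1)   # start marker through the last end marker
inner = m.group(2)   # content between the two markers
print(whole)
print(inner)
```

The named group still receives a number (here it is group 3), but the pattern itself no longer depends on it.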
Working example:
>>> import re
>>> yourstr = 'asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
>>> p = re.compile(r'(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)(?=.*?FFD8FF)')
>>> re.findall(p, yourstr)
[('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9', 'asdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdf', 'D9asdflasdflasdf')]
Variant:
(FFD8FF((?:(?=(F(?!FD8FF)[^F]*|[^F]+))\3)*)FFD9)(?=.*?FFD8FF)
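For the list of start/end pairs mentioned in the question, the same pattern can be built programmatically. What follows is only a sketch under some assumptions: build_pattern and signatures are hypothetical names, each start value is assumed to be at least two characters long, and re.escape is applied in case a value contains regex metacharacters. Iterating with finditer avoids building one large list of tuples at once, which may help with memory on large inputs, though that is not measured here. Since the pattern uses neither ^ nor $, the re.M flag should not be needed, and the negated class [^F] already matches newlines.

```python
import re

def build_pattern(start, end, flags=0):
    # Hypothetical helper: the start marker, then "atoms" that never begin
    # a new start marker (the emulated atomic group), then the end marker.
    first = re.escape(start[0])
    rest = re.escape(start[1:])
    body = '(?:(?=([^' + first + ']+|' + first + '(?!' + rest + ')))\\1)*'
    return re.compile(re.escape(start) + body + re.escape(end), flags)

# Assumed list of (start, end) pairs; the question's pair is the only entry
# here, and further pairs would be added to this list.
signatures = [('FFD8FF', 'FFD9')]

# Made-up input standing in for the real file contents.
contents = 'xxFFD8FFabcFFD9defFFD9junkFFD8FFghiFFD9tail'

found = []
for start, end in signatures:
    pattern = build_pattern(start, end)
    for m in pattern.finditer(contents):
        # group(0) is the whole match; start() is its offset in contents.
        found.append((m.start(), m.group(0)))

print(found)
```

Flags such as re.I can be passed as the third argument of build_pattern if the data mixes upper and lower case.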