[SOLVED] Regular expression matches more than expected

Regular expression matches more than expected

Given is the following python script:

text = '<?xml version="1.24" encoding="utf-8">'
mu = (".??[?]?[?]", "....")
for item in mu:
    print item,":",re.search(item, text).group()

Can someone please explain why the first hit with the regex .??[?]?[?] returns <? instead of just ?.

My explaination:

.?? should match nothing as .? can match or not any char and the second ? makes it not greedy.
[?]? can match ? or not, so nothing is good, too
[?] just matches ?

That should result in ? and not in <?

Solution

For the same reason o*?bar matches oobar in foobar. Even if the quantifier is non-greedy the regex will try to match from the first char in all possible ways, before moving on to the next.

First the .?? matches an empty string, but when the regex engine backtracks to it, it matches <, thus making the rest of the regex match, without moving the start position of the match to the next character.