pythonregex

Regular expression matches more than expected


Given is the following python script:

text = '<?xml version="1.24" encoding="utf-8">'
mu = (".??[?]?[?]", "....")
for item in mu:
    print item,":",re.search(item, text).group()

Can someone please explain why the first hit with the regex .??[?]?[?] returns <? instead of just ?.

My explaination:

That should result in ? and not in <?


Solution

  • For the same reason o*?bar matches oobar in foobar. Even if the quantifier is non-greedy the regex will try to match from the first char in all possible ways, before moving on to the next.

    First the .?? matches an empty string, but when the regex engine backtracks to it, it matches <, thus making the rest of the regex match, without moving the start position of the match to the next character.