
re.findall gives different results from re.compile(...).findall


Why does re.compile(...).findall not find "um" when "um" is at the beginning of the string? (It works fine if "um" isn't at the beginning of the string, as the last two lines below show.)

>>> import re
>>> s = "um"
>>> re.findall(r"\bum\b", s, re.IGNORECASE)
['um']
>>> re.compile(r"\bum\b").findall(s, re.IGNORECASE)
[]
>>> re.compile(r"\bum\b").findall(s + " foobar", re.IGNORECASE)
[]
>>> re.compile(r"\bum\b").findall("foobar " + s, re.IGNORECASE)
['um']

I would have expected the two options to be identical. What am I missing?


Solution

  • You intended to pass re.IGNORECASE to the compile() function, but in the failing cases you're actually passing it to the findall() method of the compiled pattern. That method takes no flags argument; its second positional argument is pos, an integer giving the position at which the search starts. The flag's value as an integer isn't defined, but happens to be 2 today:

    >>> int(re.IGNORECASE)
    2
    

    Rewrite the code to work as intended, and it's fine; for example:

    >>> re.compile(r"\bum\b", re.IGNORECASE).findall(s + " foobar") # pass to compile()
    ['um']
    

    As originally written, it can't work unless "um" starts at or after position 2:

    >>> re.compile(r"\bum\b").findall(" " + s, re.IGNORECASE)
    []
    >>> re.compile(r"\bum\b").findall("  " + s, re.IGNORECASE) # starts at 2
    ['um']
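
To make the difference between the two call signatures concrete, here is a minimal sketch. It assumes the same pattern and input as the question; the variable names (plain, insensitive, text) are only illustrative. It relies on the compiled pattern's findall() accepting (string, pos, endpos) while the module-level re.findall() accepts (pattern, string, flags).

```python
import re

text = "um foobar"

# Flags belong to compile(); the compiled pattern remembers them.
insensitive = re.compile(r"\bum\b", re.IGNORECASE)
plain = re.compile(r"\bum\b")

# The second positional argument of the compiled pattern's findall() is pos,
# so passing re.IGNORECASE there behaves like passing its integer value.
as_flag = plain.findall(text, re.IGNORECASE)
as_pos = plain.findall(text, int(re.IGNORECASE))

# With the flag given to compile(), the match is found and case is ignored.
lower = insensitive.findall(text)
upper = insensitive.findall("UM foobar")

print(as_flag, as_pos, lower, upper)
```

If you do want to limit where the search starts, passing pos by keyword (for example plain.findall(text, pos=3)) makes the intent explicit and avoids confusing it with a flag.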