
Why doesn't Python's regex search method consistently return the whole match?


I am doing a practice question on a Regex course:

How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:

My code is as follows:

regex=re.compile(r'Alice|Bob|Carol\seats|pets|throws\sapples\.|cats\.|baseballs\.',re.IGNORECASE)
mo=regex.search(str)
ma=mo.group()

When I pass str = 'BOB EATS CATS.' or 'Alice throws Apples.', mo.group() returns only 'BOB' or 'Alice' respectively, but I was expecting it to return the whole sentence.

When I pass str='Carol throws baseballs.', mo.group() returns 'baseballs.', which is the last match.

I am confused as to why:


Solution

  • You need to group each list of options. The | operator has the lowest precedence in a regex, so without groups the whole pattern is read as one long list of alternatives, some of which happen to contain \s. The easiest fix is a capture group for each word:

    regex=re.compile(r'(Alice|Bob|Carol)\s+(eats|pets|throws)\s+(apples|cats|baseballs)\.', re.IGNORECASE)
    

    The trailing period is the same for every option, so it belongs outside the groups rather than inside each alternative. If you don't want capturing groups, you can use non-capturing groups instead by replacing (...) with (?:...). This does not change what is matched, only whether each word is saved as a group.

    Your original regex was interpreted as the following set of options:

      • Alice
      • Bob
      • Carol\seats
      • pets
      • throws\sapples\.
      • cats\.
      • baseballs\.

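    Spaces don't magically separate options. In 'BOB EATS CATS.', the option Bob matches at the very start of the string, so the search stops there. In 'Carol throws baseballs.', no part of the sentence except 'baseballs.' appears in that list, so that is the only thing that can match. Something like 'Carol eats baseballs.' would match 'Carol eats', though.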
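
To check this behaviour directly, here is a minimal sketch that runs the pattern from the question and the grouped pattern from the answer against the same sentences. The variable names (original, grouped, non_capturing, sentences) are made up for illustration, and the non-capturing pattern is just the grouped one with (...) swapped for (?:...).

```python
import re

# Pattern from the question: '|' binds loosest, so this is one flat
# list of seven alternatives, not three lists of three.
original = re.compile(
    r'Alice|Bob|Carol\seats|pets|throws\sapples\.|cats\.|baseballs\.',
    re.IGNORECASE,
)

# Pattern from the answer: each word gets its own group of alternatives.
grouped = re.compile(
    r'(Alice|Bob|Carol)\s+(eats|pets|throws)\s+(apples|cats|baseballs)\.',
    re.IGNORECASE,
)

# Same match, but the individual words are not saved as groups.
non_capturing = re.compile(
    r'(?:Alice|Bob|Carol)\s+(?:eats|pets|throws)\s+(?:apples|cats|baseballs)\.',
    re.IGNORECASE,
)

sentences = [
    'BOB EATS CATS.',
    'Alice throws Apples.',
    'Carol throws baseballs.',
    'Carol eats baseballs.',
]

for sentence in sentences:
    flat = original.search(sentence)
    whole = grouped.search(sentence)
    # flat.group() is only the first alternative that matched;
    # whole.group() is the full sentence.
    print(repr(flat.group()), '|', repr(whole.group()), '|', whole.groups())

# 'BOB'        | 'BOB EATS CATS.'          | ('BOB', 'EATS', 'CATS')
# 'Alice'      | 'Alice throws Apples.'    | ('Alice', 'throws', 'Apples')
# 'baseballs.' | 'Carol throws baseballs.' | ('Carol', 'throws', 'baseballs')
# 'Carol eats' | 'Carol eats baseballs.'   | ('Carol', 'eats', 'baseballs')
```

Note that group() returns the text as it appears in the input, so the first sentence gives 'BOB' rather than 'Bob'. With non_capturing, group() returns the same full sentence but groups() is an empty tuple.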