
RegEx: Python (findall). Order of elements in OR statement resulting in different output


I am trying to get my head around regular expressions and was playing with some examples to see what they produce. I am struggling to understand how the order of the elements in an OR (|) affects the output of the following code:


import re 
uni = "University of Sheffield" 
first = re.findall(".*U|S.*U|S",uni) 
second = re.findall(".*U|S.*S|U",uni)
third = re.findall(".*S|U.*S|U",uni)

If I print the first, second and third variables, I get the following:

first -> ['U', 'S']
second -> ['U']
third -> ['University of S']

I don't understand why the output for each is the way it is. I assumed it would be the same for all three, namely ['University of S']. Could someone help me understand why each of these 3 cases is interpreted differently?

Thank you!


Solution

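  • It has to do with the precedence of the OR operator (|).

    Alternation has the lowest precedence in a regex, so each | splits the whole pattern into separate alternatives. At every position in the string the alternatives are tried from left to right, the first one that matches wins, and the search resumes after that match. Your 3 expressions are therefore read as follows: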
    .*U OR S.*U OR S
    .*U OR S.*S OR U
    .*S OR U.*S OR U

    This means that for the first one, your code does find anything/nothing followed by a U (.*U): the .* matches nothing and the U matches the first character. It never finds an S followed by anything/nothing followed by a U (S.*U), because no U comes after the S. Later in the string it does find an S (S). Hence the result, ["U", "S"].

    Similarly, for the second expression, your code does find anything/nothing followed by a U (.*U). It does not find an S followed by anything/nothing followed by another S (S.*S), because the string contains only one uppercase S. It also does not find a second U (U). Hence the result, ["U"].

    For the third expression, your code does find anything/nothing followed by an S (.*S), right at the start of the string: the greedy .* consumes 'University of ' and the S then matches the S of Sheffield. The other two alternatives (U.*S and U) are never needed, and nothing after that S matches any alternative. Hence the result ["University of S"].
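To check which alternative actually produced each match, here is a minimal sketch that wraps every alternative in its own named group and reads `lastgroup` from `re.finditer`. The helper `trace` and the group names `alt1`, `alt2`, `alt3` are made up for this illustration; they are not part of the original code.

```python
import re

uni = "University of Sheffield"

def trace(alternatives, text):
    """Return (matched_text, name_of_alternative) for every match in text."""
    # Give each alternative its own named group; lastgroup then names the branch that matched.
    pattern = "|".join(f"(?P<alt{i}>{alt})" for i, alt in enumerate(alternatives, 1))
    return [(m.group(0), m.lastgroup) for m in re.finditer(pattern, text)]

first = trace([".*U", "S.*U", "S"], uni)
second = trace([".*U", "S.*S", "U"], uni)
third = trace([".*S", "U.*S", "U"], uni)

print(first)   # [('U', 'alt1'), ('S', 'alt3')]
print(second)  # [('U', 'alt1')]
print(third)   # [('University of S', 'alt1')]
```

In all three cases the match at the start of the string comes from the first alternative; only the lone S in the first expression comes from a later one.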

    I assume you meant your expression to be:
    .* (U OR S) .* (U OR S)

    To write this as valid regex, it should be:

    .*(?:U|S).*(?:U|S)
    

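    You can also do it with capturing groups (...) instead of non-capturing groups (?:...), but note that re.findall then returns the contents of the groups rather than the whole match.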

    However, best practice in this case would be to use a character class. It is written with square brackets and matches any one of the characters inside. In this example, it would be:

    .*[US].*[US]
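As a quick comparison, the sketch below runs the three variants (non-capturing groups, capturing groups, character class) through `re.findall` on the string from the question. The variable names are chosen only for this illustration.

```python
import re

uni = "University of Sheffield"

# Non-capturing groups: findall returns the whole match.
non_capturing = re.findall(r".*(?:U|S).*(?:U|S)", uni)

# Capturing groups: findall returns a tuple of the captured groups instead.
capturing = re.findall(r".*(U|S).*(U|S)", uni)

# Character class: same match as the non-capturing version, with less syntax.
char_class = re.findall(r".*[US].*[US]", uni)

print(non_capturing)  # ['University of S']
print(capturing)      # [('U', 'S')]
print(char_class)     # ['University of S']
```

The capturing version matches the same text, but `findall` reports only what the groups captured, which is why the non-capturing form or the character class is the safer choice here.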