I am trying to get my head around regular expressions and was playing with some examples to see what they produce. I am struggling to understand how the order of the elements in an OR (|) affects the output of the following code:
import re
uni = "University of Sheffield"
first = re.findall(".*U|S.*U|S",uni)
second = re.findall(".*U|S.*S|U",uni)
third = re.findall(".*S|U.*S|U",uni)
If I print the first, second and third variables, I get the following:
first -> ['U', 'S']
second -> ['U']
third -> ['University of S']
I don't understand why the output for each is the way it is. I assumed it would be the same in all three cases, namely ['University of S']. Could someone help me understand why it is interpreted differently for each of these 3 cases?
Thank you!
It has to do with the order of operations involving OR (|).
OR has the lowest precedence of the regex operators, so it takes everything on either side of it, not just the neighbouring character. Your 3 expressions are therefore read as follows:
first: .*U OR S.*U OR S
second: .*U OR S.*S OR U
third: .*S OR U.*S OR U
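This means that for the first one, your code does find anything/nothing followed by a U (.*U). The only U is the very first character, so .* matches nothing and the match is just "U". It does not find an S followed by anything/nothing followed by a U (S.*U), because no U comes after the S. Then it does find an S (S). Hence the result, ['U', 'S'].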
Similarly, for the second expression, your code does find anything/nothing followed by a U (.*U). It does not find an S followed by anything/nothing followed by an S (S.*S). Then it does not find a second U (U). Hence the result, ['U'].
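For the third expression, the very first alternative succeeds: starting at the beginning of the string, your code finds anything ('University of ') followed by an S (.*S). Alternatives are tried from left to right and the first one that matches wins, so U.*S and U are never needed at that position. The rest of the string ('heffield') contains neither an S nor a U, so nothing else matches. Hence the result, ['University of S'].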
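To see which alternative produces each match, here is a small sketch. The helper name trace is made up for illustration, and wrapping each alternative in its own capturing group is only a debugging trick (it is not part of the original patterns); m.lastindex then reports which group, and therefore which alternative, matched.

```python
import re

uni = "University of Sheffield"

def trace(pattern):
    """Return (matched text, 1-based number of the alternative that matched)."""
    return [(m.group(0), m.lastindex) for m in re.finditer(pattern, uni)]

# Each alternative sits in its own capturing group, so m.lastindex
# identifies the alternative responsible for the match.
first = trace(r"(.*U)|(S.*U)|(S)")
second = trace(r"(.*U)|(S.*S)|(U)")
third = trace(r"(.*S)|(U.*S)|(U)")

print(first)   # [('U', 1), ('S', 3)]
print(second)  # [('U', 1)]
print(third)   # [('University of S', 1)]
```

In all three cases the first match comes from alternative 1; only the first pattern has a later alternative (S) that can still match further along the string.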
I assume you meant your expression to be:
.* (U OR S) .* (U OR S)
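To write this as valid regex, it should be:
.*(?:U|S).*(?:U|S)
You can also do it with capturing groups (...) instead of non-capturing groups (?:...), but be aware that re.findall then returns the contents of the groups rather than the whole match.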
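As a quick check of the difference between the two kinds of group, the following sketch runs both forms against the string from the question. The variable names are only illustrative.

```python
import re

uni = "University of Sheffield"

# Non-capturing groups: findall returns the whole match.
non_capturing = re.findall(r".*(?:U|S).*(?:U|S)", uni)

# Capturing groups: findall returns a tuple of the groups instead.
capturing = re.findall(r".*(U|S).*(U|S)", uni)

# finditer gives access to the whole match even with capturing groups.
whole = [m.group(0) for m in re.finditer(r".*(U|S).*(U|S)", uni)]

print(non_capturing)  # ['University of S']
print(capturing)      # [('U', 'S')]
print(whole)          # ['University of S']
```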
However, best practice in this case would be to use a character class. It is written with square brackets and matches any single one of the characters listed inside. To use it in this example, it would be:
.*[US].*[US]
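A minimal check of the character-class version. The second string, other, is a made-up example added to show that the class accepts the two letters in either order.

```python
import re

pattern = r".*[US].*[US]"

uni = "University of Sheffield"
other = "Sheffield University"  # hypothetical input, not from the question

result_uni = re.findall(pattern, uni)
result_other = re.findall(pattern, other)

print(result_uni)    # ['University of S']
print(result_other)  # ['Sheffield U']
```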