I ran a few tests to help myself understand non-greedy matching in Python, but the results left me more confused than before. Thank you for the help!
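import re
lan = 'From 000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc fff@ddd eee'
# Raw strings keep \S from being treated as a string escape.
print(re.findall(r'\S+@\S+?', lan))   # 1: greedy, then lazy
print(re.findall(r'\S+@\S+', lan))    # 2: greedy, then greedy
print(re.findall(r'\S+?@\S+?', lan))  # 3: lazy, then lazy
print(re.findall(r'\S+?@\S+', lan))   # 4: lazy, then greedy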
Result:
['000@hhhaaa@stephen.marquard@uct.ac.za@bbb@c', 'fff@d'] # 1
['000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc', 'fff@ddd'] # 2
['000@h', 'hhaaa@s', 'tephen.marquard@u', 'ct.ac.za@b', 'bb@c', 'fff@d'] # 3
['000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc', 'fff@ddd'] # 4
Question:
- Result 1: why does the result show only one d here (@d)?
Because +? is not required to match more than once, so it doesn't.
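As a minimal sketch of that point, the snippet below runs the same lazy quantifier twice on a shortened input (the string `text` is an assumption chosen for illustration, taken from the tail of the question's string). When nothing follows `\S+?` in the pattern, one character is enough; when a space is required after it, it has to keep expanding.

```python
import re

text = 'fff@ddd eee'

# Lazy quantifier at the very end of the pattern: one character satisfies it.
lazy_at_end = re.findall(r'\S+@\S+?', text)

# Same lazy quantifier, but a space must follow, so it has to keep expanding.
lazy_then_space = re.findall(r'\S+@\S+? ', text)

print(lazy_at_end)      # ['fff@d']
print(lazy_then_space)  # ['fff@ddd ']
```

So "lazy" does not mean "always one character"; it means "the fewest characters that still let the rest of the pattern succeed".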
- Result 2: this is normal, very clear.
- Result 3: very confusing. I don't even know how to ask about the logic behind it, especially when compared with 1.
Again, +? matches as many characters as it has to, as opposed to matching as many characters as it can, which is exactly the difference between greedy and non-greedy matching.
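Take \S+?@\S+? matching From 000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc as an example:
- The first \S+? can match up to From, but then the attempt fails because a space follows, not an @.
- Starting again at 000, \S+? matches 000, then the @ matches, then the second \S+? again matches as many \S characters as it has to. It has to match 1 character.
- The match is therefore 000@h.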
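To make the trace concrete, here is a small sketch that uses `re.finditer` to print where each match of the doubly lazy pattern starts and ends in the question's string. The variable name `spans` is just an illustrative choice.

```python
import re

lan = 'From 000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc fff@ddd eee'

# Collect (start, end, matched text) for every match of the doubly lazy pattern.
spans = [(m.start(), m.end(), m.group())
         for m in re.finditer(r'\S+?@\S+?', lan)]

for start, end, matched in spans:
    print(start, end, matched)

# The first line printed is: 5 10 000@h
```

Each match stops one character after an @, and the next search resumes right where the previous match ended, which is why the long token is chopped into several short pieces.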
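- Result 4: this seems to be the same as 2, so why is the ? before @ so 'weak'?
Explained above: the +? before the @ only has to reach the first @, and the greedy \S+ after it then consumes the rest of the token, so the overall match is the same as in 2.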
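One way to see that the ? before the @ still has an effect is to add capture groups. The sketch below is not from the original post; the groups and the names `greedy_first` and `lazy_first` are added purely for illustration. The overall match is identical, but the split around the @ differs.

```python
import re

lan = 'From 000@hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc fff@ddd eee'

# Greedy prefix: grabs as much as it can, so it splits at the LAST @.
greedy_first = re.search(r'(\S+)@(\S+)', lan)

# Lazy prefix: grabs as little as it has to, so it splits at the FIRST @.
lazy_first = re.search(r'(\S+?)@(\S+)', lan)

print(greedy_first.groups())
# ('000@hhhaaa@stephen.marquard@uct.ac.za@bbb', 'ccc')
print(lazy_first.groups())
# ('000', 'hhhaaa@stephen.marquard@uct.ac.za@bbb@ccc')
```

`findall` without groups only reports the whole match, so the difference stays hidden in results 2 and 4.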
Since email addresses can't contain spaces, why bother with non-greedy matching anyway? You could use something as simple as \S+@\S+.
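For instance, on a typical mail-header style line (the sample `line` below is an assumption, not taken from the question), the plain greedy pattern is enough:

```python
import re

# Hypothetical sample line; any whitespace-separated text would do.
line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

addresses = re.findall(r'\S+@\S+', line)
print(addresses)  # ['stephen.marquard@uct.ac.za']
```

Note that this pattern takes everything between the surrounding whitespace, so punctuation attached to the address (such as angle brackets) would be included in the match.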