I have this pattern (I am using php):
'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)\]/i'
When I search for this string: [link=http://phpquest.zapto.org/users/register.php]
The matches are (in order, 0-5):
'[link=http://phpquest.zapto.org/users/register.php]'
'http://phpquest.zapto.org/users/register.php'
'http://'
'phpquest.zapto'
'org'
''
When I replace the * with + inside the last subpattern, like this:
'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]+)*\/?)\]/i'
The matches are (in order, 0-5):
'[link=http://phpquest.zapto.org/users/register.php]'
'http://phpquest.zapto.org/users/register.php'
'http://'
'phpquest.zapto'
'org'
'/users/register.php'
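For reference, here is a minimal script that should reproduce both results. It assumes the subject string is the URL wrapped in [link=...], as match 0 suggests; the variable names ($subject, $withStar, $withPlus) are only for illustration.

```php
<?php
// Subject string, assumed to be the URL wrapped in [link=...].
$subject = '[link=http://phpquest.zapto.org/users/register.php]';

// First pattern: the last subpattern uses * inside the group.
$withStar = '/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)\]/i';

// Second pattern: identical, except the last subpattern uses + inside the group.
$withPlus = '/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]+)*\/?)\]/i';

preg_match($withStar, $subject, $star);
preg_match($withPlus, $subject, $plus);

// Only group 5 differs between the two runs.
var_dump($star[5]); // the empty string listed above
var_dump($plus[5]); // '/users/register.php' as listed above
```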
Can someone help me understand the difference?
A simpler example may help: compare the regexes (a*)* and (a+)* on the test string aaaaaa.
With (a*)*, after the group captures the whole run of a's, the engine tries another iteration of the outer *. There are no more a's to consume, but the inner a* can also match nothing, because * means zero or more times. So after matching all the a's, the group matches and captures an empty string, and since only the last capture of a group is stored, you get '' as the result of the capture group.
In (a+)*, after matching and capturing aaaaaa, the group cannot match or capture anything more (+ prevents it from matching nothing, unlike *), and hence aaaaaa is the last capture.
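The comparison can be checked with a short PHP sketch. It assumes PHP's preg functions (PCRE), since other regex engines may treat an empty iteration differently; the variable names are only for illustration.

```php
<?php
// Same test string for both regexes.
$subject = 'aaaaaa';

// Inner * may match the empty string, so the outer * runs one extra,
// empty iteration, and that empty capture overwrites the previous one.
preg_match('/(a*)*/', $subject, $star);

// Inner + needs at least one "a", so no extra iteration can succeed
// and the first capture is kept.
preg_match('/(a+)*/', $subject, $plus);

var_dump($star[0]); // whole match: "aaaaaa"
var_dump($star[1]); // last capture: ""
var_dump($plus[1]); // last capture: "aaaaaa"
```

In both cases the overall match (index 0) is the full string; only the content of the capture group differs.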