[SOLVED] Parse the URL components of a square braced placeholder in a string

I have this pattern (I am using PHP):

'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)\]/i'

When I run it against this string: [link=http://phpquest.zapto.org/users/register.php]

The matches are (in order, 0-5):

'[link=http://phpquest.zapto.org/users/register.php]'
'http://phpquest.zapto.org/users/register.php'
'http://'
'phpquest.zapto'
'org'
''

When I replace the * with + inside the last subpattern, like this:

'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]+)*\/?)\]/i'

The matches are (in order, 0-5):

'[link=http://phpquest.zapto.org/users/register.php]'
'http://phpquest.zapto.org/users/register.php'
'http://'
'phpquest.zapto'
'org'
'/users/register.php'

Can someone help me understand the difference?

Solution

A simpler example may make this clearer. Compare these two regexes:

(a*)*

and

(a+)*

And the test string is aaaaaa.

In (a*)*, the group first captures the whole run of a's. The outer * then tries another iteration of the group. There are no more a's to consume, but the iteration can still succeed, because a* is allowed to match zero characters.

So after matching all the a's, the group matches once more and captures an empty string. A repeated capture group keeps only its last iteration, so you get '' as the content of the capture group.
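
A quick way to check this in PHP is the minimal sketch below. It assumes plain preg_match with PHP's bundled PCRE engine; the variable names are only illustrative.

```php
<?php
// Simplified subject from the example above.
$subject = 'aaaaaa';

// The inner a* consumes every "a"; the outer * then runs the group once
// more, and that extra iteration matches the empty string.
preg_match('/(a*)*/', $subject, $m);

var_dump($m[0]); // string(6) "aaaaaa"  (the overall match)
var_dump($m[1]); // string(0) ""        (the last iteration of the group)
```

The overall match still covers all six characters; only the content of the capture group is overwritten by the final empty iteration.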

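In (a+)*, after the group matches and captures aaaaaa, another iteration is impossible: + requires at least one a, so the group cannot match the empty string the way a* can. The last successful iteration is therefore aaaaaa, and that is what the group holds. The same applies to the last subpattern in your regex: with *, a final empty iteration overwrites /users/register.php; with +, it cannot.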
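
The sketch below reruns the comparison in PHP, first on aaaaaa and then on the pattern from the question. The $head/$tail split, the array labels and the third variant (the inner * with no outer quantifier) are assumptions added here for illustration, not part of the original question or answer.

```php
<?php
// (a+)* cannot run an empty iteration, so the group keeps the full run.
preg_match('/(a+)*/', 'aaaaaa', $simple);
var_dump($simple[1]); // string(6) "aaaaaa"

// Subject and pattern pieces taken from the question.
$subject = '[link=http://phpquest.zapto.org/users/register.php]';
$head = '/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})';
$tail = '\/?)\]/i';

// Three versions of the last subpattern (group 5).
$pathGroups = [
    'inner *, outer *'  => '([\/\w \.-]*)*', // original pattern
    'inner +, outer *'  => '([\/\w \.-]+)*', // pattern after the change
    'inner *, no outer' => '([\/\w \.-]*)',  // hypothetical variant
];

$results = [];
foreach ($pathGroups as $label => $group) {
    // Every variant matches this subject; only group 5 differs.
    preg_match($head . $group . $tail, $subject, $m);
    $results[$label] = $m[5];
}

var_dump($results['inner *, outer *']);  // string(0) ""
var_dump($results['inner +, outer *']);  // string(19) "/users/register.php"
var_dump($results['inner *, no outer']); // string(19) "/users/register.php"
```

Since the character class already accepts slashes, a single greedy run covers the whole path, so the outer quantifier is not needed for this input. Dropping it is one way to avoid the empty last iteration.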