php regex placeholder text-parsing url-parsing

Parse the URL components of a square-bracketed placeholder in a string


I have this pattern (I am using PHP):

'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)\]/i'

When I run it against this string: [link=http://phpquest.zapto.org/users/register.php]

The matches are (indexes 0-5):

  0. '[link=http://phpquest.zapto.org/users/register.php]'
  1. 'http://phpquest.zapto.org/users/register.php'
  2. 'http://'
  3. 'phpquest.zapto'
  4. 'org'
  5. ''

When I replace the * with + inside the last subpattern, like this:

'/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]+)*\/?)\]/i'

The matches are (indexes 0-5):

  0. '[link=http://phpquest.zapto.org/users/register.php]'
  1. 'http://phpquest.zapto.org/users/register.php'
  2. 'http://'
  3. 'phpquest.zapto'
  4. 'org'
  5. '/users/register.php'

Can someone help me understand the difference?


Solution

  • A simpler pair of regexes shows the same behaviour and is easier to follow.

    The regexes involved are:

    (a*)*
    

    and

    (a+)*
    

    And the test string is aaaaaa.

    In (a*)*, the inner group first matches and captures the whole run of a's. The outer * then tries another iteration. No a's are left, but the attempt still succeeds, because a* is allowed to match zero characters.

    That final iteration therefore captures the empty string. A repeated capture group keeps only what its last iteration matched, so the group ends up holding '' rather than aaaaaa.

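    In (a+)*, after matching and capturing aaaaaa, the outer * again tries another iteration, but a+ needs at least one a and fails. No empty capture overwrites the group, so aaaaaa remains its value. Your pattern behaves the same way: ([\/\w \.-]*)* ends with an empty iteration that leaves group 5 as '', while ([\/\w \.-]+)* keeps '/users/register.php'.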
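
The explanation above can be checked with a short script. This is only a sketch: it assumes a PHP build whose PCRE behaves as described in this thread, and the variable names ($star, $plus, $base, $withStar, $withPlus, $single) are made up for illustration. The last call is a hypothetical variant, not something from the question: it drops the outer quantifier, since the inner * already repeats the character class.

```php
<?php
// Minimal case: same subject, only the inner quantifier differs.
preg_match('/(a*)*/', 'aaaaaa', $star);
preg_match('/(a+)*/', 'aaaaaa', $plus);

var_dump($star[1]); // the last iteration of (a*) matched the empty string
var_dump($plus[1]); // (a+) cannot match empty, so the full run is kept

// The same effect with the patterns from the question.
$subject = '[link=http://phpquest.zapto.org/users/register.php]';
$base = '/\[link\=((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})';

preg_match($base . '([\/\w \.-]*)*\/?)\]/i', $subject, $withStar);
preg_match($base . '([\/\w \.-]+)*\/?)\]/i', $subject, $withPlus);

// Hypothetical variant: no outer quantifier, so group 5 is filled exactly once.
preg_match($base . '([\/\w \.-]*)\/?)\]/i', $subject, $single);

var_dump($withStar[5], $withPlus[5], $single[5]);
```

If all you need is the path in group 5, either the + version or the variant without the outer quantifier should do; nesting one unbounded quantifier inside another is also a common source of excessive backtracking on subjects that fail to match.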