php, regex, preg-replace, pcre

PHP Perl-compatible regular expression: URL not preceded by an equals sign and an optional single or double quote


I'm trying to create a Perl-compatible regular expression that matches a URL that is not preceded by an equals sign and an optional single or double quote, ignoring whitespace. The code below gives an error: Warning: preg_replace(): Compilation failed: lookbehind assertion is not fixed length at offset 0

I know my URL regular expression isn't perfect, but I'm more focused on how to do the negative lookbehind or how to express this in some other way.

For example, the code below should put http://www.url1.com/ and http://www.url3.com/ in $matches, but none of the other URLs. Instead, it gives the warning above and does not populate the $matches variable. How can I do this?

PHP Code:

$html = "
http://www.url1.com/
= ' http://www.url2.com/
'http://www.url3.com/
<a href='http://www.url4.com/'>Testing1</a>
<img src='https://url5.com'>Testing2</a>";

$url_pregex = '((http(s)?://)[-a-zA-Z()0-9@:%_+.~#?&;//=]+)';
$pregex = '(?<!\\s*=\\s*[\'"]?\\s*)'.$url_pregex;

preg_match_all('`'.$pregex.'`i', $html, $matches);

echo "Matches<br><pre>";
var_export($matches);
echo "</pre>";

The resulting Perl-compatible regex in PHP, using ` instead of / as the delimiter:

'`(?<!\\s*=\\s*[\'"]?\\s*)((http(s)?://)[-a-zA-Z()0-9@:%_+.~#?&;//=]+)`i'

Solution

  • One way to work around this is to use an alternation. The first branch matches URLs that are preceded by = (and an optional quote) without capturing them; the second branch matches any other URL and captures it. This works because the first branch begins matching at the = (or the whitespace before it), so it consumes such a URL before the second branch can reach it. Only URLs that are not preceded by = are left to be captured by the second branch.

    I've removed the capture groups from your $url_pregex for simplicity; if you want to keep them, you'll need to adjust the group index used on $matches in this code (currently 1) to get the complete matches.

    $html = "
    http://www.url1.com/
    = ' http://www.url2.com/
    'http://www.url3.com/
    <a href='http://www.url4.com/'>Testing1</a>
    <img src = 'https://url5.com'>Testing2</a>";
    
    $url_pregex = 'https?://[-a-zA-Z()0-9@:%_+.~#?&;//=]+';
    $pregex = "\\s*=\\s*['\"]?\\s*$url_pregex|($url_pregex)";
    
    preg_match_all('`' . $pregex . '`i', $html, $matches);
    
    echo "Matches<br><pre>";
    var_export(array_values(array_filter($matches[1])));
    echo "</pre>";
    

    Output:

    Matches<br><pre>array (
      0 => 'http://www.url1.com/',
      1 => 'http://www.url3.com/',
    )</pre>
    

    Demo on 3v4l.org

    Note that you need to use preg_match_all to get all matches in the text.
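
If you do want to keep the capture groups in $url_pregex, one option (a sketch, not the only way) is to wrap the second branch in a named group so you don't have to count group numbers. The group name url and the variable $urls are my own choices for illustration; everything else is taken from the code above.

```php
<?php
// Sample input from the answer above.
$html = "
http://www.url1.com/
= ' http://www.url2.com/
'http://www.url3.com/
<a href='http://www.url4.com/'>Testing1</a>
<img src = 'https://url5.com'>Testing2</a>";

// URL pattern from the question, with its inner capture groups left in place.
$url_pregex = '((http(s)?://)[-a-zA-Z()0-9@:%_+.~#?&;//=]+)';

// First branch consumes URLs that follow "=" (and an optional quote).
// Second branch stores every other URL in the named group "url"
// (the name is an illustrative choice, not something from the question).
$pregex = "\\s*=\\s*['\"]?\\s*$url_pregex|(?<url>$url_pregex)";

preg_match_all('`' . $pregex . '`i', $html, $matches);

// $matches['url'] holds an empty string wherever the first branch matched,
// so drop those entries and reindex.
$urls = array_values(array_filter($matches['url']));

var_export($urls);
```

This should print the same two URLs as before (url1 and url3). Because the result is read by name, adding or removing groups inside $url_pregex no longer changes which index you need.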