regex url

Regex to match URL not followed by " or <


I'm trying to modify the URL-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls so that it does not match a URL that is already inside an anchor tag, either as the href value or as the link text.

For example, in the following string, I want to match www.foo.com, but NOT http://www.bar.com or http://www.baz.com:

www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>

I tried adding a negative lookahead to exclude matches followed by " or <, but it only seems to apply to the final "m" in .com, so the regex still returns http://www.bar.co and http://www.baz.co as matches.

I can't see what I'm doing wrong... any ideas?

\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])

Here is a simpler example too:

((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])
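
The behaviour described above can be reproduced with the simpler pattern. The following is a minimal sketch in Python (an assumption made for illustration; the names `SIMPLE`, `TEXT` and `find_urls` are hypothetical). It shows the engine giving back the final "m" so that the lookahead sees a letter instead of " or <.

```python
import re

# The simpler pattern from the question, copied unchanged.
SIMPLE = re.compile(r'((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])')

TEXT = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>'


def find_urls(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # The greedy character class first consumes "...bar.com"; the lookahead
    # then fails on the quote, so the engine backtracks one character and
    # retries with "m" as the next character, which satisfies (?!["<]).
    for url in find_urls(SIMPLE, TEXT):
        print(url)
```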

Solution

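  • Yes, it's actually trivial to make this work if you just want to exclude trailing characters: make your expression 'independent' (an atomic group), so that no backtracking occurs in that segment. Once the group has matched the whole URL, the engine cannot give back the final "m" to satisfy the lookahead.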

    (?>\b ...)(?!["<])

    A Perl test:

    use strict;
    use warnings;
    
    my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>http://www.some.com';
    
    # The atomic group (?> ... ) wraps the whole URL pattern; the lookahead sits outside it.
    while ($str =~ m~
     (?>
        \b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
     )
     (?!["<])
    ~xg)
    {
       print "$1\n";
    }
    

    Output:

    www.foo.com
    http://www.some.com
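
For readers not using Perl, the same idea can be sketched in Python. This is only an illustration under stated assumptions: it uses the simpler pattern from the question rather than the full one, with the dot after www escaped and the inner groups made non-capturing, and it emulates the atomic group with a capturing lookahead followed by a backreference, `(?=(...))\1`. The names `URL_BODY`, `ATOMIC` and `find_unlinked_urls` are hypothetical.

```python
import re

# Simplified URL body, adapted from the question's shorter pattern
# (assumption: "www." is escaped and the inner groups are non-capturing).
URL_BODY = r'(?:(?:ht|f)tps?://|www\.)[a-zA-Z0-9_\-.:#/~}?]+'

# A lookahead is not re-entered once it has succeeded, so capturing inside it
# and consuming the capture with \1 behaves like an atomic group.
# Python 3.11+ also accepts the (?>...) syntax directly.
ATOMIC = re.compile(r'(?=(' + URL_BODY + r'))\1(?!["<])')

TEXT = ('www.foo.com <a href="http://www.bar.com">'
        'http://www.baz.com</a>http://www.some.com')


def find_unlinked_urls(text):
    """Return URLs that are not immediately followed by a quote or '<'."""
    return [m.group(1) for m in ATOMIC.finditer(text)]


if __name__ == "__main__":
    for url in find_unlinked_urls(TEXT):
        print(url)
```

With the same test string as the Perl example, this sketch yields www.foo.com and http://www.some.com, since the URLs inside the anchor tag can no longer be shortened to dodge the lookahead.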