[SOLVED] Regex to match URL not followed by " or <

I'm trying to modify the url-matching regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls to not match anything that's already part of a valid URL tag or used as the link text.

For example, in the following string, I want to match www.foo.com, but NOT http://www.bar.com or http://www.baz.com:

www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>

I was trying to add a negative lookahead to exclude matches followed by " or <, but for some reason, it's only applying to the "m" in .com. So, this regex still returns http://www.bar.co and http://www.baz.co as matches.

I can't see what I'm doing wrong... any ideas?

\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))(?!["<])

Here is a simpler example too:

((((ht|f)tps?:\/\/)|(www.))[a-zA-Z0-9_\-.:#/~}?]+)(?!["<])

Solution

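Yeah, it's actually trivial to make it work if you just want to exclude trailing characters. Right now the lookahead fails after the full URL, so the engine backtracks one character and tries the lookahead again in front of the "m", where it passes. That is why you get http://www.bar.co. Just make your expression 'independent' (an atomic group), then no backtracking will occur in that segment.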

(?>\b ...)(?!["<])
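To see the difference in isolation, here is a minimal sketch using the simpler pattern from the question. The escaped dot in www\., the non-capturing groups, and the variable names are my own changes for illustration, not part of the original pattern.

```perl
use strict;
use warnings;

my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>';

# Simpler URL pattern from the question, stored without the lookahead.
# (Assumption: dot in "www." escaped, groups made non-capturing.)
my $url = qr%(?:(?:ht|f)tps?://|www\.)[a-zA-Z0-9_\-.:#/~}?]+%;

# Plain group: when the lookahead fails, the engine gives back the
# trailing "m" and the lookahead then succeeds on the shorter match.
my @backtracking = $str =~ /($url)(?!["<])/g;

# Atomic group: the URL is matched once and never shortened, so a
# failed lookahead rejects the whole candidate.
my @atomic = $str =~ /((?>$url))(?!["<])/g;

print "backtracking: @backtracking\n";
print "atomic:       @atomic\n";
```

The first list should still contain the truncated http://www.bar.co and http://www.baz.co, while the second should contain only www.foo.com.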

A Perl test:

use strict;
use warnings;

my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>http://www.some.com';

while ($str =~ m~
 (?>   # atomic group: no backtracking into the URL once it has matched
    \b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
 )
 (?!["<])   # reject the URL if it is directly followed by " or <
~xg)
{
   print "$1\n";
}

Output:

www.foo.com
http://www.some.com
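
For the simpler pattern from the question, a possessive quantifier (Perl 5.10 and later) should give the same effect, since X++ is shorthand for (?>X+). This is only a sketch: the escaped dot, the non-capturing groups, and the @urls name are my own assumptions, and it only covers the simple pattern, where the final quantifier is the one place backtracking can shorten the match.

```perl
use strict;
use warnings;

my $str = 'www.foo.com <a href="http://www.bar.com">http://www.baz.com</a>http://www.some.com';

# "++" makes the character-class repetition possessive: once it has
# consumed the URL it will not give characters back, so a failed
# lookahead cannot be "repaired" by dropping the trailing "m".
my @urls = $str =~ m%((?:(?:ht|f)tps?://|www\.)[a-zA-Z0-9_\-.:#/~}?]++)(?!["<])%g;

print "$_\n" for @urls;
```

For the long pattern above, wrapping the whole thing in (?> ... ) is simpler than making each quantifier possessive.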