I am having a problem with a non-greedy regular expression (regex). I've seen that there are questions regarding non-greedy regex, but they don't answer my problem.
Problem: I am trying to match the href of the "lol" anchor.
Note: I know this can be done with Perl HTML parsing modules, and my question is not about parsing HTML in Perl. My question is about the regular expression itself and the HTML is just an example.
Test case: I have four tests each for .*? and [^"]. The first two produce the expected result. However, the third doesn't and the fourth does, and I don't understand why.
Why does the third test fail for both .*? and [^"]? Shouldn't the non-greedy operator work?
Why does the fourth test work for both .*? and [^"]? I don't understand why adding a .* in front changes the result (the third and fourth tests are identical except for the leading .*).
I probably don't understand exactly how these regexes work. A Perl Cookbook recipe mentions something, but I don't think it answers my question.
use strict;
my $content=<<EOF;
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF
print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)"~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)".*>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nWhy does not the 2nd non-greedy '?' work?\n"
if $content =~ m~href="(.*?)".*?>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nIt now works if I put the '.*' in the front?\n"
if $content =~ m~.*href="(.*?)".*?>lol~s ;
print "\n###################################################\n";
print "Let's try now with [^]";
print "\n###################################################\n\n";
print "| $1 | \n\nThat's ok\n" if $content =~ m~href="([^"]+?)"~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nThat's ok.\n" if $content =~ m~href="([^"]+?)".*>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nThe 2nd non-greedy still doesn't work?\n"
if $content =~ m~href="([^"]+?)".*?>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nNow with the '.*' in front it does.\n"
if $content =~ m~.*href="([^"]+?)".*?>lol~s ;
Try printing out $& (the text matched by the entire regex) as well as $1. This may give you a better idea of what's happening.
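As a rough illustration of that suggestion, here is a minimal sketch that prints both variables for your third pattern. The two-anchor string is a shortened copy of your sample data, and the variable names $whole and $href are my own, not anything from your script.

```perl
use strict;
use warnings;

# Shortened copy of the question's sample data (two anchors only).
my $content = <<'EOF';
<a href="/hoh/hoh" class="hoh">hoh</a>
<a href="/lol/lol" class="lol">lol</a>
EOF

my ($whole, $href);
if ($content =~ m~href="(.*?)".*?>lol~s) {
    # Copy the match variables right away: the next successful
    # match would overwrite them.
    ($whole, $href) = ($&, $1);
    print "whole match: [$whole]\n";
    print "capture 1:   [$href]\n";
}
```

The whole match runs from the first href= all the way to >lol, which is why the capture holds the first anchor's path rather than the "lol" one.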
The problem you seem to have is that .*? does not mean "Find the match out of all possible matches that uses the fewest characters here." It just means "First, try matching 0 characters here, and go on to match the rest of the regex. If that fails, try matching 1 character. If the rest of the regex still won't match, try 2 characters here, and so on."
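To make the "try 0, then 1, then 2 characters" behaviour concrete, here is a small sketch on a made-up string (the input 'aXbXc' and the variable names are only for illustration and have nothing to do with your HTML).

```perl
use strict;
use warnings;

my $text = 'aXbXc';   # hypothetical input, chosen only for illustration

# Zero characters is enough here: the X right after "a" lets the
# rest of the pattern match immediately.
my ($short) = $text =~ /a(.*?)X/;

# Here the rest of the pattern is "Xc". Zero characters fails (the
# first X is followed by "b"), one character fails, and only with
# two characters ("Xb") does the rest match.
my ($forced) = $text =~ /a(.*?)Xc/;

print "short:  [$short]\n";
print "forced: [$forced]\n";
```

In the second match the lazy group ends up containing an X, even though it is non-greedy, because that is the first length at which the rest of the regex succeeds.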
Perl will always find the match that starts closest to the beginning of the string. Since most of your patterns start with href=, it will find the first href= in the string and see if there's any way to expand the repetitions to get a match beginning there. If it can't get a match, it'll try starting at the next href=, and so on.
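One way to see the "leftmost start wins" rule is to compare the offset where the match begins with the position of the first href= in the string. This is only a sketch on a shortened copy of your data; $-[0] is Perl's built-in offset of the start of the last successful match, while $match_start and $first_href are names I made up.

```perl
use strict;
use warnings;

# Shortened copy of the question's sample data (two anchors only).
my $content = <<'EOF';
<a href="/hoh/hoh" class="hoh">hoh</a>
<a href="/lol/lol" class="lol">lol</a>
EOF

my $first_href = index($content, 'href=');   # where the first href= sits
my $match_start;

if ($content =~ m~href="([^"]+?)".*?>lol~s) {
    $match_start = $-[0];                    # where the match actually began
    print "first href= at offset $first_href\n";
    print "match began at offset $match_start\n";
}
```

Both offsets are the same: a match starting at the first href= exists, so Perl never gets as far as trying the later ones.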
When you add a greedy .* to the beginning of the regex, matching starts with the .* grabbing as many characters as it can. Perl then backtracks to find a href=. Essentially, this causes it to try the last href= in the string first, and work towards the beginning of the string.
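A quick sketch of that effect, again on a shortened copy of your data (three anchors this time, so there is an href= after the "lol" one). The variable names are mine.

```perl
use strict;
use warnings;

# Shortened copy of the question's sample data (three anchors).
my $content = <<'EOF';
<a href="/hoh/hoh" class="hoh">hoh</a>
<a href="/lol/lol" class="lol">lol</a>
<a href="/koo/koo" class="koo">koo</a>
EOF

# Without the leading .* the match is anchored at the first href=.
my ($without) = $content =~ m~href="(.*?)".*?>lol~s;

# With it, the greedy .* swallows the whole string and gives characters
# back one at a time, so the last href= is tried first. Nothing after
# the "koo" anchor contains ">lol", so that attempt fails and Perl
# backs up to the "lol" anchor, where the rest of the pattern matches.
my ($with) = $content =~ m~.*href="(.*?)".*?>lol~s;

print "without leading .*: [$without]\n";
print "with leading .*:    [$with]\n";
```

Note that this works because of backtracking order, not because the pattern has become more precise: the overall match still starts at offset 0, it is just the capture that lands on the last href= followed by >lol.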