The task is quite simple: fetch a URL and parse the result. If there is an error (404, 500, etc.), take appropriate action. That last part is the one I am having trouble with.
I have listed both pieces of code that I currently use. The longer one (LWP + TreeBuilder) works in both cases; the shorter one (TreeBuilder only) works when the fetch succeeds but not when it fails. If I use TreeBuilder alone and the site returns a 404 or some other error, the script simply exits! Any ideas?
Longer code that works
use LWP::Simple;
use LWP::UserAgent;
use HTML::TreeBuilder;
$url = "http://some_url.com/blahblah";
$response = LWP::UserAgent->new->request( HTTP::Request->new( GET => $url ) );
if ($response->is_success) {
    $p = HTML::TreeBuilder->new();
    $p->parse($response->content);
} else {
    warn "Couldn't get $url: ", $response->status_line, "\n";
}
Shorter one that does not
use HTML::TreeBuilder;
$url = "http://some_url.com/blahblah";
$tree = HTML::TreeBuilder->new_from_url($url);
To quote the docs:
If LWP is unable to fetch the URL, or the response is not HTML (as determined by content_is_html in HTTP::Headers), then new_from_url dies, and the HTTP::Response object is found in $HTML::TreeBuilder::lwp_response.
Try this:
use strict;
use warnings;
use HTML::TreeBuilder 5; # need new_from_url
use Try::Tiny;
my $url = "http://some_url.com/blahblah";
my $p = try { HTML::TreeBuilder->new_from_url($url) };
unless ($p) {
    my $response = $HTML::TreeBuilder::lwp_response;
    if ($response->is_success) {
        warn "Content of $url is not HTML, it's " . $response->content_type . "\n";
    } else {
        warn "Couldn't get $url: ", $response->status_line, "\n";
    }
}
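If you would rather not pull in Try::Tiny, the same idea can probably be written with Perl's built-in block eval. This is only a minimal sketch: fetch_tree is a hypothetical helper name of my own, not part of HTML::TreeBuilder, and the extra check for an undefined $response is a defensive assumption in case new_from_url dies before LWP produces a response.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5;    # need new_from_url

# Hypothetical helper (not part of HTML::TreeBuilder):
# returns ($tree, undef) on success, or (undef, $message) on failure.
sub fetch_tree {
    my ($url) = @_;

    # Block eval traps the die from new_from_url, so the script keeps running.
    my $tree = eval { HTML::TreeBuilder->new_from_url($url) };
    return ($tree, undef) if $tree;

    my $response = $HTML::TreeBuilder::lwp_response;

    # Assumption: guard against a die that happened before any response existed.
    return (undef, "new_from_url died: $@") unless $response;

    # The fetch worked, but the content type was not HTML.
    return (undef, "content is not HTML, it's " . $response->content_type)
        if $response->is_success;

    # The fetch itself failed (404, 500, connection error, ...).
    return (undef, $response->status_line);
}

my $url = "http://some_url.com/blahblah";
my ($tree, $error) = fetch_tree($url);
if ($tree) {
    print "Parsed $url\n";
    $tree->delete;    # free the tree when done with it
} else {
    warn "Couldn't get $url: $error\n";
}
```

Returning the message instead of warning inside the helper leaves the "appropriate action" up to the caller, who can retry, skip, or log as needed.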