perl lwp html-tree

Perl HTML::TreeBuilder: how to handle error conditions


The task is quite simple: fetch a URL and parse the result. If there is an error (404, 500, etc.), take appropriate action. That last part is the one I am having trouble with.
I have listed both pieces of code that I currently use. The longer one (LWP + TreeBuilder) works in both cases; the shorter one (TreeBuilder alone) works in the success case but not in the error case. If I use TreeBuilder and the site returns a 404 or some other error, the script simply exits! Any ideas?

Longer code that works

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;

my $url = "http://some_url.com/blahblah";
my $response = LWP::UserAgent->new->request( HTTP::Request->new( GET => $url ) );
if ($response->is_success) {
    # Parse the body only when the request succeeded.
    my $p = HTML::TreeBuilder->new();
    $p->parse($response->decoded_content);
    $p->eof;    # signal end of input so the tree is finalized
} else {
    warn "Couldn't get $url: ", $response->status_line, "\n";
}

Shorter one that does not

use HTML::TreeBuilder;

my $url = "http://some_url.com/blahblah";

my $tree = HTML::TreeBuilder->new_from_url($url);

Solution

  • To quote the docs:

    If LWP is unable to fetch the URL, or the response is not HTML (as determined by content_is_html in HTTP::Headers), then new_from_url dies, and the HTTP::Response object is found in $HTML::TreeBuilder::lwp_response.

    Try this:

    use strict;
    use warnings;
    use HTML::TreeBuilder 5; # need new_from_url
    use Try::Tiny;
    
    my $url = "http://some_url.com/blahblah";
    my $p = try { HTML::TreeBuilder->new_from_url($url) };
    unless ($p) {
        my $response = $HTML::TreeBuilder::lwp_response;
        if ($response->is_success) {
            warn "Content of $url is not HTML, it's " . $response->content_type . "\n";
        } else {
            warn "Couldn't get $url: ", $response->status_line, "\n";
        }
    }
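
If you would rather not depend on Try::Tiny, the same idea can be sketched with core Perl's block eval. This is only a minimal sketch: fetch_tree is a hypothetical helper name, not part of HTML::TreeBuilder, and it assumes HTML::TreeBuilder 5 or later with LWP installed. It returns the tree on success, or an error message built from $HTML::TreeBuilder::lwp_response on failure.

```perl
use strict;
use warnings;
use HTML::TreeBuilder 5;    # need new_from_url

# Hypothetical helper: returns ($tree, undef) on success,
# or (undef, $message) when the fetch or the HTML check fails.
sub fetch_tree {
    my ($url) = @_;

    # new_from_url dies on failure, so trap the exception with eval.
    my $tree = eval { HTML::TreeBuilder->new_from_url($url) };
    my $err  = $@;    # save it before anything else can overwrite $@
    return ($tree, undef) if $tree;

    # On failure, the HTTP::Response is left in this package variable.
    my $response = $HTML::TreeBuilder::lwp_response;
    return (undef, "Couldn't get $url: $err") unless $response;

    return (undef, "Content of $url is not HTML, it's " . $response->content_type)
        if $response->is_success;
    return (undef, "Couldn't get $url: " . $response->status_line);
}

my $url = "http://some_url.com/blahblah";
my ($tree, $error) = fetch_tree($url);
if ($tree) {
    print "Parsed $url\n";
    $tree->delete;    # release the tree when done
} else {
    warn "$error\n";
}
```

Returning the message instead of warning inside the helper lets the caller decide what the "appropriate action" is for each failure.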