perl screen-scraping perl-module lwp html-tableextract

Using HTML::TableExtract on an HTTPS site


I've written a Perl script that uses HTML::TableExtract to scrape data from tables on a site.

It dumps table data fine for unsecured (HTTP) sites, but for HTTPS sites it doesn't work: the tables_report line prints nothing, when it should print a bunch of table data.

However, if I save the content of that HTTPS page to an HTML file, post it on an unsecured HTTP site, and point my script at that HTTP page, the script works as expected.

Does anyone know how I can get this to work over HTTPS?

#!/usr/bin/perl
use lib qw( ..); 
use HTML::TableExtract; 
use LWP::Simple; 
use Data::Dumper; 
# DOESN'T work:
my $content = get("https://datatables.net/"); 
# DOES work:
#   my $content = get("http://www.w3schools.com/html/html_tables.asp"); 
my $te = HTML::TableExtract->new();
$te->parse($content);
print $te->tables_report(show_content=>1);
print "\n";
print "End\n";

The sites used for $content above are just examples. They aren't the sites I'm actually extracting from, but they behave just like the site I'm really trying to scrape.

One option, I guess, is to use Perl to download the page locally first and extract from there, but I'd rather not if there's an easier way. (To anyone who helps: please don't spend a crazy amount of time coming up with a complicated solution!)


Solution

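  • The problem is the User-Agent string that LWP::Simple sends by default, which that site blocks. When the request fails, get returns undef, so HTML::TableExtract has nothing to parse and the report comes out empty. Use LWP::UserAgent and set a user agent that the site accepts, like this: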

    use strict;
    use warnings;
    use LWP::UserAgent;
    
    my $ua = LWP::UserAgent->new;
    my $url = 'https://datatables.net/';
    
    $ua->agent("Mozilla/5.0");  # set user agent
    my $res = $ua->get($url);   # send request
    
    # check the outcome
    if ($res->is_success) {
       # success: this example just prints the content; in practice, parse it
       print $res->decoded_content;
    }
    else {
       # failure: report the HTTP status line
       print "Error: ", $res->status_line, "\n";
    }
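
To tie this back to the original script, here is a minimal sketch that feeds the fetched content into HTML::TableExtract instead of just printing it. The helper names fetch_html and parse_tables are made up for this example, and taking the URL from the command line is only one way to wire it up. It also assumes LWP::Protocol::https is installed, which LWP needs for https:// URLs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TableExtract;

# Hypothetical helper: fetch a page with a browser-like User-Agent string.
# Dies with the HTTP status line if the request fails.
sub fetch_html {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new(agent => 'Mozilla/5.0');
    my $res = $ua->get($url);
    die "Request for $url failed: " . $res->status_line . "\n"
        unless $res->is_success;
    return $res->decoded_content;
}

# Hypothetical helper: parse an HTML string and return the
# HTML::TableExtract object holding every table that was found.
sub parse_tables {
    my ($html) = @_;
    my $te = HTML::TableExtract->new();
    $te->parse($html);
    return $te;
}

# Fetch and dump tables only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $te = parse_tables(fetch_html($url));
    for my $table ($te->tables) {
        # coords returns the table's depth and count within the page
        print "Table (", join(',', $table->coords), "):\n";
        for my $row ($table->rows) {
            # empty cells may be undef, so print them as empty strings
            print join(' | ', map { defined($_) ? $_ : '' } @$row), "\n";
        }
    }
}
```

Saved as, say, scrape_tables.pl, it could be run as `perl scrape_tables.pl https://datatables.net/`. Because fetch_html dies with the status line on failure, a blocked request shows up as an error message rather than as an empty report.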