perlweb-scraping

Perl: how to extract and combine two sub-sections of a table using HTML::TableExtract?


I am trying to scrape and extract multiple pages of data holding tables showing the value and quantity (volume) of various goods imported into China every month. Ultimately I want to write this data out to a text file for further processing. Here is a screenshot of such a page.

Chinese data imported

Specifically I would like to extract the name of the good being imported, the unit of volume (e.g. tons, kg), the actual value and the volume, a total of 4 fields. The problem I have run into is that the tables I want to extract seem to be at different depths.

I can extract the fields for 'volume' and 'value' because they are at the same depth, so I get debug output like this:

2,314 --- 2,943
20,082 --- 80,176
7 --- 62,426

The 'name' and 'units' fields are at different levels (I think) to the 'volume' and 'value' fields so they are not picked up when I use headers for all 4 fields. However, if I try to extract them as a sub-table it works fine, giving this debug output:

啤酒 --- 千升
葡萄酒 --- 千升
饲料用鱼粉 --- 万吨

How should I resolve this? My first thought is to extract each table separately, loop through each row of each table, add the 2 fields in one table and the 2 fields in the other to an array that would have 4 elements for each row. (In R I guess I would create a data frame and use cbind for this.) This seems feasible but it doesn't feel optimal. So first I want to ask:

1) Is there a straightforward way of telling HTML::TableExtract to extract both subsets of the table and combine them?

2) If I have to extract the data as two separate tables and combine them, what is the most efficient way of doing this?

The code I have so far follows:

use strict;
use HTML::TableExtract;
use Encode;
use utf8;
use WWW::Mechanize;
use Data::Dumper;

binmode STDOUT, ":utf8";

# Chinese equivalents of the various headings
my $txt_header = "单位:千美元";
my $txt_name = "商品名称";
my $txt_units = "计量单位";
my $txt_volume = "数量";
my $txt_value = "金额";

# Chinese Customs site
my $url = "http://www.chinacustomsstat.com/aspx/1/newdata/record_class.aspx?page=2&guid=951";

my $mech = WWW::Mechanize->new( agent => 'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)');
my $page = $mech->get( $url );
my $htmlstuff = $mech->content();

print ("\nFirst table with two headers (volume and value) at same depth\n\n");
my $te = new HTML::TableExtract( depth => 1,  headers => [ ( $txt_volume, $txt_value ) ]);
$te->parse($htmlstuff);

# See what we have
foreach my $ts ( $te->tables ) {
    print "Table (", join( ',', $ts->coords ), "):\n";
    foreach my $row ( $ts->rows ) {
        print join( ' --- ', @$row ), "\n";
    }
}

print ("\nSecond table with 'name' and 'units'\n");

$te = new HTML::TableExtract( headers => [ ( $txt_name, $txt_units ) ]);
$te->parse($htmlstuff);

# See what we have in the other table
foreach my $ts ( $te->tables ) {
    print "Table (", join( ',', $ts->coords ), "):\n";
    foreach my $row ( $ts->rows ) {
        print join( ' --- ', @$row ), "\n";
    }
}

Solution

  • It does look like the table is implemented nested tables. That being the case, it seems to me that you're application is going to need to compensate for that by parsing the tables individually and stitching the results together afterward.

    I took some liberties with your sample code and came up with this:

    #!/usr/bin/perl
    
    use utf8;
    use strict;
    use warnings;
    {
        use WWW::Mechanize;
        use HTML::TableExtract;
        use Data::Dumper;
        use Text::FormatTable;
    }
    
    binmode STDOUT, ':utf8';
    
    my $txt_name   = '商品名称';
    my $txt_units  = '计量单位';
    my $txt_volume = '数量';
    my $txt_value  = '金额';
    
    my $url
        = 'http://www.chinacustomsstat.com'
        . '/aspx/1/newdata/record_class.aspx'
        . '?page=2&guid=951';
    
    my $mech = WWW::Mechanize->new(
        agent => 'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)' );
    
    my $page = $mech->get($url);
    my $html = $mech->content();
    
    my %data_for;
    {
        my %config_for = (
            products => {
                values  => [],
                headers => [ $txt_name, $txt_units ],
            },
            data => {
                values  => [],
                headers => [ $txt_volume, $txt_value ],
            },
        );
    
        for my $type ( keys %config_for ) {
    
            my $config_rh = $config_for{$type};
    
            my $te = HTML::TableExtract->new( headers => $config_rh->{headers} );
    
            $te->parse($html);
    
            for my $ts ( $te->tables() ) {
    
                for my $row_ra ( $ts->rows() ) {
    
                    if ( defined $row_ra->[0] ) {
    
                        push @{ $config_rh->{values} }, $row_ra;
                    }
                }
            }
        }
    
        if ( @{ $config_for{products}->{values} }
            != @{ $config_for{data}->{values} } )
        {
            warn 'not as many value rows were parsed as product rows';
        }
    
        for my $i ( 0 .. $#{ $config_for{products}->{values} } ) {
    
            my $product_ra = $config_for{products}->{values}->[$i];
            my $data_ra    = $config_for{data}->{values}->[$i];
    
            my ( $product, $units ) = @{$product_ra};
            my ( $volume,  $value ) = @{$data_ra};
    
            $data_for{$product} = {
                units  => $units,
                volume => $volume,
                value  => $value,
            };
        }
    }
    
    # process results in %data_for hash
    {
        my $table = Text::FormatTable->new('| l | l | l | l |');
    
        $table->head( $txt_name, $txt_units, $txt_volume, $txt_value, );
        $table->rule('=');
    
        for my $product ( keys %data_for ) {
    
            $table->row(
                $product,
                @{ $data_for{$product} }{qw( units volume value )}
            );
            $table->rule('-');
        }
    
        print $table->render();
    }
    
    __END__
    

    I'm a little disappointed with the way that Text::FormatTable handles (or doesn't handle) wide characters. But I think that's beside the point for this example.