perl html-parsing html-tableextract

How to use the Perl HTML::TableExtract rows method when there are duplicate header fields


I'm trying to use HTML::TableExtract to parse a table in which some headers share a name but sit over different data. The rows method returns the values under the first set of duplicate headers for both sets.

use HTML::TableExtract;

my @headers = qw(Flight Rating Airline Sched Actual Gate Sched Actual Gate Status Equip Track);
my $te = HTML::TableExtract->new(headers => \@headers);
$te->parse_file($myfile);
my ($table) = $te->tables;

I print the data like this:

for my $row ($te->rows) {
    for (@$row) {
        s/\n//g;
        s/\r//g;
    }
    print OUT join("\t", @$row), "\n";
}

The result is this:

AA 1251     American Airlines   9:30 PM 10:22 PM    T-CC37  9:30 PM 10:22 PM    T-CC37  Landed 68 min   M80 

It should have been this:

AA 1251     American Airlines   9:30 PM 10:22 PM    T-CC37  11:00 PM    12:08 AM    T-C77   Landed 68 min   M80 

The data from the first "Sched Actual Gate" columns (Departure) is repeated in the second "Sched Actual Gate" columns (Arrival).

I can see the correct data when I dump the entire table with Dumper($table).

How do I get the rows method to properly parse tables with duplicate header fields?


Solution

  • I found the answer: pass "slice_columns => 0" to the HTML::TableExtract constructor.

    I'm not exactly sure why this is necessary. The HTML::TableExtract documentation on CPAN says "Columns that are not beneath one of the provided headers will be ignored unless slice_columns was set to 0. Columns will, by default, be rearranged into the same order as the headers you provide (see the automap parameter for more information) unless slice_columns is 0."

    In my table, every column is under a provided header, so the first sentence does not explain the problem. The likely cause is the rearranging described in the second sentence: when header names are not unique, mapping columns back to headers appears to resolve both copies of a name to the same column. Setting slice_columns to 0 skips that step and avoids the issue.

    my $te = HTML::TableExtract->new(
        headers       => \@headers,
        slice_columns => 0,
    );
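
For anyone who wants to try the fix in isolation, here is a minimal sketch. The inline HTML is a made-up, shortened version of the flight table (it is not the asker's actual page), and the @lines array is just an illustrative way to collect the output. It only uses the constructor options quoted above plus the parser's parse and eof methods.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical sample table: "Sched", "Actual" and "Gate" each appear twice,
# first for departure and then for arrival.
my $html = <<'HTML';
<table>
<tr><th>Flight</th><th>Sched</th><th>Actual</th><th>Gate</th><th>Sched</th><th>Actual</th><th>Gate</th><th>Status</th></tr>
<tr><td>AA 1251</td><td>9:30 PM</td><td>10:22 PM</td><td>T-CC37</td><td>11:00 PM</td><td>12:08 AM</td><td>T-C77</td><td>Landed</td></tr>
</table>
HTML

my @headers = qw(Flight Sched Actual Gate Sched Actual Gate Status);

my $te = HTML::TableExtract->new(
    headers       => \@headers,
    slice_columns => 0,    # return columns in table order, no remapping to headers
);
$te->parse($html);
$te->eof;

my @lines;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        # Empty cells can come back undefined, so substitute an empty string.
        my @cells = map { defined $_ ? $_ : '' } @$row;
        tr/\r\n//d for @cells;    # strip embedded line breaks
        push @lines, join("\t", @cells);
    }
}
print "$_\n" for @lines;
```

Keep in mind that with slice_columns set to 0 the rows follow the table's own column layout, so any column that is not under one of the listed headers is returned as well, and positions in each row correspond to the table rather than to @headers.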