I'm trying to use HTML::TableExtract to parse a table in which some headers share the same name but sit over different data. The rows method returns the values from the first set of duplicate headers for both sets.
use HTML::TableExtract;

my @headers = qw(Flight Rating Airline Sched Actual Gate Sched Actual Gate Status Equip Track);
my $te = HTML::TableExtract->new(
    headers => \@headers,
);
$te->parse_file($myfile);
my ($table) = $te->tables;
I print the data like this:
for my $row ($te->rows) {
    # Strip embedded newlines and carriage returns from every cell.
    s/[\r\n]//g for @$row;
    print OUT join("\t", @$row), "\n";
}
The result is this:
AA 1251 American Airlines 9:30 PM 10:22 PM T-CC37 9:30 PM 10:22 PM T-CC37 Landed 68 min M80
It should have been this:
AA 1251 American Airlines 9:30 PM 10:22 PM T-CC37 11:00 PM 12:08 AM T-C77 Landed 68 min M80
The first "Sched Actual Gate" data (representing Departure) is duplicated in the second "Sched Actual Gate" columns (representing Arrival).
I can see the correct data when I dump the entire table with Dumper($table).
How do I get the rows method to properly parse tables with duplicate header fields?
I found the answer: it is necessary to pass "slice_columns => 0" to the HTML::TableExtract constructor.
I'm not exactly sure why this is necessary. The HTML::TableExtract documentation on CPAN says "Columns that are not beneath one of the provided headers will be ignored unless slice_columns was set to 0. Columns will, by default, be rearranged into the same order as the headers you provide (see the automap parameter for more information) unless slice_columns is 0."
In my table, every column is under a provided header. Presumably the column rearrangement misbehaves when headers are not unique, and setting slice_columns to 0 avoids the issue by leaving the columns in document order.
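To illustrate what I suspect is going on, here is a small standalone sketch. It is only my guess at the mechanism, not the module's actual code: it assumes columns get looked up by header name, so two headers with the same name resolve to the same column. The shortened header list and the %col_for lookup are made up for the illustration; the cell values come from the row shown above.

```perl
use strict;
use warnings;

# Shortened, hypothetical version of the header row and one data row.
my @found = qw(Flight Sched Actual Gate Sched Actual Gate);
my @cells = ('AA 1251', '9:30 PM', '10:22 PM', 'T-CC37',
             '11:00 PM', '12:08 AM', 'T-C77');

# ASSUMPTION: a name-keyed lookup that remembers only the first
# column index seen for each header name.
my %col_for;
for my $i (0 .. $#found) {
    $col_for{ $found[$i] } //= $i;
}

# Rearranging by header name: both "Sched" requests resolve to column 1,
# so the departure values are repeated in the arrival columns.
my @by_name = map { $cells[ $col_for{$_} ] } @found;

# Leaving the columns in document order keeps every column distinct.
my @by_position = @cells;

print join("\t", @by_name),     "\n";
print join("\t", @by_position), "\n";
```

The first printed line repeats the departure values, which matches the symptom I saw; the second keeps the arrival values intact.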
my $te = HTML::TableExtract->new(
    headers       => \@headers,
    slice_columns => 0,
);
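Since the columns now come back in document order, the duplicate names can be told apart by position. A minimal sketch of one way to do that, assuming a shortened 7-column row: the Dep_/Arr_ labels are names I made up for illustration, and $row stands in for one element of the list returned by $te->rows.

```perl
use strict;
use warnings;

# Hypothetical unique labels, one per column, in document order.
my @labels = qw(Flight Dep_Sched Dep_Actual Dep_Gate
                Arr_Sched Arr_Actual Arr_Gate);

# Stand-in for one row from $te->rows (shortened to 7 columns).
my $row = ['AA 1251', '9:30 PM', '10:22 PM', 'T-CC37',
           '11:00 PM', '12:08 AM', 'T-C77'];

# A hash slice pairs each label with the cell at the same position.
my %flight;
@flight{@labels} = @$row;

print "$flight{Flight}: departs $flight{Dep_Sched}, arrives $flight{Arr_Sched}\n";
```

This only works if the label list has exactly one entry per column in the table, so it needs updating if the table layout changes.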