perl html-tableextract

How to merge content from multiple HTML files into a single one?


I have more than 100 HTML files with the following structure:

<html>
<head>
<body>
    <TABLE>
      ...
    </TABLE>
    <TABLE>
        <TR>
            <td rowspan=2><img src="http://www.example.com" width=10></td>
            <TD width=609 valign=top>
                <!-- Content of file1 -->
                <p>abc</p>
                ...
                ...
                ...
                <p>xyz</p>
            </TD>
        </TR>
        <TR>
            <TD align="center" ...alt="top"></a></TD>
        </TR>
    </TABLE>        
</body>
</html>

I'd like to merge into a single HTML file the content of column 2 in the first row of the second table (TABLE[2]ROW[1]COLUMN[2]) from each file, to get output like this:

<html>
<head>
<body>
    <!-- Content of file1 -->
    <p>abc</p>
    ...
    ...
    ...
    <p>xyz</p>

    <!-- Content of file2 -->
    <p>some text</p>
    ...
    ...
    ...
    <p>some text</p>

    ..
    ..
    ..
    <!-- Content of fileN -->
    <p>some text</p>
    ...
    ...
    ...
    <p>some text</p>
</body>
</html>

I'm new to Perl and would appreciate some pointers on how to do this. Thanks in advance.

Below is my first attempt for file1, but I'm not sure I'm on the right track.

use HTML::TableExtract;

open (my $html,"<","file1.html");

my $table = HTML::TableExtract->new(keep_html=>0, depth => 1, count => 2, br_translate => 0 );
$table->parse($html);

foreach my $row ($table->rows) {
    print join("\t", @$row), "\n";
}

Solution

  • The HTML::TableExtract documentation states that depth, count, row and col all start from 0, so the question's TABLE[2]ROW[1]COLUMN[2] corresponds to depth 0, count 1, row 0, col 1.

    The following code is a skeleton that assumes all HTML files are stored in one directory.

    With the help of glob we obtain the names of the HTML files.

    Then we write a subroutine extract_table_cell, to which we pass the file name along with depth, count, row and col, to extract the data located at that position.

    For each file name we call extract_table_cell and store the returned data in the array @data.

    We also write a subroutine gen_html, which takes a reference to the @data array and returns HTML code representing that data.

    Finally, we call say with the return value of gen_html as its argument to output the result.

    NOTE: you will need to modify extract_table_cell to get the cell data in the desired format.

    use strict;
    use warnings;
    use feature 'say';
    
    use HTML::TableExtract;
    
    my($depth,$table,$row,$col) = (0,1,0,1);
    my @data;
    
    for (glob("*.html")) {
        push @data, extract_table_cell($_,$depth,$table,$row,$col);
    }
    
    say gen_html(\@data);
    
    sub gen_html {
        my $data = shift;
    
        my($html,$block);
    
        for ( @{$data} ) {
            $block .= "\t\t$_\n";
        }
    
        $html =
    "
    <html>
        <head>
        </head>
        <body>
        $block
        </body>
    </html>
    ";
    
        return $html;
    }
    
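    # Return the text of one cell from the table at ($depth,$count) in $file
    sub extract_table_cell {
        my($file,$depth,$count,$row,$col) = @_;
    
        my $te = HTML::TableExtract->new( depth => $depth, count => $count );
    
        $te->parse_file($file);
    
        # first_table_found returns undef when no table matched
        my $table = $te->first_table_found
            or die "No table at depth $depth, count $count in $file\n";
    
        # cell() is the documented accessor, no need to reach into {grid}
        return $table->cell($row,$col);
    }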
    

    Output

    <html>
        <head>
        </head>
        <body>
            B 1.2
            D 1.2
    
        </body>
    </html>
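
    The output above contains only the visible text of each cell, while the question asks for the <p> elements themselves. One way to adapt extract_table_cell, as the NOTE suggests, is the keep_html option of HTML::TableExtract (the same option the question's attempt sets to 0). The following is only a sketch: extract_cell_html is a hypothetical name, it takes an HTML string rather than a file name so it can be tried in isolation, and the sample markup is made up.

    ```perl
    use strict;
    use warnings;

    use HTML::TableExtract;

    # Hypothetical variant of extract_table_cell: works on an HTML string
    # and keeps the markup found inside the cell (keep_html => 1).
    sub extract_cell_html {
        my($html,$depth,$count,$row,$col) = @_;

        my $te = HTML::TableExtract->new(
            depth     => $depth,
            count     => $count,
            keep_html => 1,    # return the cell's HTML, not just its text
        );

        $te->parse($html);
        $te->eof;

        # No table at this depth/count: return nothing instead of dying
        my $table = $te->first_table_found or return;

        return $table->cell($row,$col);
    }

    # Sample input shaped like the question's files (contents are made up)
    my $sample = '<table><tr><td>nav</td></tr></table>'
               . '<table><tr><td>img</td><td><p>abc</p><p>xyz</p></td></tr></table>';

    print extract_cell_html($sample, 0, 1, 0, 1), "\n";
    ```

    In the skeleton itself, passing keep_html => 1 to the existing HTML::TableExtract->new call and keeping parse_file should have the same effect. Whether HTML comments inside the cell survive is not something this sketch checks.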
    

    Test data files:

    table_1.html

    <html>
        <head>
        </head>
        <body>
            <table>
                <tr><td>A 1.1</td><td>A 1.2</td><td>A 1.3</td></tr>
                <tr><td>A 2.1</td><td>A 2.2</td><td>A 2.3</td></tr>
                <tr><td>A 3.1</td><td>A 3.2</td><td>A 3.3</td></tr>
                <tr><td>A 4.1</td><td>A 4.2</td><td>A 4.3</td></tr>
            </table>
    
            <table>
                <tr><td>B 1.1</td><td>B 1.2</td><td>B 1.3</td></tr>
                <tr><td>B 2.1</td><td>B 2.2</td><td>B 2.3</td></tr>
                <tr><td>B 3.1</td><td>B 3.2</td><td>B 3.3</td></tr>
                <tr><td>B 4.1</td><td>B 4.2</td><td>B 4.3</td></tr>
            </table>
        </body>
    </html>
    

    table_2.html

    <html>
        <head>
        </head>
        <body>
            <table>
                <tr><td>C 1.1</td><td>C 1.2</td><td>C 1.3</td></tr>
                <tr><td>C 2.1</td><td>C 2.2</td><td>C 2.3</td></tr>
                <tr><td>C 3.1</td><td>C 3.2</td><td>C 3.3</td></tr>
                <tr><td>C 4.1</td><td>C 4.2</td><td>C 4.3</td></tr>
            </table>
    
            <table>
                <tr><td>D 1.1</td><td>D 1.2</td><td>D 1.3</td></tr>
                <tr><td>D 2.1</td><td>D 2.2</td><td>D 2.3</td></tr>
                <tr><td>D 3.1</td><td>D 3.2</td><td>D 3.3</td></tr>
                <tr><td>D 4.1</td><td>D 4.2</td><td>D 4.3</td></tr>
            </table>
        </body>
    </html>
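
    One caveat about file order in the skeleton: glob returns names in alphabetical order, so with names such as file1.html ... file100.html (the question's actual file names are not given, so this is an assumption) file10.html would be processed before file2.html. A minimal sketch of a numeric ordering, using core Perl only; by_file_number is a hypothetical helper, not part of the original answer.

    ```perl
    use strict;
    use warnings;

    # Order names by the first number embedded in them, so that
    # file2.html comes before file10.html. Names without digits count
    # as 0; ties fall back to plain string comparison.
    sub by_file_number {
        my @names = @_;
        return map  { $_->[1] }
               sort { $a->[0] <=> $b->[0] or $a->[1] cmp $b->[1] }
               map  { [ (/(\d+)/ ? $1 : 0), $_ ] }
               @names;
    }

    my @ordered = by_file_number(qw(file10.html file2.html file1.html));
    print "$_\n" for @ordered;
    ```

    In the skeleton the loop would then read for (by_file_number(glob("*.html"))) { ... }.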