Here is a basic HTML table:
<table>
<thead>
<td class="foo">bar</td>
</thead>
<tbody>
<td>rows</td>
…
</tbody>
</table>
Suppose there are several such tables in the source file. Is there an option of hxextract, or a CSS3 selector I could use with hxselect, or some other tool, that would let me extract one particular table, based either on the content of its thead or on its class, if it has one? Or am I stuck with not-so-simple awk (or maybe perl, as I found before submitting) scripting?
Update:
For content-based extraction, Perl's HTML::TableExtract does the trick:
#!/usr/bin/env perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use HTML::TableExtract;

# Extract tables based on header content; slice_columns => 0 helps with colspan issues
my $te = HTML::TableExtract->new( headers => ['Multi'], slice_columns => 0 );
$te->parse_file('mywebpage.html') or die "Cannot read mywebpage.html: $!\n";

# Loop over all matching tables
foreach my $ts ( $te->tables ) {
    # Print table identification
    print "Table (", join( ',', $ts->coords ), "):\n";
    # Print table content, one row per line; some cells may be undef
    foreach my $row ( $ts->rows ) {
        print join( ':', map { $_ // '' } @$row ), "\n";
    }
}
However, in some cases a simple lynx -dump mywebpage.html coupled with awk or whatever can be just as efficient.
This would require a parent selector or a relational selector, which does not yet exist (and by the time it does exist, hxselect may not implement it, because as of this writing it does not even fully implement the current standard). hxextract appears to retrieve an element only by its type and/or class name, so the best it could do is td.foo, which would return only the td, not its thead or table.
If you are processing this HTML from the command line, you will need a script.
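For the class-based case, here is a minimal sketch of what such a script could look like, assuming Python 3 is available and using only its standard library (html.parser). The names TableByHeaderClass and extract_tables are made up for this illustration. It keeps the source of every table whose thead contains a td or th with the wanted class. As a simplification, a nested table is treated as part of its outermost table, and comments inside tables are dropped.

```python
from html import escape
from html.parser import HTMLParser


class TableByHeaderClass(HTMLParser):
    """Collect every <table> whose <thead> holds a cell with a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0         # nesting depth of <table> elements
        self.in_thead = False  # True while inside a <thead>
        self.matched = False   # True once the current table qualifies
        self.buffer = []       # rebuilt source of the table in progress
        self.tables = []       # finished tables that matched

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        if self.depth == 0:
            return  # outside any table: nothing to record
        self.buffer.append(self.get_starttag_text())
        if tag == "thead":
            self.in_thead = True
        elif tag in ("td", "th") and self.in_thead:
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.matched = True

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.buffer.append(f"</{tag}>")
        if tag == "thead":
            self.in_thead = False
        elif tag == "table":
            self.depth -= 1
            if self.depth == 0:  # outermost table just closed
                if self.matched:
                    self.tables.append("".join(self.buffer))
                self.buffer, self.matched = [], False

    def handle_data(self, data):
        # Character references arrive decoded, so re-escape the text.
        if self.depth:
            self.buffer.append(escape(data, quote=False))


def extract_tables(html_text, wanted_class):
    """Return the source of each table whose thead has a cell of that class."""
    parser = TableByHeaderClass(wanted_class)
    parser.feed(html_text)
    parser.close()
    return parser.tables
```

With the table from the question saved in mywebpage.html, calling extract_tables(open("mywebpage.html", encoding="utf-8").read(), "foo") should return a list with that one table. Matching on the header text instead of the class would only need a different test in handle_data while in_thead is set.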