How does HTML::Query autocomplete invalid HTML in Perl 5?


I have invalid HTML in which the tr inside thead is missing. I am still trying to select elements from this HTML using HTML::Query, but the selectors behave counterintuitively.

This is my code:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::Query;
use JSON;

my $q = HTML::Query->new( text => '
<table>
    <thead>
        <th>A</th>
        <th>B</th>
        <th>C</th>
    </thead>
    <tbody>
    <tr>
        <td>E</td>
        <td>F</td>
        <td>G</td>
    </tr>
    </tbody>
</table>
' );

my %data = (
    tr => $q->query('tr')->first->as_trimmed_text,
    tbody => $q->query('tbody')->first->as_trimmed_text
);

print JSON->new->utf8(0)->encode( \%data );

Result:

{
  "tbody": "",
  "tr": "BC"
}

Of course, if I use correct HTML with the missing tr added:

<table>
    <thead>
    <tr>
        <th>A</th>
        <th>B</th>
        <th>C</th>
    </tr>
    </thead>
    <tbody>
    <tr>
        <td>E</td>
        <td>F</td>
        <td>G</td>
    </tr>
    </tbody>
</table>

Then the program prints the expected, intuitive output:

{
  "tr": "ABC",
  "tbody": "EFG"
}

My questions:


Solution

  • I'm the maintainer of this package.

    HTML::Query provides selector magic on top of the parsing capabilities of HTML::Tree. HTML::Query does no parsing of its own. Your problem is with HTML::Tree, not with HTML::Query.

    HTML::Tree largely predates specification conformity. It was originally created back when IE ruled the internet (1999) and handles HTML based upon "real world" usage. It does a pretty good job of handling HTML 4 documents, but as you note there are issues with unconventional markup that is otherwise legal in HTML 4. There is no recourse for these edge cases: the library cannot and will not handle them, because thousands of organizations rely on the existing implementation continuing to work as it has.

    HTML::Tree does not properly support HTML5. The author of the underlying library HTML::Tagset refuses to support it, and argues with (or ignores) anyone who offers a solution or offers to take over the library. This stance effectively prevents ALL derivative projects from properly handling HTML5; HTML::Query and CSS::Inliner are no exception.

    As for the suggestion to use HTML::HTML5::Parser or "whatever other Perl modules that provide an HTML parser" - I welcome patches. That being said: there are no sufficiently maintained Perl libraries that bring to the table what HTML::Tree did, so any such attempt is likely to fail, but let's see what you can come up with.
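
Since the selectors only see whatever tree HTML::Tree builds, one way to understand the odd results is to look at that tree directly. The following is a minimal sketch, not part of the original answer: it assumes HTML::TreeBuilder's `new_from_content` parses the markup in much the same way HTML::Query does when given `text => ...`, and the list of selectors is chosen only for illustration.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder;    # ships with the HTML::Tree distribution
use HTML::Query;

# The invalid markup from the question: <th> cells sit directly in <thead>.
my $html = <<'HTML';
<table>
    <thead>
        <th>A</th>
        <th>B</th>
        <th>C</th>
    </thead>
    <tbody>
    <tr>
        <td>E</td>
        <td>F</td>
        <td>G</td>
    </tr>
    </tbody>
</table>
HTML

# Parse with HTML::Tree alone, leaving HTML::Query out of the picture.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Print the element tree, one node per line, indented by depth.
# This shows where the parser actually placed thead, tr, th and td.
$tree->dump;

# Hand the same tree to HTML::Query; selectors run against these nodes.
my $q = HTML::Query->new( tree => $tree );

# Illustrative selectors only: count what each one matches in that tree.
for my $selector (qw( thead tbody tr th td )) {
    printf "%-6s matches %d element(s)\n",
        $selector, $q->query($selector)->size;
}
```

Comparing the dump for the invalid markup with the dump for the corrected markup should show how the parser rearranged the elements, which is what the `tr` and `tbody` selectors end up matching.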