php html parsing domdocument

DOMDocument loading


I want to parse an HTML file.

$html = htmlentities(file_get_contents('http://forums.heroesofnewerth.com/showthread.php?553261'));
$dom = new DOMDocument();
$dom->loadHTML($html); // line 30

I'm getting these errors:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 113 in D:\Projects\Web projects\done\honscript\index.php on line 30

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 113 in D:\Projects\Web projects\done\honscript\index.php on line 30

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 200 in D:\Projects\Web projects\done\honscript\index.php on line 30

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 200 in D:\Projects\Web projects\done\honscript\index.php on line 30

After changing the code to use htmlentities, I get:

Warning: DOMDocument::loadHTML(): Empty string supplied as input in D:\Projects\Web projects\done\honscript\index.php on line 30

Solution

  • The document you're trying to load is not valid HTML and thus not valid DOM (see http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2Fforums.heroesofnewerth.com%2Fshowthread.php%3F553261 for an extensive list of HTML errors on that page).

    So PHP basically has to guess what's meant by the HTML it's provided with and warns about that (it might guess wrong).

    The & is a special character in HTML which is used to escape special characters (for example, to print < in an HTML page you'd have to write &lt;). It also has a special meaning in URLs as a separator for request variables (e.g., http://example.com?foo=bar&braz=idec) and thus appears a lot in websites. The correct way to write an & within HTML is &amp;.

    Probably the guesses are correct and the DOMDocument will work just fine. So you could just suppress this warning like so:

    @$dom->loadHTML($html);
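
    As an alternative to the @ operator, libxml's own error buffering can collect the parser messages so they can be inspected instead of discarded. The following is a minimal sketch, not a definitive fix; the sample markup in $html is a made-up stand-in for the forum page and contains a bare & of the kind the warnings complain about.

    ```php
    <?php
    // Hypothetical sample input: the "&" in the URL is not written as "&amp;".
    $html = '<html><body><a href="page.php?a=1&b=2">link</a></body></html>';

    // Buffer libxml parse errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Print whatever the parser reported, then empty the error buffer.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), ' (line ', $error->line, ")\n";
    }
    libxml_clear_errors();

    // Restore the error-handling mode that was active before.
    libxml_use_internal_errors($previous);

    // The parser's guess for the malformed attribute is still usable.
    $href = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
    ```

    Which messages are reported depends on the libxml version PHP was built against, so the loop may print nothing on some installations.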
    

    Otherwise, you'd have to fix the HTML somehow. Just running it through htmlentities() as you tried will not work, since it also escapes all the tag markers (< and >), so no markup is left for the parser.

    What might work is replacing every & with &amp;, although this affects existing entities too: &amp; would become &amp;amp;. So you'd have to replace only those &s that aren't followed by amp; (or, more generally, that don't already start an entity).
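
    The "only replace bare &s" idea can be sketched with a regular expression that uses a negative lookahead. This is a heuristic under stated assumptions, not a full HTML repair: escapeBareAmpersands is a hypothetical helper name, and the pattern treats any & followed by something that looks like a named or numeric reference ending in ; as already escaped.

    ```php
    <?php
    // Hypothetical helper: escape only "&" characters that do not already
    // begin a character reference.
    function escapeBareAmpersands(string $html): string
    {
        // Negative lookahead skips "&" followed by a named (&amp;),
        // decimal (&#38;) or hexadecimal (&#x26;) reference.
        $pattern = '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/';
        return preg_replace($pattern, '&amp;', $html);
    }

    // Made-up sample: one bare "&" in the URL, three existing references.
    $fixed = escapeBareAmpersands(
        '<a href="?foo=bar&braz=idec">Tom &amp; Jerry &lt;3 &#38;</a>'
    );
    echo $fixed, "\n";
    ```

    Only the & in the query string is rewritten; &amp;, &lt; and &#38; are left alone. The result could then be passed to loadHTML() in place of the raw page source.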