php character-encoding simple-html-dom

How to parse Simple HTML DOM with ampersand (etc) character errors


There are four or five questions on SO that address this specific issue (an example); however, they are quite old (10+ years) and none of them resolves the issue with specifics. I'm hoping that answers to this question might both solve my specific problem and clear up the confusion for the community.

I am trying to parse a client's site, to build a summary of current content for their IT department. (Please don't ask me why they can't do this themselves.)

In the past I have used the PHP library Simple HTML DOM Parser to do tasks such as this. I have not used this library for about seven years, but I've never run into this issue before.

When loading the document into an object using…

$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTMLFile($url); // run WITH error output

PHP emits warnings along these lines:

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in https://thehtmlfilename.html, line: 45 in /myScript/index.php on line 47

Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: no name in https://thehtmlfilename.html, line: 88 in /myScript/index.php on line 47

These warnings seem neither to prevent the DOM from loading nor to stop the script from running. However, when I attempt to access the anchors using $anchors = $dom->getElementsByTagName('a');, the script runs through the first three or four (well-constructed) links, then reaches markup like this:

<li class="">
    <a href="https://www.thecompany.com/campus_staff.html">Campus & Staff</a>
</li>
<li class="">
    <a href="https://www.thecompany.com/parents-and-families.html">Family & Friends</a>
</li>

Careful analysis determines that it is lines like these that produce the warnings above. Both of these lines produce the "expecting ';'" warning.

When I var_dump the $anchors object, all that is returned is this:

object(DOMNodeList)#2 (1) {
  ["length"]=>
  int(90)
}

Other answers, such as the one in the linked question above, say:

My best guess then is that there is an unescaped ampersand (&) somewhere in the HTML. This will make the parser think we're in an entity reference (e.g. &copy;). When it gets to ;, it thinks the entity is over. It then realises what it has doesn't conform to an entity, so it sends out a warning and returns the content as plain text.

Which suggests that I am on the right track.

The various resolutions that have been suggested all involve changing the & to a non-& character by some means: str_replace, preg_replace, htmlentities, etc.

I see a contradiction in these answers. The & character seems to be interrupting the loading process that loadHTMLFile() initiates and that creates the DOM object. If that is the case, the programmer has no opportunity to replace the & character before the document is processed.

How then? It's a great step forward to identify the problem, as in the linked questions; but how do we solve that problem? How do we pull these href links from this page?

It's worth noting that the ampersand that we find in…

<a href="https://www.thecompany.com/campus_staff.html">Campus & Staff</a>

… is not in the href itself, but in the link text (between the <a> tags).


Solution

  • Fetch the content as a string first, then replace the bare ampersands with the &amp; entity before handing the string to the parser.

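    // Read the page into a string so it can be cleaned before parsing.
    $html = file_get_contents('/path/to/file.html');
    // Escape ampersands that are followed by whitespace, as in "Campus & Staff".
    $html = preg_replace('/&(?=\s)/', '&amp;', $html);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $anchors = $doc->getElementsByTagName('a');
    foreach ($anchors as $anchor) {
      // textContent also works when the link's first child is an element, not text.
      print $anchor->textContent . PHP_EOL;
    }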
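If the page also contains ampersands that are not followed by whitespace (for example "R&D"), a slightly broader pattern may help. The following is only a sketch under a few assumptions: the helper names escapeBareAmpersands() and extractLinks() are hypothetical, the sample markup is modelled on the snippet in the question, and the regular expression treats any & that does not start a named, decimal, or hexadecimal entity as bare. It also uses libxml_use_internal_errors() so that any remaining parser complaints are collected internally rather than printed as PHP warnings. Note that the replacement runs over the whole string, so ampersands inside script blocks would be rewritten too.

```php
<?php
// Hypothetical helper: escape only ampersands that do not already start
// a named (&amp;), decimal (&#38;) or hexadecimal (&#x26;) entity.
function escapeBareAmpersands(string $html): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/',
        '&amp;',
        $html
    );
}

// Hypothetical helper: return href/text pairs for every <a> element.
function extractLinks(string $html): array
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML(escapeBareAmpersands($html));

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}

// Sample markup: one bare ampersand and one that is already escaped.
$sample = '<ul>'
    . '<li class=""><a href="https://www.thecompany.com/campus_staff.html">Campus & Staff</a></li>'
    . '<li class=""><a href="https://www.thecompany.com/parents-and-families.html">Family &amp; Friends</a></li>'
    . '</ul>';

foreach (extractLinks($sample) as $link) {
    echo $link['href'], ' => ', $link['text'], PHP_EOL;
}
```

For a remote page, the string passed to extractLinks() could come from file_get_contents($url), just as in the snippet above. Iterating the DOMNodeList with foreach is what exposes the individual anchors; var_dump on the list itself only reports its length.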