javascript php domdocument

Is there a way to use PHP's DOMDocument to parse HTML containing JavaScript which itself contains HTML strings?


I have an HTML string containing a <script> tag. The script creates a custom element with a shadow DOM via window.customElements.define(...), and the element's class sets innerHTML to define the shadow content as an HTML string.

This is valid HTML, which I'm attempting to process using PHP's DOMDocument. However, DOMDocument appears to be confused by the innerHTML string and starts treating its content as nodes it needs to process.

Is there any way to work around this so it no longer confuses DOMDocument?

The pertinent part of the HTML looks somewhat like this:

<script>
class ExampleElement extends HTMLElement {
  constructor() {
    super();
    this.attachShadow({ mode: 'open' })
      .innerHTML = '<label>this is what confuses DOMDocument</label>';
  }
}
window.customElements.define('example-element', ExampleElement);
</script>

This is then processed in PHP like this:

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

libxml then generates an error about the </label> not matching: "Unexpected end tag : label in Entity"

Obviously I can either:
- break up the innerHTML string using string concatenation, so that DOMDocument no longer identifies <label> and </label> as tags, or
- build the element's content via document.createElement(...) etc.

However, since this is valid HTML, it would be useful to know whether it can be parsed as it stands.


Solution

  • Per: https://bugs.php.net/bug.php?id=80095

    libxml uses HTML 4 rules, which say that </ inside a script starts an ending tag, even if the tag doesn't match the last opening tag. To avoid this problem, write the ending tags in your script as "<\/".

    So change </label> to <\/label>.

    The HTML will then parse cleanly, and JavaScript interprets \/ as a literal / inside the string.
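
As a quick check of the escaping claim, here is a minimal sketch comparing the two spellings of the closing tag. The names plain, escaped and sameMarkup are made up for illustration and are not from the original post or the bug report. It only shows that the JavaScript engine sees the same string either way; it does not exercise DOMDocument itself.

```javascript
// In a JavaScript string literal, "\/" is simply an escaped "/",
// so both literals below produce the same eight characters.
const plain = '</label>';    // contains "</", which trips up libxml inside <script>
const escaped = '<\/label>'; // no "</" in the source text, same string at runtime

// Hypothetical helper: true when both spellings yield identical markup.
function sameMarkup() {
  return plain === escaped && escaped.length === plain.length;
}

console.log(sameMarkup());
```

Because the two strings are identical at runtime, assigning the escaped form to innerHTML should produce the same shadow DOM content as the original.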