javascript html domparser

How can I parse HTML into a DOM tree, taking into account its origin location?


I am writing a user script that runs on https://example.net and makes fetch requests for HTML documents from https://example.com that I want to parse into HTML DOM trees.

The fetch API only gives me the raw HTML source. I can parse it myself using DOMParser, but I run into a problem with relative links. Suppose the document from https://example.com contains something like this:

<!DOCTYPE html>
<html>
  <head>
  <body>
    <p> <a href="/foo">hello!</a>

If I obtain the DOM node for that body > p > a element and read its href property, the value I obtain will be https://example.net/foo. This is because DOMParser assigns the source location of the ambient document to the parsing result. I want to assign it the actual source of the document so that relative links resolve correctly.
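To make the mismatch concrete, here is a minimal sketch of the resolution rule at play, using the standard URL constructor. The variable names are only illustrative; the point is that the same relative reference yields a different absolute URL depending on which base it is resolved against.

```javascript
// The href attribute value from the example document.
const relativeHref = "/foo";

// Base that DOMParser ends up using: the page the user script runs on.
const ambientBase = "https://example.net/";
// Base the link was actually written against: the fetched document's URL.
const sourceBase = "https://example.com/";

const resolvedAgainstAmbient = new URL(relativeHref, ambientBase).href;
const resolvedAgainstSource = new URL(relativeHref, sourceBase).href;

console.log(resolvedAgainstAmbient); // https://example.net/foo (what I get)
console.log(resolvedAgainstSource);  // https://example.com/foo (what I want)
```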

Right now the only workarounds I can think of are:

I also realise that parsing HTML from Unicode text obtained by .text() will bypass the HTML encoding detection algorithm. I can live with that myself, because the site I am interested in exclusively uses UTF-8 correctly denoted in headers, but this is also a flaw that should be noted. Ideally, an HTML document ought to be parsed directly from a Blob or even a ReadableStream.

Is there a better way to accomplish what I want?


Solution

  • Instead of using fetch, use XMLHttpRequest, which has the built-in capability to parse HTML into a Document.

    You have to explicitly request a document by assigning the string "document" to the responseType property of the XMLHttpRequest object after calling open() but before calling send().

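    const xhr = new XMLHttpRequest();
    xhr.onload = () => {
      // responseXML holds the parsed Document; log the resolved href of every link.
      console.log(
        Array.from(xhr.responseXML.links).map(({ href }) => href)
      );
    };
    xhr.open("GET", "https://example.com");
    // Ask for a parsed Document instead of raw text.
    xhr.responseType = "document";
    xhr.send();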

    In my tests, relative URLs are resolved to absolute URLs against the URL of the fetched document, not against the page the script runs on.
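If you would rather keep a fetch-like, promise-based call site, the same request can be wrapped in a small helper. This is only a sketch: `fetchDocument` is a hypothetical name, not a built-in, and the optional `XHR` parameter is an assumption added so the helper can be exercised with a stand-in constructor outside a browser.

```javascript
// Hypothetical helper: resolves with the parsed Document for `url`.
// `XHR` defaults to the browser's XMLHttpRequest; it is injectable only
// so that the helper can be tried out with a stand-in constructor.
function fetchDocument(url, XHR = globalThis.XMLHttpRequest) {
  return new Promise((resolve, reject) => {
    const xhr = new XHR();
    xhr.onload = () => {
      // `response` is null when the body could not be parsed as a document.
      if (xhr.status >= 200 && xhr.status < 300 && xhr.response) {
        resolve(xhr.response);
      } else {
        reject(new Error(`Request for ${url} failed with status ${xhr.status}`));
      }
    };
    xhr.onerror = () => reject(new Error(`Network error while requesting ${url}`));
    xhr.open("GET", url);
    xhr.responseType = "document";
    xhr.send();
  });
}

// Intended usage in the user script:
//   const doc = await fetchDocument("https://example.com");
//   console.log(Array.from(doc.links, ({ href }) => href));
```

The status check is a design choice of this sketch rather than something XMLHttpRequest requires; drop it if you also want to parse error pages.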