I am writing a user script that runs on https://example.net and makes fetch requests for HTML documents from https://example.com that I want to parse into HTML DOM trees.
The fetch API only gives me the raw HTML source. I can parse it myself using DOMParser, but I run into a problem with relative links. Suppose the document from https://example.com contains something like this:
<!DOCTYPE html>
<html>
<head>
<body>
<p> <a href="/foo">hello!</a>
If I obtain the DOM node for that body > p > a element and read its href property, the value I obtain will be https://example.net/foo. This is because DOMParser assigns the source location of the ambient document to the parsing result. I want to assign it the actual source of the document so that relative links resolve correctly.
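The difference comes down to which base URL the relative reference is resolved against. As a minimal illustration using the standard URL constructor (not DOMParser itself), the same href attribute value yields different absolute URLs depending on the base. The variable names here are only for illustration:

```javascript
// The raw attribute value from the example document.
const rawHref = "/foo";

// Base URL of the page running the user script (what the parsed document ends up using).
const ambientBase = "https://example.net/";
// Base URL of the document that was actually fetched.
const sourceBase = "https://example.com/";

const resolvedAgainstAmbient = new URL(rawHref, ambientBase).href;
const resolvedAgainstSource = new URL(rawHref, sourceBase).href;

console.log(resolvedAgainstAmbient); // https://example.net/foo
console.log(resolvedAgainstSource);  // https://example.com/foo
```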
Right now the only workarounds I can think of are:
- Injecting a <base> element into the DOM tree, which may interfere with a <base> tag present in the actual HTML source.
- Calling document.implementation.createHTMLDocument() and then .write(), which gives me a document with a blank source location, where relative links are at least not resolved incorrectly (but will not be resolved at all). Except this doesn't work in a user script: it throws a SecurityError.
- Using a Proxy to intercept accesses to the href property, which seems too heavyweight to comfortably fit in a user script.
I also realise that parsing HTML from Unicode text obtained by .text() will bypass the HTML encoding detection algorithm. I can live with that myself, because the site I am interested in exclusively uses UTF-8 correctly denoted in headers, but this is also a flaw that should be noted. Ideally, an HTML document ought to be parsed directly from a Blob or even a ReadableStream.
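On the encoding point: .text() always decodes the body as UTF-8, so one partial mitigation is to read the body as bytes and decode it with the charset declared in the Content-Type header. The following is only a sketch under that assumption. charsetFromContentType and decodeHtml are hypothetical helper names, and this does not implement the full HTML encoding detection algorithm (byte order marks, <meta charset>, and so on):

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value, falling back to UTF-8 when none is declared.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || "");
  return match ? match[1].toLowerCase() : "utf-8";
}

// Hypothetical helper: decode raw response bytes using the declared charset.
// TextDecoder throws a RangeError if the label is not a known encoding.
function decodeHtml(bytes, contentType) {
  const decoder = new TextDecoder(charsetFromContentType(contentType));
  return decoder.decode(bytes);
}

// Usage with fetch (not run here):
//   const response = await fetch("https://example.com");
//   const bytes = await response.arrayBuffer();
//   const html = decodeHtml(bytes, response.headers.get("Content-Type"));
```

The decoded string would still have to go through DOMParser, so this addresses only the encoding concern, not the base URL problem.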
Is there a better way to accomplish what I want?
Instead of using fetch, use XMLHttpRequest, which has the built-in capability to parse HTML into a Document.
You have to explicitly request a document by assigning the string "document" to the responseType property of the XMLHttpRequest object after calling open() but before calling send().
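const xhr = new XMLHttpRequest();
xhr.onload = () => {
  // With responseType "document", responseXML holds the parsed Document.
  console.log(
    Array.from(xhr.responseXML.links).map(({ href }) => href)
  );
};
xhr.open("GET", "https://example.com");
// Must be set after open() but before send().
xhr.responseType = "document";
xhr.send();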
In my tests, relative URLs were resolved to absolute URLs against the URL of the fetched document rather than the page running the script.
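If the rest of the user script is written in the promise style that fetch encourages, the same request can be wrapped in a promise. This is a minimal sketch, not a definitive implementation: fetchDocument is a hypothetical helper name, and it assumes XMLHttpRequest is available in the script's context and that the cross-origin request is permitted in the first place.

```javascript
// Hypothetical helper: load a URL and resolve with the parsed Document.
function fetchDocument(url) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.onload = () => {
      // xhr.response is null if the body could not be parsed as a document.
      if (xhr.status >= 200 && xhr.status < 300 && xhr.response) {
        resolve(xhr.response);
      } else {
        reject(new Error(`Request for ${url} failed with status ${xhr.status}`));
      }
    };
    xhr.onerror = () => reject(new Error(`Network error while requesting ${url}`));
    xhr.open("GET", url);
    xhr.responseType = "document"; // after open(), before send()
    xhr.send();
  });
}

// Usage (in a browser or user-script context, not run here):
//   const doc = await fetchDocument("https://example.com");
//   const hrefs = Array.from(doc.links, (a) => a.href);
```

Rejecting on non-2xx statuses is a design choice of this sketch; the original snippet logs the links regardless of status.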