
TagSoup and XPath

I'm trying to use TagSoup with XPath (JAXP). I know how to obtain a SAX parser (or an XMLReader) from TagSoup, but I can't find how to create a DocumentBuilder that uses that SAX parser. How do I do that?

Thank you.

EDIT: Sorry for being so general, but the Java XML API is such a pain.


Problem solved:

public static void main(String[] args) throws XPathExpressionException, IOException,
        SAXNotRecognizedException, SAXNotSupportedException,
        TransformerFactoryConfigurationError, TransformerException {

    XPathFactory xpathFac = XPathFactory.newInstance();
    XPath xpath = xpathFac.newXPath();

    InputStream input = new FileInputStream("/tmp/g.html");

    // TagSoup's Parser is an XMLReader; namespace processing is disabled
    // so that unprefixed XPath names such as "span" match the elements
    XMLReader reader = new Parser();
    reader.setFeature(Parser.namespacesFeature, false);

    // An identity transform copies the SAX events into a DOM tree
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    DOMResult result = new DOMResult();
    transformer.transform(new SAXSource(reader, new InputSource(input)), result);

    Node htmlNode = result.getNode();
    NodeList nodes = (NodeList) xpath.evaluate("//span", htmlNode, XPathConstants.NODESET);
}
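
For anyone who wants to try the SAX-to-DOM step without TagSoup on the classpath, here is a minimal sketch of the same idea using only the JDK. The class name SaxToDomSketch and the helpers toDom and count are made up for illustration, and the JDK's own namespace-aware SAX parser stands in for TagSoup's Parser, so unlike TagSoup it only accepts well-formed markup.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToDomSketch {

    /** Feeds the SAX events of any XMLReader into a DOM tree via an identity transform. */
    public static Node toDom(XMLReader reader, InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(reader, input), result);
        return result.getNode();
    }

    /** Counts the nodes that an XPath expression selects in well-formed markup. */
    public static int count(String markup, String expression) {
        try {
            // Stand-in reader (assumption): with TagSoup this would be
            // new org.ccil.cowan.tagsoup.Parser() with namespaces turned off.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            Node root = toDom(reader, new InputSource(new StringReader(markup)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, root, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Wrapped so callers do not have to declare the JAXP checked exceptions
            throw new IllegalStateException("parsing or XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><span>a</span><p><span>b</span></p></body></html>";
        System.out.println(count(markup, "//span"));
    }
}
```

The point is that no DocumentBuilder is involved at all: anything that implements XMLReader can be wrapped in a SAXSource, and the identity Transformer builds the DOM that XPath then queries.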


Link that helped me:


  • Java XML API is such a pain

    Indeed it is. Consider moving to XSLT 2.0 / XPath 2.0 and using Saxon's s9api interface instead. It would look roughly like this:

    // false = no licensed (PE/EE) Saxon features required
    Processor proc = new Processor(false);
    InputStream input = new FileInputStream("/tmp/g.html");
    XMLReader reader = new Parser();
    reader.setFeature(Parser.namespacesFeature, false);
    Source source = new SAXSource(reader, new InputSource(input));
    // DocumentBuilder here is net.sf.saxon.s9api.DocumentBuilder, not the JAXP one
    DocumentBuilder builder = proc.newDocumentBuilder();
    XdmNode doc = builder.build(source);
    XPathCompiler compiler = proc.newXPathCompiler();
    XdmValue result = compiler.evaluate("//span", doc);