I need one specific 'div' tag (identified by its 'id') from an HTML page. To parse the page I'm using cyberneko.
def doc = new XmlParser( new org.cyberneko.html.parsers.SAXParser() ).parse(htmlFile)
divTag = doc.depthFirst().DIV.find{ it['@id'] == tagId }
So far no problem, but in the end I don't need the parsed XML; I need the original content of the whole 'div' tag. Unfortunately I can't figure out how to do this...
EDIT: Responding to first comment.
This works:
def html = """
<body>
<div id="breadcrumbs">
<b>
crumb1
</b>
</div>
</body>
"""
def doc = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(html)
divTag = doc.BODY.DIV.find { it.@id == 'breadcrumbs' }
println "" << new groovy.xml.StreamingMarkupBuilder().bind {xml -> xml.mkp.yield divTag}
It looks like cyberneko will return a well-formed HTML document, regardless of whether the original markup was well formed. That is, doc's root will be an HTML element, and there will also be a HEAD element. Neat.
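If you would rather get the tag back as a plain String than stream it to println, groovy.xml.XmlUtil.serialize is another option. The following is only a minimal sketch under a few assumptions: the sample markup is well formed, so the stock XmlSlurper is used and the snippet runs without cyberneko on the classpath; the second div and the variable name serialized are made up for illustration. Note that the result is re-serialized from the parsed tree, so whitespace and quoting may differ from the original source.

```groovy
import groovy.xml.XmlUtil

// Assumption: well-formed sample input, so the stock XmlSlurper is enough.
// With the cyberneko SAXParser the element names would be upper-case
// (doc.BODY.DIV instead of doc.div).
// Assumption: Groovy 2.x/3.x, where XmlSlurper resolves without an import
// (Groovy 4 needs `import groovy.xml.XmlSlurper`).
def html = '''<body>
<div id="breadcrumbs"><b>crumb1</b></div>
<div id="other">ignored</div>
</body>'''

def doc = new XmlSlurper().parseText(html)

// doc is the root <body> element here, so its div children are reached directly
def divTag = doc.div.find { it.@id == 'breadcrumbs' }

// serialize returns the element as a String, preceded by an XML declaration
String serialized = XmlUtil.serialize(divTag)
println serialized
```

With cyberneko in place, the same serialize call should work on the divTag found via doc.BODY.DIV, though I have only sketched the plain XmlSlurper case here.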