java html parsing jericho-html-parser

Jericho-html: is it possible to extract text with reference to positions in source file?


I use Jericho HTML Parser 3.1.

I need to extract the text from an HTML document, process it, and then, based on the result, insert tags into the original HTML.

For that I need a mapping between the extracted text and the source HTML.

net.htmlparser.jericho.TextExtractor extracts the text quite well, but I could not find a way to get the position of the extracted text in the original file.

Is it possible to do so with Jericho-html?


Solution

  • You can't do this with the TextExtractor as it stands. I've needed to do similar things in the past, and the simplest solution is to copy Jericho's TextExtractor implementation and edit it to add your own custom behaviour. It's a fairly simple class, so you should easily see where to add your own hooks.
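To make the idea concrete, here is a minimal sketch of the kind of hook such a copy could add: every time a character is appended to the extracted text, its offset in the source is recorded alongside it. This is not Jericho's code and does not use its API. The class `OffsetTrackingExtractor` and its methods are hypothetical, and the naive `<`/`>` scanner merely stands in for the parser. It does not decode entities, collapse whitespace, or skip comments and scripts, all of which a real extractor handles. In a copied TextExtractor, the same bookkeeping would presumably go wherever the class appends a segment's text, using the segment's begin offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a modified text extractor that remembers source offsets. */
final class OffsetTrackingExtractor {
    private final StringBuilder text = new StringBuilder();
    // sourceOffsets.get(i) is the index in the HTML source of extracted character i.
    private final List<Integer> sourceOffsets = new ArrayList<>();

    OffsetTrackingExtractor(String html) {
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;               // naive: everything up to '>' is markup
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                text.append(c);             // the "hook": record where c came from
                sourceOffsets.add(i);
            }
        }
    }

    String getText() {
        return text.toString();
    }

    /** Maps an index in the extracted text back to an index in the source. */
    int toSourceOffset(int textIndex) {
        return sourceOffsets.get(textIndex);
    }

    /** Wraps the first occurrence of needle (searched in the extracted text) in the source. */
    static String wrap(String html, String needle, String open, String close) {
        if (needle.isEmpty()) {
            return html;
        }
        OffsetTrackingExtractor extractor = new OffsetTrackingExtractor(html);
        int idx = extractor.getText().indexOf(needle);
        if (idx < 0) {
            return html;                    // nothing to wrap
        }
        int start = extractor.toSourceOffset(idx);
        int end = extractor.toSourceOffset(idx + needle.length() - 1) + 1;
        StringBuilder out = new StringBuilder(html);
        out.insert(end, close);             // insert the later position first so
        out.insert(start, open);            // the earlier offset stays valid
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b>!</p>";
        System.out.println(wrap(html, "world", "<em>", "</em>"));
    }
}
```

Running `main` prints `<p>Hello <b><em>world</em></b>!</p>`. Note that if the matched text spans an element boundary in the source, inserting a tag pair this way can produce improperly nested markup, so a real implementation would need to split the inserted tags around such boundaries.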