
Get the rendered text from HTML (Delphi)


I have some HTML and I need to extract the actual written text from the page.

So far I have tried rendering the page in a web browser control, then reading the text from its document property. This works, but only where that browser is supported (the IE COM object). The problem is that I also want this to run under Wine, so I need a solution that doesn't use IE COM.

There must be a reasonable programmatic way to do this.


Solution

  • I'm not sure what the recommended way of parsing HTML in Delphi is, but if it were me, I'd be tempted to just bundle a copy of html2text (either the older C++ program by that name or the newer Python program) and spawn a call to one of those.

    You can turn the Python html2text into an executable using py2exe. Both html2text programs are licensed under the GPL, but as long as you merely bundle their executable alongside your app and make their source available as the GPL requires, you ought to be okay.
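
To give a feel for what such a bundled helper does, here is a rough sketch of a text extractor built only on Python's standard-library `html.parser`. It is a stand-in for illustration, not html2text itself and not its API: the names `TextExtractor`, `html_to_text`, `SKIP_TAGS`, `BLOCK_TAGS` and `main` are made up for this example, and the tag lists are an assumption about what counts as "rendered" text. A real browser or html2text handles far more cases (tables, CSS-hidden elements, malformed markup).

```python
import sys
from html.parser import HTMLParser

# Assumption: contents of these tags are never shown as page text.
SKIP_TAGS = {"script", "style", "title"}
# Assumption: these tags start or end a line when rendered.
BLOCK_TAGS = {"p", "div", "br", "li", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text and marks line breaks at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def main():
    """Filter mode: read HTML on stdin, write plain text to stdout."""
    sys.stdout.write(html_to_text(sys.stdin.read()) + "\n")
```

Calling `main()` from the script's entry point turns this into a stdin-to-stdout filter. The Delphi side would then spawn the resulting executable, write the HTML to its standard input and read the text back from its standard output, which keeps the IE COM dependency out of the picture.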