I parse files with the great Apache Tika library. I want to extract the meta tags with my own parser, then get only the content of the <body>-tag as HTML and store it in a database.
I have tried this now for hours/days :-(, but cannot find a solution:
- ToHTMLContentHandler: after the <body>-tag I get exceptions about an invalid namespace when the <html>-tag is missing.
- BodyContentHandler: just returns the body text without HTML tags.
- tika-app seems to use a TransformerHandler to get HTML (I had never heard of this kind of handler before).

Can I use a TransformerHandler to get just the HTML from the <body>-tag and parse the meta tags myself? Is this a better way than using a ToHTMLContentHandler?

Check whether the following link helps a bit:
Content Detection, Metadata and Content Extraction with Apache Tika
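
In case it helps, here is a minimal sketch of the TransformerHandler idea from the question, using only JDK classes. The names BodyHtmlSketch, BodyOnlyHandler and bodyAsHtml are made up for this example and are not part of Tika. The filter forwards only the SAX events from <body> downwards and drops the namespace URI, which is one way to avoid namespace complaints once the <html>-tag is gone. It keeps the <body> wrapper so that the output has a single root element. A JDK SAX parser stands in for Tika here; with Tika you would presumably pass the same handler as the ContentHandler argument of parser.parse(stream, handler, metadata, context), and the meta tags would be read separately from the Metadata object or by your own parser. I have not run this against Tika itself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BodyHtmlSketch {

    /** Hypothetical filter: passes on <body> and everything nested in it, nothing else. */
    static class BodyOnlyHandler extends DefaultHandler {
        private final ContentHandler out;
        private int depth = 0; // 0 means we are outside <body>

        BodyOnlyHandler(ContentHandler out) {
            this.out = out;
        }

        @Override
        public void startDocument() throws SAXException {
            out.startDocument();
        }

        @Override
        public void endDocument() throws SAXException {
            out.endDocument();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (depth > 0 || "body".equals(localName)) {
                depth++;
                // Empty namespace URI: the XHTML namespace is deliberately not forwarded.
                out.startElement("", localName, localName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (depth > 0) {
                out.endElement("", localName, localName);
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (depth > 0) {
                out.characters(ch, start, length);
            }
        }
    }

    /** Parses an XHTML string and returns the <body> element serialized as HTML. */
    static String bodyAsHtml(String xhtml) {
        try {
            StringWriter writer = new StringWriter();
            SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler serializer = tf.newTransformerHandler();
            serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            serializer.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
            serializer.setResult(new StreamResult(writer));

            // Stand-in for Tika: any source of XHTML SAX events would do here.
            SAXParserFactory pf = SAXParserFactory.newInstance();
            pf.setNamespaceAware(true);
            pf.newSAXParser().parse(new InputSource(new StringReader(xhtml)),
                    new BodyOnlyHandler(serializer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize body", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Ignored</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(bodyAsHtml(xhtml));
    }
}
```

If the hand-written filter feels like too much, Tika's own BodyContentHandler also has a constructor that takes another ContentHandler, so wrapping a ToXMLContentHandler in it may be worth trying as a shorter route to the same result.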