java xml-parsing apache-tika

Parse meta tags and get HTML content from the body with Tika


I parse files with the great Apache Tika library. I want to extract the meta tags with my own parser, then get only the content of the <body> tag as HTML and store it in a database.

I have been trying this for hours/days :-( but cannot find a solution:


Solution

  • Check whether the following links help:

    Content Detection, Metadata and Content Extraction with Apache Tika

    Parsing HTML with Apache Tika
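
    In case the links are not enough, here is a minimal sketch of one way to get both pieces in a single pass. It assumes Tika's HtmlParser is on the classpath and relies on wrapping a ToXMLContentHandler in a BodyContentHandler, so only what is inside <body> gets serialized as markup. The class name TikaBodyExtractor, the Result holder and the sample HTML are made up for illustration; they are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

public class TikaBodyExtractor {

    /** Hypothetical holder for the two values the question asks for. */
    public static final class Result {
        public final Map<String, String> meta;
        public final String bodyHtml;

        Result(Map<String, String> meta, String bodyHtml) {
            this.meta = meta;
            this.bodyHtml = bodyHtml;
        }
    }

    public static Result extract(InputStream in) throws Exception {
        Metadata metadata = new Metadata();

        // ToXMLContentHandler serializes SAX events back to markup;
        // BodyContentHandler forwards only events from inside <body>.
        ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());

        new HtmlParser().parse(in, handler, metadata, new ParseContext());

        // Copy everything Tika collected (meta tags, title, content type, ...)
        // into a plain map so it can be handed to your own code.
        Map<String, String> meta = new LinkedHashMap<>();
        for (String name : metadata.names()) {
            meta.put(name, metadata.get(name));
        }
        return new Result(meta, handler.toString());
    }

    public static void main(String[] args) throws Exception {
        // Sample input, invented for this example.
        String html = "<html><head><title>Demo</title>"
                + "<meta name=\"description\" content=\"A test page\">"
                + "</head><body><p>Hello <b>Tika</b></p></body></html>";

        Result r = extract(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));

        r.meta.forEach((k, v) -> System.out.println(k + " = " + v));
        System.out.println(r.bodyHtml);
    }
}
```

    Note that what comes out is Tika's normalized XHTML rather than the original markup byte for byte, and by default the HTML parser drops some elements and attributes. If you need the body closer to the source, registering IdentityHtmlMapper.INSTANCE for HtmlMapper.class in the ParseContext may help, but I have not verified that against your files.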