apache-tika

Tika configuration for ZIP parser


Is it possible to tell Tika or the parser that a ZIP can only contain files with a certain MimeType or file extension?

What I currently use is the recursive parser wrapper to get all the information for every file:

     final ParseContext context = new ParseContext();
     final ContentHandlerFactory contentHandlerFactory = new BasicContentHandlerFactory( BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1 );
     final RecursiveParserWrapperHandler recursiveParserWrapperHandler = new RecursiveParserWrapperHandler( contentHandlerFactory );
     final RecursiveParserWrapper parser = new RecursiveParserWrapper( autoDetectParser );
     context.set( Parser.class, parser );
     parser.parse( tikaInputStream, recursiveParserWrapperHandler, metadata, context );

I am looking for a way to ensure that the ZIP contains only one file type and no nested ZIP or other container. Currently I check this by hand for each Metadata object, but maybe there is a better solution. A more robust approach would make sense, especially with regard to zip bombs.

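        // metadata1 is the Metadata of one document produced by the recursive parse
        final String contentType = metadata1.get( Metadata.CONTENT_TYPE );
        final MediaType mediaType = MediaType.parse( contentType ); // null if the type is missing or unparseable
        final MediaType expectedMediaType = MediaType.text( "turtle" );
        final String depth = metadata1.get( TikaCoreProperties.EMBEDDED_DEPTH );

        if ( MediaType.APPLICATION_ZIP.equals( mediaType ) ) {
           // Only the outer container may be a ZIP (assumption: a missing depth means depth 0)
           if ( depth != null && Integer.parseInt( depth ) > 0 ) {
              throw new RuntimeException( "Not allowed depth path" );
           }
           return;
        }

        // Compare base types so that parameters such as "; charset=UTF-8" do not cause a mismatch
        if ( mediaType == null || !expectedMediaType.equals( mediaType.getBaseType() ) ) {
           throw new RuntimeException( "Not allowed media type" );
        }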

Solution

  • I added my own RecursiveParserWrapperHandler. The following example throws an exception when the maximum number of embedded entries is reached.

    public class ZipHandler extends RecursiveParserWrapperHandler {
    
      private static final int MAX_EMBEDDED_ENTRIES = 10000;
    
      public ZipHandler( final ContentHandlerFactory contentHandlerFactory ) {
         super( contentHandlerFactory, MAX_EMBEDDED_ENTRIES );
      }
    
      @Override
      public void endDocument( final ContentHandler contentHandler, final Metadata metadata ) throws SAXException {
         if ( hasHitMaximumEmbeddedResources() ) {
            throw new SAXException( "Max embedded entries reached: " + MAX_EMBEDDED_ENTRIES );
         }
         super.endDocument( contentHandler, metadata );
      } 
    }
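
The handler above caps the number of embedded entries, but it does not by itself restrict entry types or the total uncompressed size. One possible complement is a cheap pre-check with the JDK's own `java.util.zip` classes before the stream is handed to Tika. The following is only a sketch under stated assumptions: the class name `ZipPreCheck`, the helper `zipOf`, and both limits are hypothetical, and the check looks at entry names only, so the content-type check on the Tika metadata is still needed to catch mislabeled files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipPreCheck {

    // Hypothetical limits; tune them to the expected uploads.
    static final int MAX_ENTRIES = 10_000;
    static final long MAX_TOTAL_BYTES = 100L * 1024 * 1024;

    // Rejects the archive if any file entry lacks the allowed extension,
    // or if the entry count or total uncompressed size exceeds the limits.
    // Note: this consumes and closes the given stream.
    static void validate(InputStream in, String allowedExtension) throws IOException {
        int entries = 0;
        long totalBytes = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                if (++entries > MAX_ENTRIES) {
                    throw new IOException("Too many entries in archive");
                }
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (!name.endsWith(allowedExtension)) {
                    throw new IOException("Entry not allowed: " + entry.getName());
                }
                // Count the bytes actually inflated instead of trusting the declared size.
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    totalBytes += read;
                    if (totalBytes > MAX_TOTAL_BYTES) {
                        throw new IOException("Uncompressed size limit exceeded");
                    }
                }
            }
        }
    }

    // Demo helper: builds a small in-memory ZIP with the given entry names.
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("data".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        validate(new ByteArrayInputStream(zipOf("a.ttl", "dir/b.ttl")), ".ttl");
        System.out.println("accepted");
        try {
            validate(new ByteArrayInputStream(zipOf("a.ttl", "nested.zip")), ".ttl");
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because `validate` consumes the stream, the caller would need to buffer the upload (or reopen the file) before passing it on to the RecursiveParserWrapper. A nested archive such as `nested.zip` is rejected simply because its name does not end with the allowed extension.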