java inputstream apache-tika boilerpipe

Can't read the same InputStream twice


This is my code:

    // getFile() method returns the input stream of a local or online file
    InputStream fileStream = getFile(source);
    // Convert an InputStream to an InputSource
    org.xml.sax.InputSource fileSource = new org.xml.sax.InputSource(fileStream);
    // Extract text via the Boilerpipe DefaultExtractor
    String text = DefaultExtractor.INSTANCE.getText(fileSource);

    // Extract text and metadata via Apache Tika
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    AutoDetectParser parser = new AutoDetectParser();
    parser.parse(fileStream, handler, metadata, context);

Only the first extractor (Boilerpipe) works. The second one (Apache Tika) is not able to extract anything, and I can't figure out why.

I tried to create a copy of fileStream (via InputStream fileStream2 = fileStream;) and to pass fileStream to one reader and fileStream2 to the other, but that didn't work either.

I also tried passing the HTML extracted from fileStream to Boilerpipe and fileStream itself to Tika, but the result was the same.

I suspect that the problem is that the same InputStream cannot be read twice.
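That suspicion can be checked with a small stand-alone sketch. It uses a ByteArrayInputStream as a stand-in for whatever getFile() returns, and the class name and the drain() helper are made up for this illustration. It shows two things: assigning the stream to a second variable only creates a second reference to the same object, and once the stream has been read to the end, a further read finds nothing left.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamExhaustionDemo {

    // Hypothetical helper: reads the stream to the end and returns
    // the number of bytes that were consumed.
    static int drain(InputStream in) throws IOException {
        byte[] buffer = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8);

        // Stand-in for getFile(source); an assumption for this demo
        InputStream fileStream = new ByteArrayInputStream(data);

        // This does not copy anything: both variables refer to the same stream
        InputStream fileStream2 = fileStream;
        System.out.println("same object: " + (fileStream == fileStream2));

        // The first consumer reads every byte
        System.out.println("first read got everything: " + (drain(fileStream) == data.length));

        // The second consumer starts at the end of the stream and gets nothing
        System.out.println("second read got nothing: " + (drain(fileStream2) == 0));
    }
}
```

All three lines print true, which matches the behaviour in the code above: whichever extractor runs first consumes the stream, and the second one sees an empty input.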

How can I pass the content of one InputStream to two readers?

EDIT: I found the solution and posted it below.


Solution

  • I found out that an InputStream can't be read twice, which is what Boilerpipe and Tika were doing in my old code. So I now read fileStream once and convert it to a String, pass that String to Boilerpipe, then convert the String to a ByteArrayInputStream and pass that to Tika. This is my new code:

    // getFile() method returns the input stream of a local or online file
    InputStream fileStream = getFile(source);
    
    // Read the whole InputStream into a String (readFromStream() is a
    // helper method, not shown here) and pass it to the Boilerpipe
    // DefaultExtractor in order to extract the text
    String html = readFromStream(fileStream);
    String text = DefaultExtractor.INSTANCE.getText(html);
    
    // Convert the value read from fileStream to a new ByteArrayInputStream
    fileStream = new ByteArrayInputStream(html.getBytes("UTF-8"));
    
    // Extract text and metadata via Apache Tika
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    AutoDetectParser parser = new AutoDetectParser();
    parser.parse(fileStream, handler, metadata, context);
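A possible variation, sketched here under a few assumptions: going through a String works when the source really is UTF-8 text, but Tika can also parse binary formats, and decoding arbitrary bytes to a String and re-encoding them can change those bytes. Buffering the raw bytes once and giving each extractor its own ByteArrayInputStream avoids that round trip. The class name and the toByteArray() helper below are made up for this illustration, and the Boilerpipe and Tika calls appear only as comments so the sketch runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class ReusableBytes {

    // Hypothetical helper: reads the whole stream into memory.
    // Only sensible when the content fits comfortably in RAM.
    static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getFile(source); includes bytes that are not valid UTF-8
        byte[] sample = {(byte) 0x25, (byte) 0x50, (byte) 0xFF, (byte) 0x00};
        InputStream fileStream = new ByteArrayInputStream(sample);

        // Read the original stream exactly once
        byte[] content = toByteArray(fileStream);

        // Each consumer gets its own independent stream over the same bytes
        InputStream forBoilerpipe = new ByteArrayInputStream(content);
        InputStream forTika = new ByteArrayInputStream(content);

        // With the libraries on the classpath, the calls from the question
        // would become something like:
        //   DefaultExtractor.INSTANCE.getText(new org.xml.sax.InputSource(forBoilerpipe));
        //   parser.parse(forTika, handler, metadata, context);

        // Both streams deliver the complete, unmodified content
        System.out.println(Arrays.equals(toByteArray(forBoilerpipe), sample));
        System.out.println(Arrays.equals(toByteArray(forTika), sample));
    }
}
```

The trade-off is memory: the whole file is held in a byte array, so for very large inputs it may be better to write the content to a temporary file and open it twice.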