This is my code:
// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);
// Convert an InputStream to an InputSource
org.xml.sax.InputSource fileSource = new org.xml.sax.InputSource(fileStream);
// Extract text via the Boilerpipe DefaultExtractor
String text = DefaultExtractor.INSTANCE.getText(fileSource);
// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);
I can't figure out why only the first extractor works.
Boilerpipe (the first extractor) returns text, while Apache Tika (the second extractor) extracts nothing.
I tried to create a copy of fileStream (via InputStream fileStream2 = fileStream;) and to pass fileStream to one reader and fileStream2 to the other, but that didn't work either.
I also tried passing the HTML extracted from fileStream to Boilerpipe, and fileStream to Tika, but the result was the same.
I suspect the problem is that the same InputStream cannot be read twice.
Could you please show me how to pass the content of one InputStream to two readers?
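To check the suspicion about reading the same InputStream twice, here is a minimal sketch that uses only JDK classes. The class name StreamReadTwice and the drain() helper are made up for illustration, and a ByteArrayInputStream stands in for whatever getFile(source) returns. It suggests two things: assigning fileStream to fileStream2 copies only the reference, and a stream that has been read to the end yields nothing on a second pass.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamReadTwice {

    // Hypothetical helper: reads the stream to its end and returns the byte count.
    public static int drain(InputStream in) throws IOException {
        byte[] buffer = new byte[1024];
        int total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getFile(source); the real stream may come from disk or the network.
        byte[] data = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        InputStream fileStream = new ByteArrayInputStream(data);

        // This is an alias for the same object, not a second stream.
        InputStream fileStream2 = fileStream;

        int first = drain(fileStream);    // consumes every byte
        int second = drain(fileStream2);  // nothing left to read

        System.out.println("same object: " + (fileStream == fileStream2));
        System.out.println("first pass read all bytes: " + (first == data.length));
        System.out.println("second pass read nothing: " + (second == 0));
    }
}
```

If that matches what Boilerpipe and Tika do with the stream, the second consumer would always find it already at its end.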
EDIT: I found a solution and posted it below.
I found out that an InputStream can't be read twice, which is what Tika and Boilerpipe were doing in my old code.
So I now read fileStream once and convert its content to a String, pass that String to Boilerpipe, then convert the String to a ByteArrayInputStream and pass that to Tika.
This is my new code:
// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);
// Read the value of the InputStream and pass it to the
// Boilerpipe DefaultExtractor in order to extract the text
String html = readFromStream(fileStream);
String text = DefaultExtractor.INSTANCE.getText(html);
// Convert the value read from fileStream to a new ByteArrayInputStream
fileStream = new ByteArrayInputStream(html.getBytes("UTF-8"));
// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);
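A possible variation on the code above, sketched with JDK classes only: keep the content as a byte array instead of a String and give each extractor its own ByteArrayInputStream. The class name BufferedSource and the readAllBytes() helper are my own names for illustration, not part of Tika or Boilerpipe, and the sketch assumes the whole file fits in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BufferedSource {

    // Hypothetical helper: reads the whole stream into memory exactly once.
    public static byte[] readAllBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getFile(source).
        InputStream fileStream = new ByteArrayInputStream(
                "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8));

        // Read the original stream once.
        byte[] content = readAllBytes(fileStream);

        // Each consumer gets an independent stream over the same bytes.
        InputStream forBoilerpipe = new ByteArrayInputStream(content);
        InputStream forTika = new ByteArrayInputStream(content);

        // Both streams deliver the full content, regardless of read order.
        System.out.println(readAllBytes(forBoilerpipe).length == content.length);
        System.out.println(readAllBytes(forTika).length == content.length);
    }
}
```

In my code, forBoilerpipe would be wrapped in an org.xml.sax.InputSource as in the first version, and forTika would go to parser.parse(...). Keeping the raw bytes avoids assuming the file is UTF-8 text; that probably matters when the source is not HTML (a PDF, for example), since decoding to a String and encoding back can alter the bytes Tika sees.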