java-8 fileinputstream bufferedinputstream bytearrayinputstream

Dangers/Guarantees for using ByteArrayInputStream to have "Correct" mark/reset behaviour

The question may be generic but I am trying to understand the major implications here.

I am trying to do some byte code engineering using the BCEL library, and part of the workflow requires me to read the same byte code file multiple times (from the beginning). The flow is the following:

// 1. Get Input Stream

// 2. Do some work

// 3. Finish

// 4. Do some other work.

At step 4, I will need to reset to the mark or get the stream as though I were reading it from the beginning. I know of the following choices.

1) Wrap the stream using BufferedInputStream - chance of getting "Resetting to invalid mark" IOException

2) Wrap it using ByteArrayInputStream - it always works even though some online research suggests that it's erroneous?

3) Simply call getInputStream() if I need to read from the stream again.

I am trying to understand which option would be better for me. I don't want to use BufferedInputStream because I have no clue where the last mark was set, so calling reset after reading past the mark's readlimit will cause an IOException. I would prefer using ByteArrayInputStream since it requires the minimum code change for me, but could anyone suggest whether option 2 or option 3 would be better?

I know that implementations for mark() and reset() are different for ByteArrayInputStream and BufferedInputStream in JDK.

Regards

Solution

The problem with mark/reset is not only that you have to know in advance the maximum amount of data that will be read between these calls; you also have to know whether the code you’re delegating to uses the feature internally, which would render your mark obsolete. A stream exposes only its most recent mark, so it’s impossible for code using mark/reset to remember and restore a previous mark for the caller.
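To illustrate the readlimit half of the problem, here is a minimal sketch using only java.io. The class name ReadLimitDemo, the method resetSucceeds and the 64-byte dummy array are made up for the illustration. The InputStream contract only says that reset() may fail once more than readlimit bytes have been read; the failing case below relies on how OpenJDK's BufferedInputStream drops the mark when its buffer fills up, so other implementations could behave differently.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadLimitDemo {
    /** Marks the start, reads bytesToRead bytes one at a time, then tries to reset. */
    static boolean resetSucceeds(int bufferSize, int readLimit, int bytesToRead) throws IOException {
        byte[] data = new byte[64]; // stand-in for the class file contents
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), bufferSize)) {
            in.mark(readLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read();
            }
            try {
                in.reset();
                return true;
            } catch (IOException e) {
                return false; // the mark was invalidated
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stayed within the readlimit: reset works.
        System.out.println(resetSucceeds(4, 2, 2));
        // Read well past the readlimit: on OpenJDK the mark is discarded.
        System.out.println(resetSucceeds(4, 2, 10));
        // Readlimit covers the whole input: the buffer grows instead.
        System.out.println(resetSucceeds(4, 64, 10));
    }
}
```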

So while you could fix the maximum issue by specifying the total file size as the readlimit, you can never rely on your mark still being valid after passing the InputStream to an arbitrary library function, unless that function explicitly documents that it never uses mark/reset internally.
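The following sketch shows how a callee can silently replace the caller's mark, even with a ByteArrayInputStream, where reset() never throws. The peek method is a hypothetical stand-in for library code that uses mark/reset internally; it is not taken from BCEL or any other real library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ClobberedMarkDemo {
    /** Hypothetical library routine that uses mark/reset internally to peek ahead. */
    static int peek(InputStream in) throws IOException {
        in.mark(1);           // replaces whatever mark the caller had set
        int next = in.read();
        in.reset();
        return next;
    }

    /** Returns the byte the caller sees after its own reset(). */
    static int readAfterReset() throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30, 40});
        in.mark(Integer.MAX_VALUE); // caller intends to come back to position 0
        in.read();                  // consumes 10
        in.read();                  // consumes 20
        peek(in);                   // the "library" marks position 2
        in.reset();                 // rewinds to position 2, not 0
        return in.read();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAfterReset()); // 30 rather than 10
    }
}
```

No exception is thrown here, which is arguably worse than the BufferedInputStream failure: the caller simply continues from the wrong position.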

Also, a BufferedInputStream getting a readlimit matching the total file size would not be more efficient than a ByteArrayInputStream wrapping an array holding the entire file, as both end up maintaining a buffer of the same size.

The best solution would be to read the entire class file into an array once and directly use the array, e.g. for code under your control or when you have a choice regarding the library (ASM’s ClassReader supports using a byte array instead of an InputStream, for example).

If you have to feed an InputStream to a library function that insists on one, like BCEL, wrap the byte array in a ByteArrayInputStream when needed, but create a new ByteArrayInputStream each time you have to re-parse the class file. Constructing a new ByteArrayInputStream costs almost nothing, as it is a lightweight wrapper that does not copy the array, and it is reliable, as it does not depend on the state of an older input stream in any way. You could even have multiple ByteArrayInputStream instances reading the same array at the same time.
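A minimal sketch of that approach, assuming the class file is available as a java.nio.file.Path. The holder class ClassBytes and its methods fromFile and newStream are hypothetical names chosen for the illustration; the actual parsing call (BCEL or otherwise) is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class ClassBytes {
    private final byte[] bytes;

    ClassBytes(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Reads the whole class file into memory exactly once. */
    static ClassBytes fromFile(Path classFile) throws IOException {
        return new ClassBytes(Files.readAllBytes(classFile));
    }

    /** Every call returns an independent stream positioned at the first byte. */
    InputStream newStream() {
        return new ByteArrayInputStream(bytes); // wraps the array, no copy
    }

    public static void main(String[] args) throws IOException {
        ClassBytes cb = new ClassBytes(new byte[] {1, 2, 3});

        InputStream first = cb.newStream();
        first.read();
        first.read();                       // first pass has advanced to position 2

        InputStream second = cb.newStream();
        System.out.println(second.read());  // second pass starts over at the first byte
    }
}
```

Each pass over the class file then calls newStream() and hands the result to the parser, so no pass depends on the mark or position left behind by a previous one.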

Calling getInputStream() again would be an option if you had to deal with really large files for which buffering the entire contents is not feasible. However, that is not the case for class files.