Tags: java, compression, lzma, apache-commons-compress

Getting java.io.EOFException while reading a SQLite file from temp directory


I am seeing an EOFException while reading a SQLite file from a temp directory. The following is the code for reading the file. The exception does not occur every time: out of roughly 50K files it occurs 3 to 4 times.

public static byte[] decompressLzmaStream(InputStream inputStream, int size) 
    throws CompressorException, IOException {

    if(size < 1) {
        size = 1024 * 100;
    }

    try(LZMACompressorInputStream lzmaInputStream = 
                                           new LZMACompressorInputStream(inputStream);
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(size)) {
        byte[] buffer = new byte[size];

        int length;
        while (-1 != (length = lzmaInputStream.read(buffer))) {
            byteArrayOutputStream.write(buffer, 0, length);
        }
        byteArrayOutputStream.flush();
        return byteArrayOutputStream.toByteArray();
    }
}
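
Since the failure is intermittent, one diagnostic step (a sketch using only standard Java, not part of the original code) is to peek at the 13-byte `.lzma` header before handing the stream to the decompressor: 1 properties byte, a 4-byte little-endian dictionary size, and an 8-byte little-endian uncompressed size. This lets you log whether the failing files even carry a plausible header:

```java
import java.io.IOException;
import java.io.PushbackInputStream;

public class LzmaHeaderPeek {

    /**
     * Peeks at the 13-byte .lzma header without consuming it and returns the
     * uncompressed-size field (bytes 5..12, little-endian), or -1 if the field
     * is all 0xFF (size unknown; the stream then ends with an end marker).
     * Diagnostic sketch only; the actual decoding is still done by the library.
     */
    public static long peekUncompressedSize(PushbackInputStream in) throws IOException {
        byte[] header = new byte[13];
        int n = 0;
        while (n < 13) {
            int r = in.read(header, n, 13 - n);
            if (r == -1) {
                in.unread(header, 0, n);
                throw new IOException("stream shorter than the 13-byte LZMA header");
            }
            n += r;
        }
        in.unread(header, 0, 13); // push the header back for the real decoder

        long size = 0;
        for (int i = 0; i < 8; i++) {
            size |= (header[5 + i] & 0xFFL) << (8 * i);
        }
        return size; // -1L means all size bytes were 0xFF (unknown size)
    }
}
```

Wrap the input in a `PushbackInputStream` with a buffer of at least 13 bytes; after the peek, the stream can be passed to `LZMACompressorInputStream` unchanged.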

I am using the following dependency for the decompression:

 <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.20</version>
</dependency>

The exception is thrown at the while (-1 != (length = lzmaInputStream.read(buffer))) { line. The following is the exception:

java.io.EOFException: null at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290) 
at org.tukaani.xz.rangecoder.RangeDecoderFromStream.normalize(Unknown Source) 
at org.tukaani.xz.rangecoder.RangeDecoder.decodeBit(Unknown Source) 
at org.tukaani.xz.lzma.LZMADecoder.decode(Unknown Source) 
at org.tukaani.xz.LZMAInputStream.read(Unknown Source) 
at org.apache.commons.compress.compressors.lzma.
    LZMACompressorInputStream.read(LZMACompressorInputStream.java:62) 
at java.io.InputStream.read(InputStream.java:101)  

Does anyone have any idea about the following constructors of commons-compress?

// I am using this constructor of LZMACompressorInputStream

public LZMACompressorInputStream(InputStream inputStream) throws IOException {
    this.in = new LZMAInputStream(this.countingStream = new CountingInputStream(inputStream), -1);
} 

// This was added in a later version of commons-compress; what is memoryLimitInKb?
public LZMACompressorInputStream(InputStream inputStream, int memoryLimitInKb) throws IOException {
    try {
        this.in = new LZMAInputStream(this.countingStream = new CountingInputStream(inputStream), memoryLimitInKb);
    } catch (MemoryLimitException var4) {
        throw new org.apache.commons.compress.MemoryLimitException((long)var4.getMemoryNeeded(), var4.getMemoryLimit(), var4);
    }
}

From what I have read, for LZMA streams we need to pass the uncompressed size to the constructor, see --> https://issues.apache.org/jira/browse/COMPRESS-286?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=14109417#comment-14109417


Solution

  • An LZMA decoder needs to know when the compressed stream ends. If the uncompressed size was known during compression, the header of the stream (located at its beginning) will contain the uncompressed size. When the output of the decoder reaches this size, the decoder knows that the end of the stream is reached. If the uncompressed size was not known during compression, the header will not contain the size. In this case the decoder assumes that the stream is explicitly terminated with an end-of-stream marker.

    Since LZMA streams are also used in container formats like 7z and xz, the LZMAOutputStream and LZMAInputStream classes also provide constructors for reading/writing streams without a header.

    COMPRESS-286 is about decompressing a 7z archive that contains an entry with LZMA compression. A 7z archive contains LZMA streams without a header; the information that is usually stored in the LZMA header is stored separately from the stream. Apache Commons' SevenZFile class for reading 7z archives creates LZMAInputStream objects with the following constructor:

    LZMAInputStream(InputStream in, long uncompSize, byte propsByte, int dictSize)
    

    The additional parameters of the constructor represent the information that is usually stored in the header at the beginning of the LZMA stream. The fix for COMPRESS-286 ensured that the uncompressed size (which was missing before) is also handed over to the LZMAInputStream.

    LZMACompressorInputStream also makes use of LZMAInputStream, but it assumes that the compressed stream contains an explicit header. Therefore it is not possible to hand this information over through its constructor.

    The memoryLimitInKb parameter only limits the memory used for decompression and has nothing to do with the uncompressed size. The main contributor to the required memory is the selected dictionary size. This size is specified during compression and is also stored in the header of the stream; its maximum value is 4 GiB. Usually the dictionary is smaller than the uncompressed size, and a dictionary larger than the uncompressed size is an outright waste of memory. A corrupted LZMA header can easily lead to an OutOfMemoryError, and a manipulated stream even opens the door to denial-of-service attacks. Therefore it is wise to limit the maximum memory usage when you read an unverified LZMA stream.
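
    As a rough illustration of how a memory limit relates to the dictionary size (the fixed overhead used here is a placeholder, not the library's exact bookkeeping), the check boils down to:

```java
public class LzmaMemoryCheck {
    // Sketch: decoder memory is dominated by the dictionary declared in the
    // header. The fixed overhead is an assumed placeholder value.
    private static final int FIXED_OVERHEAD_KIB = 10;

    /** Approximate decoder memory requirement in KiB for a given dictionary size. */
    public static int approxMemoryUsageKib(int dictSizeBytes) {
        return FIXED_OVERHEAD_KIB + dictSizeBytes / 1024;
    }

    /** Would a stream with this dictionary size pass a given memoryLimitInKb? */
    public static boolean fitsWithinLimit(int dictSizeBytes, int memoryLimitInKb) {
        return approxMemoryUsageKib(dictSizeBytes) <= memoryLimitInKb;
    }
}
```

    With a limit of, say, 10240 KiB, an 8 MiB dictionary passes while a 64 MiB dictionary (perhaps declared by a corrupted header) is rejected instead of being allocated.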

    To sum it up: since you are not reading a 7z archive with an LZMA-compressed entry, COMPRESS-286 has nothing to do with your issue. But the similar stack trace may be an indicator that something is wrong with the header of your stream.

    Ensure that the data is compressed with an instance of LZMACompressorOutputStream (it automatically selects the dictionary size and all other parameters, and ensures that a header is written). If you use LZMAOutputStream directly, make sure that you use an instance that actually writes a header.
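
    For reference, the header that a headered .lzma stream begins with is 13 bytes. The sketch below builds one in pure Java purely for illustration (the library writes this for you; lc, lp and pb are the standard LZMA literal/position parameters, encoded as (pb * 5 + lp) * 9 + lc):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LzmaHeaderLayout {
    /**
     * Builds the 13-byte .lzma header: 1 properties byte, a 4-byte
     * little-endian dictionary size, and an 8-byte little-endian uncompressed
     * size (-1 means unknown; the stream then ends with an end marker).
     */
    public static byte[] buildHeader(int lc, int lp, int pb,
                                     int dictSize, long uncompressedSize) {
        byte props = (byte) ((pb * 5 + lp) * 9 + lc);
        return ByteBuffer.allocate(13)
                .order(ByteOrder.LITTLE_ENDIAN)
                .put(props)
                .putInt(dictSize)
                .putLong(uncompressedSize)
                .array();
    }
}
```

    With the default parameters lc=3, lp=0, pb=2 the properties byte comes out as 0x5D, which is the first byte you will typically see in a healthy .lzma stream.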