java gzip ioexception gzipstream gzipinputstream

Failing GZIPInputStream

My Java app writes a gzip-compressed object to a file using a try-with-resources statement. The object is very basic, with a couple of primitive fields and an ArrayList of Integers. It has no Strings or more complex objects within. On some machines, once the object has been written out, reading it back fails with an error saying the file is not in proper gzip format. When I examine the file, it is full of zero values.

Here is the code that compresses and writes out the object:

public static void write_SerialisedCompressed_Object(String folder, String fileName, Object objectToBeSerialised) {
    File node = newFile(folder, fileName);
    try (OutputStream os = Files.newOutputStream(node.toPath());
         GZIPOutputStream gOS = new GZIPOutputStream(os);
         ObjectOutputStream oOS = new ObjectOutputStream(gOS)) {
        oOS.writeObject(objectToBeSerialised);
    } catch (IOException ex) {
        Base.print(ex.getMessage(), Base.TEXT_TYPE.ERROR);
    }
}

Here is the code that reads the file:

public static Object readCompressedSerialisedObject_File(String folder, String fileName) {
    try (InputStream is = Files.newInputStream(newFile(folder, fileName).toPath());
         GZIPInputStream gIS = new GZIPInputStream(is);
         ObjectInputStream oIS = new ObjectInputStream(gIS)) {
        return oIS.readObject();
    } catch (Exception ex) {
        String error = "Error in: " + folder + fileName + " readCompressedSerialisedObject_File " + ex.getMessage() + " " + ex.getClass() + " " + ex.getCause();
        Base.print(error, Base.TEXT_TYPE.ERROR);
        throw new RuntimeException(error);
    }
}

I have a feeling that this might be related to encoding. But if the file is written and read back on the same machine, why would that matter? Any help is welcome. Thanks!

Solution

I have a feeling that this might be related to encoding.

That's not even relevant.

Memory and disk space (and network pipes and just about every other low-level comms channel a computer system offers) consist of bytes. That means you can send a sequence of values, each value between 0 and 255.

Text is something quite different. There are vastly more than 256 characters one could feasibly want to send, so how do we do that?

That is what encoding is: An algorithm that turns a sequence of characters into a sequence of bytes. And vice versa.

The US-ASCII protocol, for example, maps byte values 0-31 (and 127) to certain control concepts (such as 'newline' or 'tab'), 32-126 to certain symbols (such as 'A'), and decrees that no other symbols can exist (and further decrees that the bytestream will simply never contain any values above 127: the top bit is always 0). If you want to send an é, tough. You can't. It's not one of the 95 printable characters (space included) that the US-ASCII spec defines. The advantage is that it's a very simple encoding with a few nice properties (such as: the length of the data in bytes is identical to the length of the data in characters).

The UTF-8 protocol lets you send any Unicode character, at the cost of taking anywhere between 1 and 4 bytes to encode a single character. And so on.
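To make the character-to-byte mapping concrete, here is a minimal sketch. The class and method names (EncodingDemo, utf8Length, asciiBytes) are made up for illustration and are not from the question; the only library call involved is String.getBytes(Charset).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names appear in the question.
class EncodingDemo {
    // Number of bytes the given text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes produced when the text is forced through US-ASCII.
    // String.getBytes replaces characters the charset cannot represent
    // with the charset's replacement byte, which is '?' (0x3F) for US-ASCII.
    static byte[] asciiBytes(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));             // plain ASCII letter
        System.out.println(utf8Length("\u00e9"));        // e with acute accent
        System.out.println(utf8Length("\u20ac"));        // euro sign
        System.out.println(utf8Length("\uD83D\uDE00"));  // an emoji (surrogate pair)
        System.out.println(asciiBytes("\u00e9")[0]);     // what US-ASCII makes of the accented e
    }
}
```

The same string turns into a different number of bytes depending on the charset you pick, and US-ASCII silently loses the é. None of this applies to a gzip stream, which never was characters in the first place.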

Point is, "encoding" is fundamentally a property of text.

GZipped data isn't text, therefore 'what encoding is this gzip file' makes about as much sense as asking 'what color is the taste of apple pie'. The question is a non sequitur.

} catch (IOException ex) {
     Base.print(ex.getMessage(), Base.TEXT_TYPE.ERROR);
}

This isn't good. An exception basically has 5 useful properties:

Its type (For example, NoSuchFileException)
Its message (For example, 'disk unmounted')
Its causal chain (sometimes an exception is caused by another and the cause is often more useful. And that cause can itself have a cause, and so on).
Its stack trace, which tells you where in the code the problem occurs.
Its suppressed exceptions (this is rarely useful).

Exceptions are Java types and can carry more info. For example, SQLException contains the SQL-side 'error state' number, which can be useful.
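The five properties listed above are all available through plain Throwable methods. A small sketch, where the class name ExceptionAnatomy and the method describe are hypothetical names used only for illustration:

```java
// Hypothetical helper; it only shows where each of the five properties lives.
class ExceptionAnatomy {
    static String describe(Throwable t) {
        StringBuilder sb = new StringBuilder();
        sb.append("type=").append(t.getClass().getName());            // 1. the type
        sb.append(", message=").append(t.getMessage());               // 2. the message
        sb.append(", cause=").append(t.getCause());                   // 3. first link of the causal chain
        sb.append(", frames=").append(t.getStackTrace().length);      // 4. size of the stack trace
        sb.append(", suppressed=").append(t.getSuppressed().length);  // 5. suppressed exceptions
        return sb.toString();
    }
}
```

In practice you rarely need to pick these apart by hand: passing the exception object itself to a logger, or rethrowing it as a cause, keeps all five intact.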

You're tossing all that in the garbage, except for 'the message', which is generally defined as not being meaningful without also having the context of the exception type.

This code boils down to the following error handling scheme:

An error has occurred.
Take the top-level explanation of what the problem is and chuck it in the garbage. (ditch the exception's type)
Take the relevant details about what went wrong under the hood and rip it to pieces. (ditch the causal chain).
Take the info about where the problem occurred in the code so you know as developer where to go to fix it, and destroy it. (ditch the stack trace).
Print the 'middle-level' explanation, which is likely a nonsensical string without all the other context, to somewhere.
Now continue as if nothing was wrong.

None of this is right.
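One way to avoid throwing that information away, sketched under the assumption that the caller can live with an unchecked exception: wrap the IOException as a cause and let it propagate. The class name CompressedWriter and the method write are hypothetical; only the stream chain is taken from the question.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Hypothetical rewrite of the question's write method.
class CompressedWriter {
    static void write(Path target, Object value) {
        try (OutputStream os = Files.newOutputStream(target);
             GZIPOutputStream gos = new GZIPOutputStream(os);
             ObjectOutputStream oos = new ObjectOutputStream(gos)) {
            oos.writeObject(value);
        } catch (IOException ex) {
            // Wrap instead of swallow: the original exception travels along as
            // the cause, so its type, message, causal chain and stack trace survive,
            // and the caller does not carry on as if the file were complete.
            throw new UncheckedIOException("Failed to write " + target, ex);
        }
    }
}
```

If the method really has to keep running after a failure, the next best thing is to hand the exception object itself to whatever does the logging, so the full stack trace gets printed, rather than only ex.getMessage().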

And that last bullet can easily explain why you are seeing all zeroes: If the exception does occur you print something and just keep on going. A half-baked truncated gzip file (which would cause 'not in gzip format' when you try to read the half-baked product) can of course occur if you ignore an error that occurs halfway through the process.

The code as pasted will not result in all-zeroes unless your JVM or kernel is corrupted, or you're not running this code, or you're editing the file later on, or the storage layer is corrupted (disk failure), or something similarly exotic.

Hence, keeping a closer eye on the error is the first thing to check. It is very rare for a write to disk to go completely fine (no errors at all) and for a later read to then return all zeroes due to disk corruption. Writing to disk, getting an error, and then having the half-baked remains of this process read back as all zeroes is still rare, but not nearly as rare.
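While you are tracking this down, a quick sanity check right after writing can tell you whether the file on disk even starts like a gzip stream. This is only a diagnostic sketch: the class name GzipHeaderCheck and the method looksLikeGzip are hypothetical, and it checks nothing beyond the two-byte magic header (GZIPInputStream.GZIP_MAGIC).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical diagnostic helper, not a full validity check.
class GzipHeaderCheck {
    // True if the file starts with the gzip magic bytes 0x1f 0x8b.
    static boolean looksLikeGzip(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            int b0 = in.read();
            int b1 = in.read();
            if (b0 < 0 || b1 < 0) {
                return false; // file is shorter than two bytes
            }
            // GZIP_MAGIC is 0x8b1f; on disk the low byte (0x1f) comes first.
            return ((b1 << 8) | b0) == GZIPInputStream.GZIP_MAGIC;
        }
    }
}
```

A file full of zeroes fails this check immediately, so calling it straight after the write narrows down whether the damage happens during the write or at some point afterwards.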