java c# gzip gzipstream

Java vs C# GZip Compression


Any idea why the string produced by Java's GZIPOutputStream differs from the string produced by .NET's GZipStream for the same input?

Java Code:

package com.company;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Base64;

public class Main {

    public static void main(String[] args) {
        String myValue = "<Grid type=\"mailing_activity_demo\"><ReturnFields><DataElement>mailing_id</DataElement></ReturnFields></Grid>";

        int length = myValue.length();

        byte[] compressionResult = null;

        try {
            compressionResult = MyUtils.compress(myValue);
        } catch (IOException e) {
            e.printStackTrace();
        }

        byte[] headerBytes = ByteBuffer.allocate(4).putInt(length).array();

        byte[] fullBytes = new byte[headerBytes.length + compressionResult.length];

        System.arraycopy(headerBytes, 0, fullBytes, 0, headerBytes.length);

        System.arraycopy(compressionResult, 0, fullBytes, headerBytes.length, compressionResult.length);

        String result = Base64.getEncoder().encodeToString(fullBytes);
        System.out.println((result));
    }
}




package com.company;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class MyUtils
{

    public static byte[] compress(String data) throws IOException
    {
        // Encode explicitly as UTF-8 so the input matches Encoding.UTF8 on the .NET side,
        // instead of relying on the platform default charset.
        byte[] plainBytes = data.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream bos = new ByteArrayOutputStream(plainBytes.length);

        // Closing the GZIPOutputStream flushes the deflater and writes the gzip trailer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(plainBytes);
        }

        return bos.toByteArray();

    }

}

The string that I get from the Java code above is:

AAAAbB+LCAAAAAAAAP+zcS/KTFEoqSxItVXKTczMycxLj09MLsksyyypjE9Jzc1XsrMJSi0pLcpzy0zNSSm2s3FJLEl0zUnNTc0rsYPpyEyx0UcWt9FH1aMPssUOAKHavIJsAAAA

The .NET C# code:

    public static string CompressData(string data)
    {
        using (MemoryStream memoryStream = new MemoryStream())
        {
            byte[] plainBytes = Encoding.UTF8.GetBytes(data);

            using (GZipStream zipStream = new GZipStream(memoryStream, CompressionMode.Compress, leaveOpen: true))
            {
                zipStream.Write(plainBytes, 0, plainBytes.Length);
            }

            memoryStream.Position = 0;

            byte[] compressedBytes = new byte[memoryStream.Length + CompressedMessageHeaderLength];

            Buffer.BlockCopy(
                BitConverter.GetBytes(plainBytes.Length),
                0,
                compressedBytes,
                0,
                CompressedMessageHeaderLength
            );

            // Copy the compressed payload in after the header (the header holds the uncompressed length).
            memoryStream.Read(compressedBytes, CompressedMessageHeaderLength, (int)memoryStream.Length);

            string compressedXml = Convert.ToBase64String(compressedBytes);

            return compressedXml;
        }
    }

Compressed string:

bAAAAB+LCAAAAAAABACzcS/KTFEoqSxItVXKTczMycxLj09MLsksyyypjE9Jzc1XsrMJSi0pLcpzy0zNSSm2s3FJLEl0zUnNTc0rsYPpyEyx0UcWt9FH1aMPssUOAKHavIJsAAAA

Any idea what I am doing wrong in the Java code?


Solution

  • To add to @MarcGravell's answer about differences in GZip encoding, it's worth noting that it looks like you've got an endianness issue with your header bytes, which will be messing up a decoder.

    Your header is 4 bytes, which encodes to 5 1/3 base64 characters. The .NET version outputs bAAAAB (the first 4 bytes of which are 6c 00 00 00), whereas the Java version outputs AAAAbB (the first 4 bytes of which are 00 00 00 6c). In both cases 0x6c is 108, the length of the input string. The fact that the b moves four characters along a run of A's is your first clue (A represents 000000 in base64), but decoding it makes the issue obvious.

    .NET's BitConverter uses your machine architecture's endianness, which on x86 is little-endian (check BitConverter.IsLittleEndian). Java's ByteBuffer defaults to big-endian, but is configurable. This explains why one is writing little-endian, and the other big-endian.

    You'll want to decide on an endianness, and then align both sides. You can change the ByteBuffer to use little-endian by calling .order(ByteOrder.LITTLE_ENDIAN) on it before putInt (the constant lives in java.nio.ByteOrder, not ByteBuffer). In .NET, you can use BinaryPrimitives.WriteInt32BigEndian / BinaryPrimitives.WriteInt32LittleEndian to write with an explicit endianness if you're using .NET Core 2.1+, or use IPAddress.HostToNetworkOrder to switch endianness if necessary (depending on BitConverter.IsLittleEndian) if you're stuck on something earlier.
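
As a rough illustration of the points above, the following Java sketch decodes the first eight base64 characters of each string from the question and then builds the length header with an explicit byte order. The class and method names (HeaderEndianness, leadingBytesHex, lengthHeader) are made up for this example and are not part of either code base; it assumes you settle on little-endian to match what the .NET side already emits.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper class, for illustration only.
class HeaderEndianness {

    // Format a byte array as space-separated lowercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", bytes[i] & 0xff));
        }
        return hex.toString();
    }

    // Decode a base64 string and return the first `count` bytes as hex.
    static String leadingBytesHex(String base64, int count) {
        byte[] decoded = Base64.getDecoder().decode(base64);
        return toHex(Arrays.copyOf(decoded, count));
    }

    // Build the 4-byte length header with an explicit byte order,
    // rather than relying on ByteBuffer's big-endian default.
    static byte[] lengthHeader(int length, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(length).array();
    }

    public static void main(String[] args) {
        // First 8 base64 characters of each output quoted in the question:
        // 4 header bytes followed by the gzip magic number 1f 8b.
        System.out.println(leadingBytesHex("AAAAbB+L", 6)); // Java: 00 00 00 6c 1f 8b
        System.out.println(leadingBytesHex("bAAAAB+L", 6)); // .NET: 6c 00 00 00 1f 8b

        // The same length (108) written with each byte order.
        System.out.println(toHex(lengthHeader(108, ByteOrder.BIG_ENDIAN)));    // 00 00 00 6c
        System.out.println(toHex(lengthHeader(108, ByteOrder.LITTLE_ENDIAN))); // 6c 00 00 00
    }
}
```

In the question's Main class, the equivalent change would be to replace ByteBuffer.allocate(4).putInt(length) with ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(length). This only aligns the length header; any remaining byte differences inside the gzip header itself are a separate matter.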