arrays memory random compression lossless-compression

Can a random byte array be decompressed? What algorithm would be used?


I can feed any byte array into a compression algorithm like Lempel-Ziv and I will get a losslessly compressed byte array back, which I can then decompress again.

But is there an algorithm that can decompress any random byte array? I tried it with gzip, but it threw an error: System.IO.InvalidDataException: 'The archive entry was compressed using an unsupported compression method.'

What would be the outcome? Will it be something random, but with repeating patterns? Let me know what you think!

Oh, and by the way, here is the code that I used:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Threading;

public class Compresser
{
    public static string CompressString(string text)
    {
        byte[] byteArray = Encoding.UTF8.GetBytes(text);

        using (MemoryStream memoryStream = new MemoryStream())
        {
            using (GZipStream gzipStream = new GZipStream(memoryStream, CompressionMode.Compress))
            {
                gzipStream.Write(byteArray, 0, byteArray.Length);
            }

            return Convert.ToBase64String(memoryStream.ToArray());
        }
    }

    public static string DecompressString(string compressedText)
    {
        byte[] byteArray = Convert.FromBase64String(compressedText);

        using (MemoryStream memoryStream = new MemoryStream(byteArray))
        {
            using (GZipStream gzipStream = new GZipStream(memoryStream, CompressionMode.Decompress))
            {
                using (MemoryStream decompressedStream = new MemoryStream())
                {
                    gzipStream.CopyTo(decompressedStream);
                    byte[] decompressedBytes = decompressedStream.ToArray();
                    return Encoding.UTF8.GetString(decompressedBytes);
                }
            }
        }
    }
}

public class Base64RandomFiller
{
    public static string GenerateRandomBase64String(int length)
    {
        // (length * 3) / 4 bytes encode to roughly 'length' Base64 characters
        byte[] randomBytes = new byte[(length * 3) / 4];
        using (var rng = new System.Security.Cryptography.RNGCryptoServiceProvider())
        {
            rng.GetBytes(randomBytes);
        }
        return Convert.ToBase64String(randomBytes);
    }
}

string randomBase64String = Base64RandomFiller.GenerateRandomBase64String(10);
Console.WriteLine("Random String: " + randomBase64String);

Thread.Sleep(1000);

string decompressedRandomString = Compresser.DecompressString(randomBase64String);
Console.WriteLine("Decompressed Random String: " + decompressedRandomString);
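For what it's worth, the failure can be handled rather than crashing. This standalone sketch (using GZipStream directly and skipping the Base64 step) feeds 16 random bytes to the decompressor and catches the near-certain InvalidDataException:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;

// Feed 16 random bytes straight to GZipStream. With near certainty the
// stream is rejected: usually the 2-byte magic number already mismatches.
byte[] randomBytes = new byte[16];
RandomNumberGenerator.Fill(randomBytes);

try
{
    using var input = new MemoryStream(randomBytes);
    using var gzip = new GZipStream(input, CompressionMode.Decompress);
    using var output = new MemoryStream();
    gzip.CopyTo(output);
    Console.WriteLine("Unexpectedly valid gzip stream: " + output.Length + " bytes");
}
catch (InvalidDataException ex)
{
    // The almost-certain outcome for random input.
    Console.WriteLine("Not a valid gzip stream: " + ex.Message);
}
```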

Solution

  • For gzip, no. If you feed it random data, even if you get past the header by chance, you will eventually run into invalid compressed data, with near certainty.

    Of course, it is possible to randomly generate a valid gzip stream, but the probability is vanishingly small. The shortest such stream is 20 bytes, of which effectively 42 bits can be any value, leaving 118 bits with a single value. If you generate 20 random bytes, there is a probability of 2^-118 ≈ 3×10^-36 that it will be a valid gzip stream. (If you're somehow generating a terabyte of random data per second, it would take ten quadrillion years on average to get a valid gzip stream.)
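That arithmetic can be checked directly. The sketch below assumes one 20-byte candidate stream starting at each byte of a terabyte-per-second random source:

```csharp
using System;

// 20 random bytes = 160 bits; 42 bits are free, the remaining 118 must
// each match a single required value.
double probability = Math.Pow(2, -118);
Console.WriteLine(probability);  // ~3.0e-36

// One terabyte per second, one candidate stream per byte position:
double candidatesPerSecond = 1e12;
double secondsPerYear = 365.25 * 24 * 3600;
double expectedYears = Math.Pow(2, 118) / candidatesPerSecond / secondsPerYear;
Console.WriteLine(expectedYears / 1e15);  // ~10, i.e. ten quadrillion years
```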

    This is true of almost any commonly used decompressor, all of which constrain the set of possible valid bit streams: xz, zstd, lz4, etc. Even if you start with a valid header and then feed them random bytes from there on, they will eventually detect invalid data with near certainty. And even if you somehow generate valid compressed data, you then need to make it through a 32- or 64-bit integrity check.
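The "valid header, then random data" case can be sketched with GZipStream: prepend a well-formed 10-byte gzip header (magic 0x1F 0x8B, deflate method 8; this particular header is one choice among several conforming ones) to random bytes, and the deflate decoder still rejects the stream with near certainty:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;

// A minimal well-formed gzip header: magic bytes, CM=8 (deflate),
// no flags, zero mtime, no extra flags, unknown OS (0xFF).
byte[] header = { 0x1F, 0x8B, 0x08, 0x00, 0, 0, 0, 0, 0x00, 0xFF };
byte[] body = new byte[64];
RandomNumberGenerator.Fill(body);

byte[] candidate = new byte[header.Length + body.Length];
header.CopyTo(candidate, 0);
body.CopyTo(candidate, header.Length);

try
{
    using var gzip = new GZipStream(new MemoryStream(candidate), CompressionMode.Decompress);
    gzip.CopyTo(Stream.Null);
    Console.WriteLine("Random deflate data happened to be valid (vanishingly unlikely).");
}
catch (InvalidDataException)
{
    // Near-certain outcome: the header is accepted, but the random bytes
    // do not form a valid deflate stream (or fail the CRC32/length trailer).
    Console.WriteLine("Header accepted, but the deflate stream itself is invalid.");
}
```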

    It is possible to write a bijective compressor-decompressor, where all possible bit streams are valid. In principle, making use of all possible compressed bit streams maximizes the efficiency of compression. However, that particular maximization is inconsequential compared to the other things you can do to improve compression, so you will not find any bijective compressor-decompressors in common use. They have been written, though. You can find more information on a no-longer-up web page, saved in the Wayback Machine. (Consider donating to them.)
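To make the bijectivity idea concrete, here is a toy identity "codec" (hypothetical, and with no compression at all): every possible byte array is a valid compressed stream, and compress/decompress are exact inverses, which is the defining property real bijective compressors keep while actually shrinking typical inputs:

```csharp
using System;
using System.Linq;

// Toy bijective codec: the identity map. Decompress never fails, for any
// input whatsoever, and the round trip reproduces the input exactly.
static byte[] Compress(byte[] data) => (byte[])data.Clone();
static byte[] Decompress(byte[] data) => (byte[])data.Clone();

byte[] anyRandomInput = { 0xDE, 0xAD, 0xBE, 0xEF };
byte[] output = Decompress(anyRandomInput);  // never throws
Console.WriteLine(output.SequenceEqual(anyRandomInput));  // prints True
```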

    There is an argument for a bijective compressor-decompressor for use with encryption. Then there is no plaintext attack on just the compressed data, since all bit streams are valid. However, this is addressed in the real world by simply preceding the compressed data with a short string of random data.

    Brotli is the closest I've seen in common usage to being able to decompress random data. It has a much higher probability of random data being valid than the others mentioned here, assuming that there is no header or integrity check. See this answer for why there is about a 5% chance of randomly generating correct Brotli compressed data.