compression

Compressing simple text to text


I have a bunch of long strings (16200 characters each) that I want to compress. Each string uses only 12 different characters (currently _oOwWgGmdDsS, but those can change if needed).

I currently use a compression scheme I made myself: I write each character, followed by the number of times it repeats before a different character appears (essentially run-length encoding). So if the uncompressed text looks like this:

ooooooWW_

then the compressed text becomes:

o6W2_1

For the strings I currently have, this reduced the total size from about 128 MB to 4 MB. However, as you can see, there is no saving for the run of two W's, and the single _ actually takes more space than before.
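For reference, the scheme described above can be sketched in Python roughly like this. The function names `rle_encode` and `rle_decode` are made up for illustration, and the decoder assumes the alphabet never contains digits (true for the 12 characters listed).

```python
import re
from itertools import groupby


def rle_encode(text: str) -> str:
    # Each run of identical characters becomes the character
    # followed by the run length in decimal.
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))


def rle_decode(encoded: str) -> str:
    # Assumption: no digit is used as a data character, so every
    # non-digit starts a new (character, count) pair.
    pairs = re.findall(r"(\D)(\d+)", encoded)
    return "".join(ch * int(count) for ch, count in pairs)


print(rle_encode("ooooooWW_"))  # o6W2_1
```

A run of length 1 costs two characters and a run of length 2 breaks even, which is exactly the weakness noted above.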

So I was wondering: are there more sophisticated compression schemes I can use? The end result has to be plain text, however, not binary data.

Note: It would also be great if libraries implementing the scheme exist for both Python and Lua.


Solution

  • Use zlib to compress to binary, and then base64 to encode the binary as plain text (base64 output is about a third larger than its input). Python has both in its standard library. A quick search will turn up Lua bindings for zlib, as well as base64 code.

    Example:

    import base64
    import zlib

    # Compress: text -> UTF-8 bytes -> zlib -> base64 -> plain-text string
    text = input('Text to compress > ')
    compressed = base64.b64encode(zlib.compress(text.encode())).decode()
    print('Compressed Text:', compressed)

    # Decompress: reverse the steps above
    text = input('Text to decompress > ')
    decompressed = zlib.decompress(base64.b64decode(text.encode())).decode()
    print('Decompressed Text:', decompressed)
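
To get a feel for the savings without typing input by hand, the same two steps can be wrapped in functions and tried on a synthetic string. This is only a sketch: `compress_text`, `decompress_text`, and the sample data are invented for illustration, and real strings will compress differently depending on how repetitive they are.

```python
import base64
import zlib


def compress_text(text: str) -> str:
    # zlib produces binary; base64 turns it back into plain ASCII text.
    raw = zlib.compress(text.encode("ascii"), 9)  # 9 = highest compression level
    return base64.b64encode(raw).decode("ascii")


def decompress_text(packed: str) -> str:
    # Reverse order: base64-decode first, then inflate.
    raw = base64.b64decode(packed)
    return zlib.decompress(raw).decode("ascii")


# Hypothetical sample: 16200 characters made of long runs.
sample = ("o" * 40 + "W" * 25 + "_" * 16) * 200
packed = compress_text(sample)

assert decompress_text(packed) == sample  # round trip is lossless
print(len(sample), "->", len(packed), "characters")
```

Because the output is ordinary base64 text, it can be stored or transmitted anywhere plain text is accepted.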