
Is there a way to compress a .npy file more tightly than by using the LZMA algorithm?


I am trying to compress some .npy files as tightly as possible. From what I have read, the LZMA algorithm is the usual choice for this.

So far I have tried tar with xz at compression level 9, and Python's lzma compression. Both seem effective, but I was wondering whether anybody had tried something better. Is LZMA really the best algorithm, or is there something better? I am optimizing SOLELY for compression ratio; time to compress is a non-issue. I also recognize that .npy is already more compressed than, for example, an image, so there is a limit to how much further the result can be improved.

I am dealing with both folders of .npy files and single .npy files.

Edit: The .npy files contain hyperspectral images from the Harvard Real World Hyperspectral Image Dataset stacked together


Solution

  • You should start by using your knowledge of the content of your NumPy arrays before they are even stored in an .npy file. Are all of the bits in the hyperspectral image data significant? If the values are 64-bit floating point numbers, for example, then almost certainly not. In that case half or more of what you're saving is noise, which can't be compressed and is not useful either. You could first transform the values to keep only the significant bits. Are adjacent pixels correlated? Almost certainly so. You could store only the differences between pixels, each one relative to its neighbor on the left, or use a more sophisticated filter based on the pixels above and to the left (see the PNG format for examples), or you could use a 2D cosine or wavelet transform to isolate the higher frequency components, which you may be able to store in fewer bits. Are planes of the images correlated? Are successive images correlated? Is there anything else you can take advantage of?

    Then you can try various lossless compression methods. Make sure that the format you save the data in is not already compressed. (.npy is not compressed. .npz is a zip archive of .npy files, and it is compressed only if it was written with numpy.savez_compressed; plain numpy.savez stores the arrays uncompressed.) Beyond LZMA, you can try PPMd, which may or may not give you better compression. But it will definitely meet your requirement of being slower!
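
The "store only differences" idea above can be sketched as follows. This is a minimal illustration, not a tuned pipeline: the array `cube` is synthetic stand-in data (an assumption, not the Harvard dataset), and `delta_encode` / `delta_decode` are hypothetical helper names introduced here. It assumes integer samples, where wrap-around arithmetic makes the filter exactly reversible. PPMd is not in the Python standard library, so only lzma and bz2 are compared. Whether the filter helps, and along which axis, depends on the data, so the sizes should be measured on the real files.

```python
import bz2
import lzma

import numpy as np


def delta_encode(arr, axis=-1):
    """Replace each sample by its difference from the previous one along `axis`.

    For integer dtypes the subtraction wraps around, so no information is lost.
    """
    a = np.moveaxis(arr, axis, -1)
    out = a.copy()
    out[..., 1:] = a[..., 1:] - a[..., :-1]
    return np.moveaxis(out, -1, axis)


def delta_decode(delta, axis=-1):
    """Invert delta_encode with a running sum kept in the original dtype."""
    return np.cumsum(delta, axis=axis, dtype=delta.dtype)


# Synthetic stand-in for a 16-bit image cube (rows, columns, bands):
# a smooth gradient plus a little noise. This is assumed data for illustration.
rng = np.random.default_rng(0)
y, x = np.mgrid[0:64, 0:64]
base = 2000 + 20 * x + 10 * y
cube = np.stack([base + 50 * band for band in range(8)], axis=-1)
cube = (cube + rng.integers(0, 4, size=cube.shape)).astype(np.uint16)

# Difference each pixel against its left-hand neighbor (axis 1 = columns).
filtered = delta_encode(cube, axis=1)

# The filter must be lossless: decoding has to return the original array.
assert np.array_equal(delta_decode(filtered, axis=1), cube)

# Highest xz preset, with the "extreme" flag that trades time for ratio.
preset = 9 | lzma.PRESET_EXTREME
sizes = {
    "lzma, unfiltered": len(lzma.compress(cube.tobytes(), preset=preset)),
    "lzma, delta-filtered": len(lzma.compress(filtered.tobytes(), preset=preset)),
    "bz2, delta-filtered": len(bz2.compress(filtered.tobytes(), 9)),
}

print(f"raw size: {cube.nbytes} bytes")
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

To restore an array, decompress the bytes, rebuild it with `np.frombuffer(data, dtype=np.uint16).reshape(shape)`, and pass it through `delta_decode` with the same axis. The shape and dtype have to be stored alongside the compressed bytes, since `tobytes()` drops the .npy header.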