
How to use Python to create an ORC file compressed with ZLIB compression level 9?


I want to create an ORC file compressed with ZLIB at compression level 9. The problem is that pyarrow.orc only lets me choose between the "SPEED" and "COMPRESSION" strategies; I can't control the compression level itself.

E.g.

from pyarrow import orc

orc.write_table(table, '{0}_zlib.orc'.format(file_without_ext),
                compression='ZLIB', compression_strategy='COMPRESSION')

Ideally I'm looking for something like a compression_level parameter, which doesn't seem to exist. Any help would be appreciated.


Solution

  • The Apache ORC library (which other libraries use internally for ORC support) doesn't allow setting the compression level freely, in either the C++ or the Java implementation.

    The C++ library supports only CompressionStrategy_SPEED and CompressionStrategy_COMPRESSION (source):

    enum CompressionStrategy { CompressionStrategy_SPEED = 0, CompressionStrategy_COMPRESSION };
    

    The Java library offers an additional FASTEST option (source):

    enum SpeedModifier {
      /* speed/compression tradeoffs */
      FASTEST,
      FAST,
      DEFAULT
    }
    

    There is an open feature request in the project for this, "Support maximum compression ratio in setSpeed". It was created a year ago, but the feature has not been implemented so far.

    So, unless you patch the library yourself, there is no way to set a high compression level.
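To get a rough idea of how much level 9 would actually save, you can compare zlib levels outside ORC with Python's standard zlib module, which exposes the same DEFLATE algorithm. The following is only a sketch: the helper name compare_zlib_levels and the CSV-like sample data are made up for illustration, and it says nothing about which level either ORC strategy uses internally.

```python
import zlib


def compare_zlib_levels(data: bytes, levels=(1, 6, 9)) -> dict[int, int]:
    """Return {level: compressed size in bytes} for each zlib level."""
    sizes = {}
    for level in levels:
        compressed = zlib.compress(data, level)
        # Sanity check: the payload must round-trip unchanged.
        assert zlib.decompress(compressed) == data
        sizes[level] = len(compressed)
    return sizes


# Hypothetical CSV-like sample; replace with bytes representative of your table.
sample = "\n".join(
    f"{i},user_{i % 100},{i * 37 % 1000}" for i in range(20000)
).encode()

if __name__ == "__main__":
    for level, size in compare_zlib_levels(sample).items():
        print(f"level {level}: {size} bytes ({size / len(sample):.1%} of original)")
```

If the gap between the default level and level 9 is small for your data, the missing parameter may not matter much in practice. ORC also applies its own column encodings before compression, so treat the numbers as an indication only.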