I want to create an ORC file compressed with ZLIB at compression level 9. The problem is that pyarrow.orc only lets me choose between the "SPEED" and "COMPRESSION" strategies, so I can't control the compression level itself.
For example:
orc.write_table(table, '{0}_zlib.orc'.format(file_without_ext),
                compression='ZLIB', compression_strategy='COMPRESSION')
Ideally I'd like something like a compression_level parameter, which doesn't seem to exist. Any help would be appreciated.
The Apache ORC library (which other libraries use internally for ORC support) doesn't allow setting the compression level freely, in either the C++ or the Java implementation.
The C++ library supports only CompressionStrategy_SPEED and CompressionStrategy_COMPRESSION (source):
enum CompressionStrategy { CompressionStrategy_SPEED = 0, CompressionStrategy_COMPRESSION };
The Java library offers an additional FASTEST option (source):
enum SpeedModifier {
    /* speed/compression tradeoffs */
    FASTEST,
    FAST,
    DEFAULT
}
There is an open request in the project about this: Support maximum compression ratio in setSpeed. It was created a year ago but the feature has not been implemented so far.
So, unless you patch the library yourself, there is no way to set a high compression level.
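Before going down the patching route, it may be worth checking how much level 9 would buy you at all. The sketch below uses Python's standard zlib module, not ORC, to compress a made-up sample payload at levels 1, 6 (zlib's default) and 9. ORC compresses encoded column streams in blocks, so treat the numbers only as a rough indication and substitute bytes from your own data.

```python
import zlib

# Hypothetical sample payload; replace with bytes taken from your own data.
sample = b"".join(b"row-%d,value-%d\n" % (i % 100, i) for i in range(50_000))

# Compressed size at a fast level, zlib's default level, and the maximum level.
sizes = {level: len(zlib.compress(sample, level)) for level in (1, 6, 9)}

for level, size in sizes.items():
    print("level", level, "->", size, "bytes")
```

If the gap between level 6 and level 9 is small for your data, the existing COMPRESSION strategy is probably close to what a custom level would give you.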