Tags: hadoop, hive, orc, apache-tez, zstd

"No enum constant org.apache.orc.CompressionKind.ZSTD" when inserting data into a ZSTD-compressed ORC table


I created a table in Hive 3.1.3 as follows:

CREATE EXTERNAL TABLE test_tez_orc_zstd
(
  id BIGINT
)
STORED AS ORC
LOCATION '...'
TBLPROPERTIES ('orc.compress'='zstd');

The table was created successfully. I then tried to insert one row:

INSERT INTO test_tez_orc_zstd
SELECT 1;

It then threw the following error:

No enum constant org.apache.orc.CompressionKind.ZSTD

Hive is configured to use Tez.

If I do the same thing with a Parquet table compressed with ZSTD, it works.

How can I handle this?


Solution

  • ROOT CAUSE:

    Apache Hive 3.1.3 uses ORC 1.5.8. zstd decompression has been supported in ORC only since 1.6.0; see https://issues.apache.org/jira/browse/ORC-363.

    Comparing the CompressionKind enum constants in ORC 1.5.8 and 1.6.0 shows that ZSTD is missing from 1.5.8, which is why the lookup fails with "No enum constant". So Hive 3.1.3 does not support orc.compress=zstd in TBLPROPERTIES.
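To confirm which ORC version a given Hive installation actually ships, one option is to read it off the jar file name in Hive's lib directory. The sketch below assumes the jar follows the orc-core-<version>.jar naming convention and that the caller passes in the lib directory; the function names are made up for this example and are not part of Hive or ORC.

```shell
# Hypothetical helpers; not part of Hive or ORC.

# Print the version embedded in a file name like orc-core-1.5.8.jar.
orc_version_from_jar() (
  name=$(basename "$1" .jar)
  echo "${name#orc-core-}"
)

# Succeed if the version is 1.6.0 or later, i.e. new enough to have
# the CompressionKind.ZSTD constant.
orc_supports_zstd() (
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 6 ]; }
)

# Report zstd availability for every orc-core jar in a lib directory.
report_orc_zstd() (
  for jar in "$1"/orc-core-*.jar; do
    [ -e "$jar" ] || continue        # the glob matched nothing
    v=$(orc_version_from_jar "$jar")
    if orc_supports_zstd "$v"; then
      echo "$v: ZSTD available"
    else
      echo "$v: ZSTD not available"
    fi
  done
)
```

Calling report_orc_zstd with the path to Hive's lib directory prints one line per orc-core jar found. The check only looks at the file name, so it will not notice ORC classes bundled inside other jars.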


    POSSIBLE SOLUTION:

    Hive upgraded ORC past 1.6.0 in release 4.0.0-alpha-1; see https://issues.apache.org/jira/browse/HIVE-23553.

    This might be challenging, but you can backport the related commits on top of release tag 3.1.3, build the project, and replace the affected jars in Hive's library directory.

    Note that ORC is present not only as standalone jars in Hive's library directory; its classes are also bundled into some of the fat jars, such as hive-exec.
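One way to find both kinds of jar is to list the standalone ORC jars by name and then look inside every other jar for ORC classes. This is only a sketch: it assumes unzip is installed, that standalone jars are named orc-*.jar, and that bundled ORC classes sit under org/apache/orc/ (that is, they have not been relocated to another package). The function names are hypothetical.

```shell
# Hypothetical helpers for inspecting a Hive lib directory.

# Print standalone ORC jars (file name starts with "orc-").
list_orc_jars() (
  for jar in "$1"/orc-*.jar; do
    [ -e "$jar" ] || continue        # the glob matched nothing
    echo "$jar"
  done
)

# Print other jars that bundle ORC classes, such as hive-exec.
# Assumes the classes live under org/apache/orc/ inside the jar.
list_fat_jars_with_orc() (
  for jar in "$1"/*.jar; do
    [ -e "$jar" ] || continue
    case "$(basename "$jar")" in
      orc-*) continue ;;             # already covered by list_orc_jars
    esac
    if unzip -l "$jar" 2>/dev/null | grep -q 'org/apache/orc/'; then
      echo "$jar"
    fi
  done
)
```

Both functions take the lib directory as their only argument and print one jar path per line, so the two lists together are the candidates for replacement.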

    The steps are as follows:

    1. Clone Hive and check out release tag 3.1.3.
    2. Backport the commits that upgrade ORC to the desired version.
    3. Build the project with mvn clean package -DskipTests.
    4. Search the lib directory of your Hive installation for orc to see which ORC jars are directly on the classpath and which fat jars contain ORC classes.
    5. Replace the jars that you identified in the previous step with the newly built ones.
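Steps 1 to 3 can be sketched as a small script. Everything specific here is an assumption to adapt: the tag name you pass in (the Hive repository appears to name release tags like rel/release-3.1.3, but check with git tag), the branch and directory names, and above all the commit hashes, which you have to pick yourself from the ORC upgrade history. Setting DRY_RUN=1 only prints the commands, which is a safe way to review the plan first.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: backport_orc_upgrade <release-tag> <commit>...
# The commits are placeholders supplied by the caller; none are implied here.
backport_orc_upgrade() (
  tag=$1
  shift
  run git clone https://github.com/apache/hive.git hive-backport || exit 1
  run cd hive-backport || exit 1
  run git checkout -b orc-backport "$tag" || exit 1
  for commit in "$@"; do
    # Stops at the first conflict; resolve it, then run
    # "git cherry-pick --continue" and re-run for the remaining commits.
    run git cherry-pick -x "$commit" || exit 1
  done
  run mvn clean package -DskipTests
)
```

For example, DRY_RUN=1 followed by backport_orc_upgrade with a tag and one or more hashes prints the clone, checkout, cherry-pick and build commands without touching the network or the file system.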

    The challenging part is that the ORC upgrade commits can be quite large, so the backport may produce merge conflicts.