apache-sparkhive

Differences in the calculation logic of grouping_id in HIVE


I have recently encountered an issue while using Hive 2.3.9 version.

When my group by is set to "a, b, c" and the grouping_sets is set to (a, b, (a, b)).

In last Hive version, the grouping__id result is when "a" then 1 when "b" then 2 when "a,b" then 3. But in the 2.3.9 version, I found the result is when "a" then 1 when "b" then 2 when "a, b" then 0. The new regular look like same as Spark grouping__id Regular.

I remember the Hive Grouping__id regular is from low to high, the defualt all is 0, when the dimension appear then reset to 1, finally to tranfer decimalism.

But now it looks like been channged.

So did Hive make some changes ???


Solution

  • As I mentioned in the comment this is because of aligning its behavior with other query engines like apache spark.

    In older versions of Hive:

    Behavior in Hive 2.3.9:

    Example:

    a b c
    1 1 3
    4 5 5

    Lets try bellow:

    SELECT a, b, c, grouping_id(a, b, c)
    FROM table
    GROUP BY GROUPING SETS ((a), (b), (a, b));
    

    Old Behavior (High-to-Low Bit Order)

    Grouping Set Bit Pattern (a, b, c) grouping_id (Decimal)
    (a) 0 1 1 3
    (b) 1 0 1 5
    (a, b) 0 0 1 1

    New Behavior (Low-to-High Bit Order)

    Grouping Set Bit Pattern (a, b, c) grouping_id (Decimal)
    (a) 1 1 0 6
    (b) 1 0 1 5
    (a, b) 1 0 0 4

    References:

    There is a workaround to reach the old result? Let me know if you are intrested in hearing that. I will update the answer.