hivebigdatahiveqlhadoop-partitioning

Hive Bucketing : Number of distinct column value is greater than Number of bucketing number


In hive, say I have a table employee with 1000 records and I am bucketing with subject column. The total distinct values of the subject column is 20, but my total number of buckets is 6. How the shuffling happens?

While understanding the bucketing I came across two thing 1st :Bucket will group all the same value and 2nd :Bucket will group the data based on Hash function. For the above scenario HIVE uses which method (1st or 2nd) to shuffle the data? Help me to understand.


Solution

  • Hive create buckets based on the values decided by user while table creation. And hive stores the data inside each buckets based on the column value hash function you can think of like

    hash_function(bucketing_col_value) mod num_buckets

    so in your case it will be something like

    bucketing_col_value mod num_buckets

    based on that all records with same hash value will go in same partition. lets say you subject columns have 20 distinct values from 1 to 20. so

    1 mod 6 = 1 = will go in 1st bucket
    2 mod 6 = 2 = will go in 2nd bucket
    3 mod 6 = 3 = will go in 3rd bucket
    4 mod 6 = 4 = will go in 4th bucket
    5 mod 6 = 5 = will go in 5th bucket
    6 mod 6 = 6 = will go in 6th bucket
    7 mod 6 = 1 = will go in 1st bucket
    8 mod 6 = 2 = will go in 2nd bucket
    .
    .
    .
    So on..
    

    so according to your question both cases goes true in bucketing case

    1st :Bucket will group all the same value

    Yes because data with same value will have same hash value which means it will go in same buckets

    2nd :Bucket will group the data based on Hash function

    like i showed earlier buckets will divide data based on hash value of data.

    if you notice during data load phase in your query number of reducers is same as number of buckets in target table.

    I hope you are clear now