In hive, say I have a table employee with 1000 records and I am bucketing with subject column. The total distinct values of the subject column is 20, but my total number of buckets is 6. How the shuffling happens?
While understanding the bucketing I came across two thing 1st :Bucket will group all the same value and 2nd :Bucket will group the data based on Hash function. For the above scenario HIVE uses which method (1st or 2nd) to shuffle the data? Help me to understand.
Hive create buckets based on the values decided by user while table creation. And hive stores the data inside each buckets based on the column value hash function you can think of like
hash_function(bucketing_col_value) mod num_buckets
so in your case it will be something like
bucketing_col_value mod num_buckets
based on that all records with same hash value will go in same partition. lets say you subject columns have 20 distinct values from 1 to 20. so
1 mod 6 = 1 = will go in 1st bucket
2 mod 6 = 2 = will go in 2nd bucket
3 mod 6 = 3 = will go in 3rd bucket
4 mod 6 = 4 = will go in 4th bucket
5 mod 6 = 5 = will go in 5th bucket
6 mod 6 = 6 = will go in 6th bucket
7 mod 6 = 1 = will go in 1st bucket
8 mod 6 = 2 = will go in 2nd bucket
.
.
.
So on..
so according to your question both cases goes true in bucketing case
1st :Bucket will group all the same value
Yes because data with same value will have same hash value which means it will go in same buckets
2nd :Bucket will group the data based on Hash function
like i showed earlier buckets will divide data based on hash value of data.
if you notice during data load phase in your query number of reducers is same as number of buckets in target table.
I hope you are clear now