hadoophivehadoop-partitioninghive-partitionshiveddl

How partitioning and clustered by works in Hive table?


I'm trying to understand below query by using that how data is going to be placed.

CREATE TABLE mytable ( 
         name string,
         city string,
         employee_id int ) 
PARTITIONED BY (year STRING, month STRING, day STRING) 
CLUSTERED BY (employee_id) INTO 256 BUCKETS

The keyword PARTITIONED BY will distribute the data in below like dir structure.

/user/hive/warehouse/mytable/y=2015/m=12/d=02

But am not able to understand, how employee_id will be distributed among these directories ? 256 buckets (files) will be created, and all those files will be having all employee_id but which file will sit under which dir, how that will be decided ?

Can anyone help me to understand this ?


Solution

  • Directories are partitions or table location. Buckets are files inside these directories.

    Complex partitions are hierarchical directories. In your case:

    `/user/hive/warehouse/mytable/` - Table location, contains partition directories:
      `y=2015/` - year partition directory, contains months directories: 
        `m=12/` - month partition, contains days partitions directories:
          `d=02/` - day partition, contains 256 files(buckets)
            00000
            ...
            00255
         `d=03/` -Each day partition will contain 256 files (if you have enough data)
            00000
            ...
            00255     
    

    Each file will contain not all employee_id. Which record will sit in which file is decided using this formula:

     bucket_number=hash_function(employee_id) MOD 256 
    

    Where hash_function is integer, in case of Int employee_id it is equal to employee_id.

    256 - is the number of buckets

    MOD 256 will produce integer values in the range [0..255], corresponding to bucket numbers.

    The same id's will always be in the same buckets. Each daily partition will contain it's own files(buckets), up to 256 buckets in each.

    Say, employee_id = 1024 goes into bucket 0, if the same employee_id exists in many days, it will be in file 00000 in each day directory.

    employee_id=1050 goes into file 000026 because 1050 MOD 256 = 26.

    So, first the data is partitioned by partition key, inside partitions it is bucketed (distributed between files).