I'm trying to understand below query by using that how data is going to be placed.
CREATE TABLE mytable (
name string,
city string,
employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
The keyword PARTITIONED BY
will distribute the data in below like dir structure.
/user/hive/warehouse/mytable/y=2015/m=12/d=02
But am not able to understand, how employee_id
will be distributed among these directories ? 256 buckets (files) will be created, and all those files will be having all employee_id
but which file will sit under which dir, how that will be decided ?
Can anyone help me to understand this ?
Directories are partitions or table location. Buckets are files inside these directories.
Complex partitions are hierarchical directories. In your case:
`/user/hive/warehouse/mytable/` - Table location, contains partition directories:
`y=2015/` - year partition directory, contains months directories:
`m=12/` - month partition, contains days partitions directories:
`d=02/` - day partition, contains 256 files(buckets)
00000
...
00255
`d=03/` -Each day partition will contain 256 files (if you have enough data)
00000
...
00255
Each file will contain not all employee_id. Which record will sit in which file is decided using this formula:
bucket_number=hash_function(employee_id) MOD 256
Where hash_function
is integer, in case of Int employee_id
it is equal to employee_id.
256
- is the number of buckets
MOD 256
will produce integer values in the range [0..255], corresponding to bucket numbers.
The same id's will always be in the same buckets. Each daily partition will contain it's own files(buckets), up to 256 buckets in each.
Say, employee_id = 1024 goes into bucket 0, if the same employee_id exists in many days, it will be in file 00000 in each day directory.
employee_id=1050
goes into file 000026 because 1050 MOD 256 = 26
.
So, first the data is partitioned by partition key, inside partitions it is bucketed (distributed between files).