Lets say I have a partitioned table with multiple columns as partition keys e.g.
partitioned by (department string,year int, month int,day int)
So does this specific order really matter? All the online resources refer to advantage of scanning only specific sub-directories for search. But ultimately everything is a file in big data, directories seem to be more like logical grouping. And when one specifies a filter on partitioned column, hive just needs to know which files are involved and where they are located, not sure how directory is going to be useful -- it's not like as if directories are loaded in memory -- files are loaded in memory -- and the directory path is more like a label for a given file. If that's the case, no matter which order we specify for partitioning , it shouldn't matter. This is especially more evident in HDInsight where the underlying file system (BLOBs) has no concept of directory.
Although you're right about directories being logical constructs, if you consider the amount of metadata your HiveServer2 has to get and sift through in order to execute an average query, the order does matter. If a query contains ...WHERE department='IT'...
, and the partitions are laid out as you show, given 100 departments total, the partition pruning mechanism will be able to eliminate 99 subdirectories from the tree right away. But if the order of partition columns is reversed, the same query will need to retrieve metadata for (30 days x 12 month x N years) partitions from Hive MetaStore, just to figure out whether partition /department=IT
actually exists in all of them. So the order of partitions can be decided by analyzing predominant query patterns.
Another common factor to consider is devops/maintenance related, especially if data is loaded into a table incrementally. If one needs to backoff/recover from unsuccessful load, will he need to drop a partition (day=08
) in each department subtree individually, or can all department data be cleared at once by dropping partition (day=08
)?