After reading about query optimization techniques I came to know about the below techniques.
1. Indexing - bitmap and BTree
2. Partitioning
3. Bucketing
I got the difference between partitioning and bucketing, and when to use them but I'm still confused how indexes actually work. Where is the metadata for index is stored? Is it the namenode which is storing it? I.e., actually while creating partitions or buckets we can see multiple directories in hdfs which explains the query performance optimization but how to visualize indexes? Are they really used in real life despite partitioning and bucketing being in the picture?
Please help me for the above queries and is there's any dedicated page for hadoop and hive developers community?
But there are light-weight indexes in ORC (well, not actually classic indexes but min, max and Bloom filter, it helps to prune stripes). ORC indexes and bloom filters are efficient if the data is sorted during insert (distribute+sort)
Partitioning is the most efficient if partitioning schema corresponds to how the table is being filtered or how is it being loaded (allows to load partitions in parallel, if the increment data is the whole partition it works efficiently).
Bucketing can help with optimizing joins and group by but sort-merge-bucket-mapjoin has serious restrictions making it also not efficient. Both tables should have the same bucketing schema, which in real life is rare or can be extremely inefficient. Also data should be sorted when loading buckets.
Consider using ORC with built-in indexes and Bloom filters, keep less number of files in your table to avoid metadata overload and avoid mappers copying thousands of files. Read this partitions in hive interview questions and this Sorted Table in Hive
Useful links.
Official documentation: LanguageManual
Cloudera community: https://community.cloudera.com/