In order to speed up ETL queries on large tables, we run many analyze
queries on these tables and date columns in the evening.
but these analyze
queries on columns take lot of memory and time.
we are using tez.
is there any way to optimize analyze
query also like some set commands.
If you are loading tables using insert overwrite then statistics can be gathered automatically by setting hive.stats.autogather=true
during insert overwrite queries.
If the table is partitioned and partitions are being loaded incrementally, then you can analyze only last partitions.
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
See examples here: https://cwiki.apache.org/confluence/display/Hive/StatsDev
For ORC files it's possible to specify hive.stats.gather.num.threads
to incraase parallelism.
See full list of statistic settings here: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics