I'm new to my role, and part of it requires creating and inserting data into both managed and external Hive tables. We run a few lines of SET parameters at the beginning of each Hive session, but I've run into cases where the output files are merged for some partitions (a small number of files) but not for others (many smaller files), seemingly on random days.
My question is: when is it necessary to enter all of my Hive SET parameters? Does it need to be done for every single insert/command/statement I run, or just once at the beginning of the session, right after launching Hive?
These are the standard set parameters we've been using:
SET mapred.job.queue.name=yometrics;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=2000;
SET hive.merge.tezfiles=true;
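You can put the configuration at the beginning of the script file. A SET statement stays in effect for the whole session (until you change the value again), so you do not need to repeat it before every statement.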
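If you are unsure whether a setting is still active later in the session, running SET with only the property name prints its current value. A minimal sketch, using one of the properties from the question:

```sql
-- Set once at the start of the session
SET hive.merge.tezfiles=true;

-- Later in the same session: SET with no value prints the current setting
SET hive.merge.tezfiles;
```

Running this check right before an insert that produced many small files could help confirm whether the setting was actually in effect in that particular session.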
Alternatively, you can put the common parameters in a separate file, params.hql, and call
source /local/path/to/the/file/params.hql
at the beginning of each script.
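As a rough sketch of that layout, assuming the scripts are run with the Hive CLI: the SET lines are copied from the question, while the script name daily_load.hql and the table and column names are hypothetical placeholders.

```sql
-- params.hql: shared session settings (taken from the question)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.merge.tezfiles=true;

-- daily_load.hql: a hypothetical script that loads the shared settings first
source /local/path/to/the/file/params.hql;

-- Hypothetical table and column names, for illustration only
INSERT OVERWRITE TABLE metrics_daily PARTITION (dt)
SELECT user_id, event_count, dt
FROM metrics_staging;
```

Keeping the settings in one file means every script starts from the same configuration, which makes it less likely that a session runs without the merge settings.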
You can also put them in hive-site.xml, so that they become the defaults for every session.
If you are on Qubole/AWS, you can also use a bootstrap script for the same purpose: https://docs.qubole.com/en/latest/user-guide/hive/bootstrap-script.html