Tags: hive, mapreduce, concatenation, orc, apache-tez

Hive ALTER TABLE CONCATENATE command risks


I have been using the Tez engine to run MapReduce jobs. I have an MR job which takes ages to run: I noticed I have over 20k files with 1 stripe each, and Tez does not distribute mappers evenly based on the number of files, but rather on the number of stripes. So I can end up with some mappers handling 1 file with a lot of stripes, and other mappers processing 15k files but with the same total number of stripes as the first one.
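As background, Tez groups input splits by total size (which for ORC tracks stripes), not by file count. These grouping settings exist and can be tuned, though they don't change the underlying stripe-based accounting (values below are illustrative, not recommendations):

    -- Illustrative values only; tune for your cluster.
    SET tez.grouping.min-size=16777216;    -- lower bound on a grouped split's size (16 MB)
    SET tez.grouping.max-size=1073741824;  -- upper bound on a grouped split's size (1 GB)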

As a workaround test, I used ALTER TABLE table PARTITION (...) CONCATENATE to bring the number of files down and get a more even distribution of stripes per file, and now the map job runs perfectly fine.
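For reference, this is the shape of the command I ran (database, table, and partition spec are placeholders, not my real names):

    -- Placeholder names; CONCATENATE merges small ORC files of the partition in place.
    ALTER TABLE mydb.mytable PARTITION (dt = '2018-08-16') CONCATENATE;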

My concern is that I didn't find anything in the documentation about whether there is any risk of losing data when running this command, since it works on the same files.

I'm trying to assess whether it is better to use concatenate to bring down the number of files before the MR job, or to use bucketing, which reads the files and writes the bucketed output to a separate location, so in case of failure I don't lose the source data (see the sketch below).
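A rough sketch of the bucketing alternative I have in mind, with made-up names and an arbitrary bucket count:

    -- Hypothetical schema; pick a bucket count that suits your data volume.
    CREATE TABLE mydb.mytable_bucketed (
        id      BIGINT,
        payload STRING
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (id) INTO 32 BUCKETS
    STORED AS ORC;

    -- Older Hive versions may need: SET hive.enforce.bucketing = true;
    -- The source partition is only read, never modified.
    INSERT OVERWRITE TABLE mydb.mytable_bucketed PARTITION (dt = '2018-08-16')
    SELECT id, payload
    FROM mydb.mytable
    WHERE dt = '2018-08-16';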

Concatenate takes about 1 minute per partition, whereas bucketing takes more time but doesn't risk losing the source data.

My question: is there any risk of data loss when running the concatenate command?

thanks!


Solution

  • It should be as safe as rewriting the table from a query. It uses the same mechanism: the result is prepared in a staging directory first, and only after that is the staging output moved to the table or partition location.

    Concatenation runs as a separate MR job: it prepares the concatenated files in a staging directory and, only if everything finished without errors, moves them to the table location. You should see something like this in the logs:

    INFO  : Loading data to table dbname.tblName partition (bla bla) from /apps/hive/warehouse/dbname.db/tblName/bla bla partition path/.hive-staging_hive_2018-08-16_21-28-01_294_168641035365555493-149145/-ext-10000
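    If you want extra reassurance, a simple sanity check is to compare row counts before and after concatenating (the partition predicate below is a placeholder):

    -- Run before and after CONCATENATE; the counts should match.
    SELECT COUNT(*) FROM mydb.mytable WHERE dt = '2018-08-16';

    -- From the Hive CLI / Beeline you can also watch the file count drop
    -- (path is hypothetical; use your partition's actual location):
    dfs -ls /apps/hive/warehouse/mydb.db/mytable/dt=2018-08-16;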