Tags: hadoop, hive, hdfs, hiveql, hadoop-partitioning

Best option for merging multiple files within the same partition in Hadoop?


I have a table partitioned on event_date, and for some reason, when I inserted data into my external table, some dates ended up with only one or two files, while others ended up with over 200.

I always use this snippet of code when kicking off the Hive queries that insert the data, so I'm not sure where or how it went out of whack for some dates but not others. I thought the 'hive.merge.tezfiles' line specifically was what handled the file merging upon insert.

SET mapred.job.queue.name=my_directory;
use this_directory;
-- allow dynamic partitioning on insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- raise the dynamic partition limits
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=2000;
-- merge small files at the end of the Tez DAG
SET hive.merge.tezfiles=true;
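
For context, the inserts themselves look something like this (the table and column names here are made up for illustration, they are not from my actual job):

INSERT OVERWRITE TABLE my_external_table PARTITION (event_date)  -- hypothetical names
SELECT col_a, col_b, event_date
FROM staging_table;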

Everything I've found online mentions having to copy the files locally and upload them again.

Is there a way to merge the multiple files within each date partition in a clean and simple way?

I have tried the following on a few dates which had 4 and 15 files, respectively. The Hive output after it ran confirmed that the extraneous files had been deleted, but when I went back and looked in Hadoop, there were just as many files as when I started. Luckily the data was still accurate when I checked it, so I'm not sure what it actually deleted. Is this not the right command to be using at all?

alter table table_being_edited PARTITION(event_dt='2017-01-01') CONCATENATE;  

Here is one such line where it confirmed the extra files had been removed:

Moved: 'my_hdfs_filepath/event_dt=2019-10-24/000052_0' to trash at: my_trash_directory/.Trash/Current
OK
Time taken: 75.321 seconds

For the date that had 15 files, it gave me a similar output 15x.

I'm hoping to narrow the dates with many files down to just one or two files each, if at all possible, as we are running out of namespace. I'm very new to all of this, so is there any bare-bones, simple way of merging files within a single date partition?


Solution

  • By adding this line in addition to my other SET Hive parameters, I was able to consistently merge part files into a single file of 5 gigabytes or less upon inserting them into a new table:

    set hive.merge.smallfiles.avgsize=5000000000;
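
    A minimal sketch of that full flow, assuming the dynamic-partition SET parameters from the question are also in effect (the merged table name here is hypothetical):

    SET hive.merge.smallfiles.avgsize=5000000000;
    -- hypothetical target table, created with the same schema and partitioning
    CREATE TABLE table_being_edited_merged LIKE table_being_edited;
    -- SELECT * on a partitioned table returns the partition column last,
    -- which is the order the dynamic partition insert expects
    INSERT OVERWRITE TABLE table_being_edited_merged PARTITION (event_dt)
    SELECT * FROM table_being_edited;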
    

    It is also possible to use getmerge and then put the files back, but this requires the extra step of pulling the files down locally (you must have enough local storage, depending on the size of your files), which was more cumbersome than creating a new table and inserting with this additional SET parameter.
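
    A rough sketch of that route, done one partition at a time (the local file name is arbitrary):

    # pull every part file in the partition down into a single local file
    hdfs dfs -getmerge my_hdfs_filepath/event_dt=2017-01-01 ./event_dt_2017-01-01_merged
    # remove the original part files, then upload the merged file in their place
    hdfs dfs -rm my_hdfs_filepath/event_dt=2017-01-01/*
    hdfs dfs -put ./event_dt_2017-01-01_merged my_hdfs_filepath/event_dt=2017-01-01/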

    The other alternative was using

    set hive.merge.mapfiles=true;
    

    which merges small files at the end of a map-only job. That parameter applies to the MapReduce execution engine, though; since these queries run on Tez, the hive.merge.tezfiles option is the more suitable one here.