apachehivecombiners

How does combineInputFormat works in Hive?


I have a Hive table with following properties

Follows the table parameters from "DESCRIBE FORMATTED" command

Table Parameters:

    COLUMN_STATS_ACCURATE   true
    numFiles                50
    totalSize               170774650

I am performing a count(*) operation on this table and it is running with

The max split size for both the Hive sessions is 256MB

I wanted to know how the combine input format works?

On a single machine, the data is clubbed together since all the files/blocks were on the same machine and since the total size of the files combined together is less than max split size, a single split and hence a single mapper is called for.

In the other case, AWS cluster resulted in 4 mappers. I read that CombineInputFormat employs rack/machine locality but precisely how?

Thanks for all your answers in advance.


Solution

  • Ok! No reply!!! I figured it out over time and was visiting my Stack Overflow account today and found this unlucky question sitting unanswered. So follows the details.

    Splits are constructed from the files under the input paths. A split cannot have files from different pools. Each split returned may contain blocks from different files. If a maxSplitSize is specified, then blocks on the same node are combined to form a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class is similar to the default splitting behavior in Hadoop: each block is a locally processed split. Subclasses implement InputFormat.createRecordReader(InputSplit, TaskAttemptContext) to construct RecordReader's for CombineFileSplit's.

    Hope it helps some one having a similar question!