hadoop, apache-pig, hadoop-streaming, vowpal-wabbit

Limit the number of files (blocks) in a Hadoop data set?


I have a problem with a Hadoop data set being split into too many data blocks.

  1. Given an existing Hadoop data set, is there a way to combine its blocks into fewer but larger blocks?

  2. Is there a way to give Pig or hadoop-streaming.jar (Cloudera) an upper limit on the number of blocks they split the output into?


Solution

    1. If you want a larger block size, set the desired value in the Pig script; it then applies only to that job:

      set dfs.block.size 134217728;

    Alternatively, you can increase the minimum split size, because the split size is calculated from the formula (see the worked example below)

    max(minsplitsize, min(maxsplitsize, blocksize))
    
    set mapred.min.split.size 67108864;
    
    2. Directly restricting the number of blocks (splits) created is not possible; it can only be controlled indirectly through the minsplitsize, maxsplitsize and blocksize parameters. The same properties can also be passed to hadoop-streaming on the command line, as sketched below.
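
To see how the formula plays out with the values above (and assuming the default maxsplitsize, which is effectively unbounded), the split size simply ends up equal to the block size:

    max(67108864, min(Long.MAX_VALUE, 134217728)) = 134217728   (128 MB splits)

If you want splits larger than a block, raise the minimum instead, e.g. a 256 MB minimum split with a 128 MB block size:

    max(268435456, min(Long.MAX_VALUE, 134217728)) = 268435456   (256 MB splits)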
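
For hadoop-streaming there is no script-level `set` statement, but the same Hadoop properties can normally be passed as generic -D options before the streaming-specific options. This is only a minimal sketch, assuming the old (pre-YARN) property names used above; the jar path, input/output paths, and the cat mapper/reducer are placeholders:

    # placeholder paths and identity mapper/reducer; -D options must come first
    hadoop jar /path/to/hadoop-streaming.jar \
        -D dfs.block.size=134217728 \
        -D mapred.min.split.size=67108864 \
        -input /path/to/input \
        -output /path/to/output \
        -mapper /bin/cat \
        -reducer /bin/cat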