I have a problem with a Hadoop data set being split into too many data blocks. Given an already existing Hadoop data set, is there a way to combine its blocks into fewer but larger ones?
Is there a way to give pig or hadoop-streaming.jar (Cloudera) an upper limit on the number of blocks they split the output into?
If you want a higher block size, set the desired block size for the corresponding job only, inside the Pig script:
set dfs.block.size 134217728;
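For context, here is a minimal sketch of how that SET statement sits in a Pig script; the paths, the alias and the storage function are hypothetical placeholders, not anything from the original post:

-- applies only to the jobs launched by this script; output is written in 128 MB blocks
set dfs.block.size 134217728;

data = LOAD '/path/to/input' USING PigStorage();
-- ... your transformations here ...
STORE data INTO '/path/to/output' USING PigStorage();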
Alternatively, you can also increase the minimum split size, because the split size is calculated with the formula
max(minsplitsize, min(maxsplitsize, blocksize))
set mapred.min.split.size 67108864;
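As a worked example (the 32 MB block size is hypothetical, and maxsplitsize is assumed to be left at its very large default): if the data was written with 32 MB (33554432) blocks and you set mapred.min.split.size to 67108864 as above, the split size becomes max(67108864, min(default, 33554432)) = 67108864, so each split spans roughly two blocks and the job runs with about half as many mappers.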
Keep in mind that the split size is determined by the minsplitsize, maxsplitsize and blocksize parameters only.
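For the hadoop-streaming.jar case from the question, the same properties can be passed as -D generic options on the command line. This is only a sketch: the jar location, input/output paths and the identity mapper/reducer are placeholders for whatever you actually run.

# the -D generic options must come before the streaming-specific options
hadoop jar /path/to/hadoop-streaming.jar \
    -D dfs.block.size=134217728 \
    -D mapred.min.split.size=134217728 \
    -input /path/to/input \
    -output /path/to/output \
    -mapper /bin/cat \
    -reducer /bin/cat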