In our cluster, dfs.block.size is configured as 128M, but I have seen quite a few files that are 68.8M, which strikes me as a weird size. I am confused about how exactly this configuration option affects what files look like on HDFS.
The situations I have read about don't really match mine, though, so my confusion remains. I hope someone can give me some insight on this. Thanks a lot in advance.
Files can be smaller than an HDFS block; in that case the file does not occupy the whole block's worth of space on the filesystem, only its actual size. Read this answer: https://stackoverflow.com/a/14109147/2700344
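If you want to check one of those 68.8M files yourself, you can list its size and the space it actually consumes from the Hive CLI (or beeline). The warehouse path below is only a placeholder; point it at the directory where the files live. On recent Hadoop versions the second column of the du output is the disk space consumed across all replicas, which will be roughly the file size times the replication factor rather than a multiple of 128M:

-- placeholder path, replace with the actual table/partition directory
dfs -du -h /user/hive/warehouse/mydb.db/mytable/event_date=2020-01-01;
dfs -ls /user/hive/warehouse/mydb.db/mytable/event_date=2020-01-01;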
If you are using Hive with dynamic partition loads, small files are often produced by reducers that are each writing many partitions.
insert overwrite table mytable partition(event_date)
select col1, col2, event_date
from some_table;
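As a side note, a dynamic partition insert like the one above normally requires dynamic partitioning to be enabled; the two settings below are just a reminder, your cluster may already have them configured:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;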
For example, if you run the above command with a total of 200 reducers on the last step and 20 different event_date partitions, then each reducer will create a file in every partition, resulting in 200 x 20 = 4000 files.
Why does this happen? Because the data is distributed randomly between the reducers, each reducer receives data for all partitions and therefore creates files in every one of them.
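If you want to see how many files a partition actually ended up with, describe formatted on the partition reports numFiles and totalSize (provided statistics are gathered); the partition value below is only an illustrative example:

-- numFiles and totalSize show up in the detailed partition information
describe formatted mytable partition (event_date='2020-01-01');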
If you add a distribute by clause on the partition key
insert overwrite table mytable partition(event_date)
select col1, col2, event_date
from some_table
distribute by event_date;
Then the preceding mapper step will group the data according to the distribute by clause, each reducer will receive whole partitions, and it will create a single file in each partition folder.
You may add something else to the distribute by to create more files per partition (and run more reducers for better parallelism), as sketched below. Read these related answers: https://stackoverflow.com/a/59890609/2700344, https://stackoverflow.com/a/38475807/2700344, Specify minimum number of generated files from Hive insert
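For example, a common variant is to distribute by the partition key plus a bounded expression. This is only a sketch: using col1 as the extra expression and a bucket count of 10 are arbitrary choices for illustration, pick a column and count that suit your data:

insert overwrite table mytable partition(event_date)
select col1, col2, event_date
from some_table
-- pmod(hash(col1), 10) spreads each partition's rows across up to 10 reducers,
-- so each partition folder gets at most 10 files
distribute by event_date, pmod(hash(col1), 10);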